Notes on CUDA Grid, Block, and Thread Dimensions

December 17, 2022 · 2 min read

Thread: The smallest unit of execution.
Block: A group of threads, arranged in up to three dimensions.
Grid: A group of blocks, also arranged in up to three dimensions.

There are limits on the number of threads per block and on the dimensions of blocks and grids. They can be read with deviceQuery, one of the CUDA sample programs.

On an RTX 3080 with CUDA 11.8, deviceQuery reported the following:

Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size   (x,y,z): (2147483647, 65535, 65535)
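
The same three limits can also be read at run time through the CUDA runtime API. The following is a minimal sketch, assuming the GPU of interest is device 0 (the device index is an assumption, not something stated above). It uses cudaGetDeviceProperties and the maxThreadsPerBlock, maxThreadsDim, and maxGridSize fields of cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: query device 0; a multi-GPU machine may need another index.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Total threads allowed in one block.
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    // Per-dimension limits on the block shape.
    printf("Max block dims (x,y,z): (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    // Per-dimension limits on the grid shape.
    printf("Max grid dims  (x,y,z): (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

Compiled with nvcc, this should print the same numbers that deviceQuery shows for the selected device.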

In this case:

The maximum number of threads per block is 1024.
The block shape must fit within (1024, 1024, 64). For a block shape (a, b, c), this means $a \cdot b \cdot c \leq 1024$, $a \leq 1024$, $b \leq 1024$, and $c \leq 64$. The product condition is the stricter one: (32, 32, 2) stays within every per-dimension limit but has 2048 threads, so it is not allowed.
The grid shape must fit within (2147483647, 65535, 65535). For a grid shape (d, e, f), this means $d \leq 2147483647$, $e \leq 65535$, and $f \leq 65535$. deviceQuery lists no separate cap on the total number of blocks, so $d \cdot e \cdot f \leq 2147483647 \times 65535 \times 65535$ follows from the per-dimension limits.
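
The block-shape conditions can be checked in code before launching. The sketch below hard-codes the RTX 3080 limits quoted above, which is an assumption made for illustration; real code would take them from cudaGetDeviceProperties. The names blockShapeFits and noop are hypothetical. It compares the prediction with what the runtime reports through cudaGetLastError.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: limits copied from the deviceQuery output quoted above.
const unsigned int kMaxThreadsPerBlock = 1024;
const unsigned int kMaxBlockDim[3] = {1024, 1024, 64};

// Hypothetical helper: true if a block of shape b satisfies all conditions.
bool blockShapeFits(dim3 b) {
    // Every dimension must be at least 1.
    if (b.x == 0 || b.y == 0 || b.z == 0) return false;
    // Per-dimension limits: a <= 1024, b <= 1024, c <= 64.
    if (b.x > kMaxBlockDim[0] || b.y > kMaxBlockDim[1] || b.z > kMaxBlockDim[2])
        return false;
    // Total thread count: a * b * c <= 1024 (computed in 64 bits to be safe).
    unsigned long long total = (unsigned long long)b.x * b.y * b.z;
    return total <= kMaxThreadsPerBlock;
}

__global__ void noop() {}

int main() {
    dim3 shapes[] = {dim3(4, 8, 32), dim3(32, 32, 2), dim3(1, 1, 128)};
    for (const dim3 &b : shapes) {
        noop<<<1, b>>>();
        // An invalid launch configuration is reported here, not by the launch itself.
        cudaError_t err = cudaGetLastError();
        printf("(%u, %u, %u): predicted %s, runtime says: %s\n",
               b.x, b.y, b.z,
               blockShapeFits(b) ? "ok" : "invalid",
               cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Here (4, 8, 32) has exactly 1024 threads and fits. (32, 32, 2) fails on the total thread count, and (1, 1, 128) fails on the z limit even though it has only 128 threads. On a device with these limits, the last two launches should be rejected with cudaErrorInvalidConfiguration.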

The type used to define grid and block shapes is dim3.
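
Any dim3 component that is not given defaults to 1, so 1D and 2D shapes can be written without padding. A minimal sketch follows; the kernel name show is hypothetical and only there to print what the device sees.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per block prints the block index and block shape.
__global__ void show() {
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        printf("block (%u, %u, %u) has blockDim (%u, %u, %u)\n",
               blockIdx.x, blockIdx.y, blockIdx.z,
               blockDim.x, blockDim.y, blockDim.z);
    }
}

int main() {
    dim3 grid(2);        // equivalent to dim3 grid(2, 1, 1)
    dim3 block(16, 16);  // equivalent to dim3 block(16, 16, 1)

    // The omitted components are filled in with 1 on the host side.
    printf("grid  = (%u, %u, %u)\n", grid.x, grid.y, grid.z);
    printf("block = (%u, %u, %u)\n", block.x, block.y, block.z);

    show<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait so the device-side printf is flushed
    return 0;
}
```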

Example Usage

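#include <stdio.h>

// Print the 3D index of each thread within its block.
__global__ void func1(){
   printf("%u, %u, %u\n", threadIdx.x, threadIdx.y, threadIdx.z);
}

// Print each thread's flattened (1D) index within its block; x varies fastest.
__global__ void func2(){
   int i = threadIdx.x + blockDim.x * threadIdx.y + blockDim.x * blockDim.y * threadIdx.z;
   printf("%d\n", i);
}

int main(){
   dim3 grid(1, 1, 1);    // a single block
   dim3 block(4, 8, 32);  // 4 * 8 * 32 = 1024 threads, exactly the per-block limit

   func1<<<grid, block>>>();
   func2<<<grid, block>>>();
   // Kernel launches are asynchronous; wait so the device-side printf output is flushed.
   cudaDeviceSynchronize();
   return 0;
}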
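
The formula in func2 assigns every thread in the block a distinct index from 0 to blockDim.x * blockDim.y * blockDim.z - 1. One way to confirm this without reading 1024 lines of output is to count how often each index is produced. The sketch below does that; markIndex and indicesAreUnique are hypothetical names, and error handling is kept minimal.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread increments the counter at its flattened index (same formula as func2).
__global__ void markIndex(int *hits) {
    int i = threadIdx.x + blockDim.x * threadIdx.y
          + blockDim.x * blockDim.y * threadIdx.z;
    atomicAdd(&hits[i], 1);
}

// Hypothetical helper: true if every index in [0, x*y*z) was produced exactly once.
bool indicesAreUnique(dim3 block) {
    const int n = block.x * block.y * block.z;
    std::vector<int> host(n, 0);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(dev, 0, n * sizeof(int));

    markIndex<<<1, block>>>(dev);  // one block, as in the example above

    // cudaMemcpy waits for the kernel to finish before copying.
    bool ok = cudaMemcpy(host.data(), dev, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (int k = 0; ok && k < n; ++k) ok = (host[k] == 1);
    return ok;
}

int main() {
    bool ok = indicesAreUnique(dim3(4, 8, 32));
    printf("%s\n", ok ? "each index 0..1023 used exactly once" : "check failed");
    return ok ? 0 : 1;
}
```

If the launch itself is rejected, for example because the block shape exceeds the device limits, every counter stays at 0 and the check fails.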