Single-threaded Version#
Element-wise addition:
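As a baseline, a single-threaded version is just a loop over the elements (a minimal sketch; the helper name add_vectors_single_threaded is illustrative):
def add_vectors_single_threaded(a: list[float], b: list[float]) -> list[float]:
    # One element at a time, on a single CPU thread.
    assert len(a) == len(b)
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out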
Triton Implementation#
In Triton, the vector addition kernel achieves efficient vector addition by dividing the vectors into multiple blocks and processing them in parallel, one block per program in the grid. Each program loads the corresponding elements from the two input vectors, adds them, and stores the result.
Core Steps#
- Parallel Computation: Each program in the grid independently processes one block of elements in the vector.
- Load Elements: Each program loads the corresponding elements from vector A and vector B.
- Element Addition: The loaded elements are added together.
- Store Results: The summed results are stored in the output vector.
Usage of tl.constexpr#
tl.constexpr is used to declare compile-time constants. The value of a variable with this modifier is determined at compile time rather than at runtime, so the compiler can perform more aggressive optimizations based on these constant values and improve the execution efficiency of the kernel.
@triton.jit
def kernel_vector_addition(a_ptr, b_ptr, out_ptr,
                           num_elems: tl.constexpr,
                           block_size: tl.constexpr):
    # Kernel code
In the above code, num_elems and block_size are declared as compile-time constants, allowing Triton to optimize the kernel code during the compilation phase.
Determining Current Block and Program ID#
Each program (thread block) launched in Triton has a unique program ID that identifies which block of data it is responsible for. Using tl.program_id, we can determine which block the current program is processing and compute the offset of the data it handles.
pid = tl.program_id(axis=0)
block_start = pid * block_size
Handling the Last Block#
Since the vector length may not be divisible by the block size, the last block may only require a portion of the threads to work. By using masking operations, we can ensure that only valid threads perform calculations, avoiding invalid memory accesses and computations.
Role of the Mask#
Triton provides masking operations to disable the lanes that have no work to do (the out-of-range positions in the last block).
mask = thread_offsets < num_elems
a_pointers = tl.load(a_ptr + thread_offsets, mask=mask, other=0.0)
b_pointers = tl.load(b_ptr + thread_offsets, mask=mask, other=0.0)
Role of the ceil_div Function#
The ceil_div function is used to calculate the number of blocks, ensuring that all elements are covered even if the vector length is not divisible by the block size. For example, with vec_size=10 and block_size=3, ceil_div(10, 3) = 4, ensuring that all 10 elements are processed.
def ceil_div(x: int, y: int) -> int:
return (x + y - 1) // y
In simple terms, the function efficiently implements "rounding up."
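As a quick sanity check (assuming the ceil_div defined above), the result matches math.ceil applied to ordinary division:
import math

assert ceil_div(10, 3) == math.ceil(10 / 3) == 4  # the last block handles only one element
assert ceil_div(12, 3) == 4                        # exact division adds no extra block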
Numerical Precision Verification#
After implementing the vector addition kernel, verifying numerical precision is a key step to ensure the correctness of the kernel. By comparing it with PyTorch's built-in addition operation, we can confirm the accuracy of the Triton implementation.
def verify_numerics() -> bool:
    torch.manual_seed(2020)  # seed both cpu and gpu
    vec_size = 8192
    a = torch.rand(vec_size, device='cuda')
    b = torch.rand_like(a)
    torch_res = a + b
    triton_res = vector_addition(a, b)
    fidelity_correct = torch.allclose(torch_res, triton_res)
    print(f"{fidelity_correct=}")
    return fidelity_correct
Verification shows that our Triton implementation is consistent with PyTorch's native numerical precision, allowing us to proceed with further operations.
Here is the complete Kernel implementation:
import torch
import triton
import triton.language as tl

@triton.jit
def kernel_vector_addition(a_ptr, b_ptr, out_ptr,
                           num_elems: tl.constexpr,
                           block_size: tl.constexpr,):
    pid = tl.program_id(axis=0)
    # tl.device_print("pid", pid)
    block_start = pid * block_size  # e.g. with block_size=2: 0 * 2 = 0, 1 * 2 = 2, ...
    thread_offsets = block_start + tl.arange(0, block_size)
    mask = thread_offsets < num_elems
    a_pointers = tl.load(a_ptr + thread_offsets, mask=mask)
    b_pointers = tl.load(b_ptr + thread_offsets, mask=mask)
    res = a_pointers + b_pointers
    tl.store(out_ptr + thread_offsets, res, mask=mask)

def ceil_div(x: int, y: int) -> int:
    return (x + y - 1) // y

def vector_addition(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    output_buffer = torch.empty_like(a)
    assert a.is_cuda and b.is_cuda
    num_elems = a.numel()
    assert num_elems == b.numel()  # todo - handle mismatched sizes
    block_size = 1024
    grid_size = ceil_div(num_elems, block_size)
    grid = (grid_size,)
    num_warps = 8
    k2 = kernel_vector_addition[grid](a, b, output_buffer,
                                      num_elems,
                                      block_size,
                                      num_warps=num_warps
                                      )
    return output_buffer
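With everything in place, the kernel can be exercised by a short driver (a minimal sketch; it simply calls the verify_numerics function defined earlier):
if __name__ == "__main__":
    # Runs the correctness check against PyTorch; prints fidelity_correct=True on success.
    verify_numerics()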
Benchmarking and Performance Tuning#
To evaluate the performance of the Triton vector addition kernel, we will conduct benchmarking and discuss methods for performance tuning.
Introduction to Benchmark API#
Triton provides a rich set of benchmarking APIs that allow users to measure the execution time and throughput of kernels. The following code obtains a performance report using triton.testing.perf_report:
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],  # Parameter name for the x-axis of the chart
        x_vals=[2**i for i in range(10, 28)],  # Possible values for `x_name`
        x_log=True,  # Use logarithmic scale for the x-axis
        line_arg='provider',  # Parameter name corresponding to different lines in the chart
        line_vals=['triton', 'torch'],  # Possible values for `line_arg`
        line_names=["Triton", "Torch"],  # Labels for the lines
        styles=[('blue', '-'), ('green', '-')],  # Line colors and styles
        ylabel='GB/s',  # y-axis label
        plot_name='vector-add-performance',  # Chart name, also used as the filename for saving
        args={},  # Values for function arguments not in `x_names` and `y_name`
    )
)
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    quantiles = [0.5, 0.2, 0.8]  # median, 20th and 80th percentile
    # Choose the computation implementation based on provider
    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)
    if provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: vector_addition(x, y), quantiles=quantiles)
    # Convert milliseconds to GB/s: 3 tensors are touched (2 reads + 1 write), 4 bytes per float32 element
    def gbps(ms):
        return 12 * size / ms * 1e-06
    # Return GB/s corresponding to the median, maximum, and minimum times
    return gbps(ms), gbps(max_ms), gbps(min_ms)
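The decorated function is then executed through the run method of the report object, which sweeps over all x_vals, prints the data table, and renders the plot:
# Run the benchmark across all vector sizes, print the results table, and show the plot.
benchmark.run(print_data=True, show_plots=True)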
Performance report and comparison: the resulting vector-add-performance chart plots the achieved bandwidth (GB/s) of the Triton and Torch implementations across the vector sizes above.
Overall, vector addition is a relatively simple kernel, so it is hard for a Triton implementation to gain an advantage here compared with more complex kernels; the most commonly used PyTorch operations have already been optimized to a high degree through CUDA/cuBLAS and similar libraries.
Tuning Parameters: Num Warps & Block Size#
The key to optimizing kernel performance lies in configuring the number of warps and the block size appropriately. A warp is the basic execution unit of the GPU (32 threads on NVIDIA hardware); reasonable choices for the warp count and block size let the kernel fully exploit the GPU's parallel computing capability.
block_size = 1024 # Determines the number of elements processed by each thread block; a larger block size can reduce the number of blocks but may increase the computational burden of each block.
grid_size = ceil_div(num_elems, block_size)
grid = (grid_size,)
num_warps = 8 # The number of warps included in each block; reasonable configuration of warp numbers can optimize thread scheduling and resource utilization.
The previous section (Softmax in OpenAI Triton) provided a way to dynamically adjust parameters through the driver:
# Compute block_size: the smallest power of 2 greater than or equal to cols
block_size = triton.next_power_of_2(cols)
# Dynamically adjust num_warps based on block_size
num_warps = 4  # Each warp has 32 threads
if block_size > 2047:
    num_warps = 8
if block_size > 4095:
    num_warps = 16
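Beyond such hand-written heuristics, Triton also provides a built-in autotuner, triton.autotune, which benchmarks a list of candidate configurations and caches the best one per problem size. A minimal sketch for the vector addition kernel (the particular configs are illustrative, not tuned values):
@triton.autotune(
    configs=[
        triton.Config({'block_size': 512}, num_warps=4),
        triton.Config({'block_size': 1024}, num_warps=8),
        triton.Config({'block_size': 2048}, num_warps=8),
    ],
    key=['num_elems'],  # re-tune whenever the problem size changes
)
@triton.jit
def kernel_vector_addition_tuned(a_ptr, b_ptr, out_ptr,
                                 num_elems,
                                 block_size: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * block_size + tl.arange(0, block_size)
    mask = offsets < num_elems
    a = tl.load(a_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, a + b, mask=mask)

# block_size and num_warps are now chosen by the autotuner, so they are not passed
# at launch time; the grid can depend on the selected block_size via a lambda, e.g.:
# grid = lambda meta: (ceil_div(num_elems, meta['block_size']),)
# kernel_vector_addition_tuned[grid](a, b, output_buffer, num_elems)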
References#
Thanks to:
- The content of this article is primarily based on the video tutorial series SOTA Deep Learning Tutorials - YouTube
- Triton’s documentation
- o1-preview language model, an expert in drawing SVGs.