* Wrap ck host utitlies in CK namespace. The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace. Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace. There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library. * Add using declarations to test code. After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace. * Add using declarations to client examples.
Binary Elementwise Operations with Broadcasting
This example demonstrates a generic binary elementwise operation, a fundamental building block in numerical computing. It covers two important cases:
- Simple Elementwise: Applying a binary function to two input tensors of the same shape.
- Elementwise with Broadcasting: Applying a binary function to two input tensors of different but compatible shapes.
Broadcasting defines a set of rules for applying elementwise operations on tensors of different sizes, and it is a cornerstone of libraries like NumPy and TensorFlow.
Mathematical Formulation
Simple Elementwise
Given two input tensors, A and B, of the same rank and dimensions, and a binary operator \odot, the operation computes an output tensor C where each element is:
C_{i,j,k,\dots} = A_{i,j,k,\dots} \odot B_{i,j,k,\dots}
Elementwise with Broadcasting
Broadcasting allows elementwise operations on tensors with different shapes, provided they are compatible. Two dimensions are compatible if they are equal, or if one of them is 1. The operation implicitly "stretches" or "duplicates" the tensor with the dimension of size 1 to match the other tensor's shape.
For example, adding a bias vector B of shape (1, N) to a matrix A of shape (M, N):
C_{i,j} = A_{i,j} + B_{0,j}
Here, the single row of B is broadcast across all M rows of A. The output tensor C has the shape (M, N).
Common binary elementwise operations include addition, subtraction, multiplication (Hadamard product), division, max, and min.
Algorithmic Strategy: Grid-Stride Loop with Broadcasting
The implementation for both cases relies on the efficient grid-stride loop, which is adapted to handle broadcasting.
-
Grid Partitioning: The problem is mapped to a 1D grid of threads based on the number of elements in the output tensor.
-
Grid-Stride Loop: Each thread iterates through a subset of the output elements. For each output index, it must calculate the corresponding indices into the input tensors A and B.
-
Broadcasting Logic:
- The core of the broadcasting logic lies in the
get_broadcast_coordfunction. If an input tensor's dimension is 1, the coordinate for that dimension is always set to 0, effectively reusing the same element across the broadcast dimension. If the dimension matches the output, the coordinate is passed through. - This strategy ensures that memory accesses to the larger tensor remain coalesced, while accesses to the smaller, broadcasted tensor will naturally involve re-reading the same values, which is efficiently handled by the GPU's cache hierarchy.
Like the simple case, broadcasted elementwise operations are almost always memory-bandwidth-bound.
Source Code Organization
This example contains multiple files to demonstrate different scenarios:
binary_elementwise_xdl.cpp: Demonstrates the simple case where both input tensors have the same shape.broadcast_add_2d_amn_bn.cpp: A specific example of broadcasting, adding a tensor of shape(B, N)to a tensor of shape(A, M, N).../../include/ck/tensor_operation/gpu/device/device_elementwise.hpp: The high-level device interface. It is generic enough to handle both simple and broadcasted operations by correctly interpreting the tensor descriptors, which contain shape and stride information.../../include/ck/tensor_operation/gpu/grid/gridwise_elementwise.hpp: The grid-wise kernel that implements the grid-stride loop. The tensor coordinate logic within this kernel correctly handles broadcasting based on the provided tensor descriptors.../../include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp: Defines the various binary operator functors (likeAdd,Multiply, etc.).
Build and Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build the Example
cd /path/to/composable_kernel/example/19_binary_elementwise
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
# Run the simple elementwise example
./binary_elementwise_xdl 1 2 1
# Run the broadcasting example
./broadcast_add_2d_amn_bn 1 2 1
Applications
Broadcasting is a powerful feature that makes code more concise and memory-efficient.
- Adding Bias: The most common use case in deep learning is adding a bias vector (shape
[N]) to a matrix of activations (shape[Batch, N]). - Feature Scaling: Multiplying a feature map (shape
[N, C, H, W]) by a per-channel scaling factor (shape[1, C, 1, 1]). - Standardization: In data preprocessing, subtracting the mean (a vector) and dividing by the standard deviation (another vector) from a data matrix.
- Coordinate Grids: Creating coordinate grids by adding a row vector
[0, 1, 2...]to a column vector[0, 1, 2...]^T.