From 7785774fcb4fea71a4c5c0163d830d09837e4dd4 Mon Sep 17 00:00:00 2001
From: Vidyasagar
Date: Thu, 2 Oct 2025 11:33:29 -0700
Subject: [PATCH] Updates based on PR feedback 4

---
 ACRONYMS.md                                        |  5 ++-
 example/19_binary_elementwise/README.md            | 12 +++----
 .../28_grouped_gemm_bias_e_permute/README.md       |  8 ++---
 .../29_batched_gemm_bias_e_permute/README.md       | 10 +++---
 .../30_grouped_conv_fwd_multiple_d/README.md       | 16 ++++-----
 .../README.md                                      | 12 +++----
 .../README.md                                      | 10 +++---
 example/42_groupnorm_fwd/README.md                 |  2 +-
 .../43_splitk_gemm_bias_e_permute/README.md        |  6 ++--
 .../45_elementwise_normalization/README.md         | 34 +++++++++----------
 example/62_convnd_activ/README.md                  |  4 +--
 example/65_gemm_multiply_multiply/README.md        | 10 +++---
 .../66_complex_contraction_bilinear/README.md      |  2 +-
 example/ck_tile/17_grouped_gemm/README.md          |  4 +--
 14 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/ACRONYMS.md b/ACRONYMS.md
index 2e44ef020a..ed81b30751 100644
--- a/ACRONYMS.md
+++ b/ACRONYMS.md
@@ -4,9 +4,8 @@ The following acronyms are used in the Composable Kernel codebase:
 
 | Acronym | Expansion | Explanation |
 |---------|-----------|-------------|
-| B2B | Back-to-Back | Fused or sequential kernel operations (e.g., B2B GEMM) |
-| BF16 | Brain Floating Point 16 | 1 Signed bit, 8 Exponent bits, 7 Mantissa bits |
-| BF8 | 8-bit Brain Floating Point | 1 Signed bit, 3 Exponent bits, 4 Mantissa bits |
+| BF16 | Brain Floating Point 16 | 1 Signed bit, 8 Exponent bits, 7 Significand bits |
+| BF8 | 8-bit Brain Floating Point | 1 Signed bit, 3 Exponent bits, 4 Significand bits |
 | DLA | Deep Learning Accelerator | Specialized hardware for deep learning workloads |
 | DRAM | Dynamic Random-Access Memory | Main memory.
Global memory on GPU | | E2E | End-to-End | Complete pipeline or process from input to output | diff --git a/example/19_binary_elementwise/README.md b/example/19_binary_elementwise/README.md index bccd88dd25..14910673fc 100644 --- a/example/19_binary_elementwise/README.md +++ b/example/19_binary_elementwise/README.md @@ -1,10 +1,10 @@ # Binary Elementwise Operations with Broadcasting -This example demonstrates a generic binary element-wise operation, a fundamental building block in numerical computing. It covers two important cases: +This example demonstrates a generic binary elementwise operation, a fundamental building block in numerical computing. It covers two important cases: 1. **Simple Elementwise**: Applying a binary function to two input tensors of the *same* shape. 2. **Elementwise with Broadcasting**: Applying a binary function to two input tensors of *different but compatible* shapes. -Broadcasting defines a set of rules for applying element-wise operations on tensors of different sizes, and it is a cornerstone of libraries like NumPy and TensorFlow. +Broadcasting defines a set of rules for applying elementwise operations on tensors of different sizes, and it is a cornerstone of libraries like NumPy and TensorFlow. ## Mathematical Formulation @@ -14,14 +14,14 @@ Given two input tensors, A and B, of the same rank and dimensions, and a binary $C_{i,j,k,\dots} = A_{i,j,k,\dots} \odot B_{i,j,k,\dots}$ ### Elementwise with Broadcasting -Broadcasting allows element-wise operations on tensors with different shapes, provided they are compatible. Two dimensions are compatible if they are equal, or if one of them is 1. The operation implicitly "stretches" or "duplicates" the tensor with the dimension of size 1 to match the other tensor's shape. +Broadcasting allows elementwise operations on tensors with different shapes, provided they are compatible. Two dimensions are compatible if they are equal, or if one of them is 1. 
The operation implicitly "stretches" or "duplicates" the tensor with the dimension of size 1 to match the other tensor's shape. For example, adding a bias vector `B` of shape `(1, N)` to a matrix `A` of shape `(M, N)`: $C_{i,j} = A_{i,j} + B_{0,j}$ Here, the single row of `B` is broadcast across all `M` rows of `A`. The output tensor `C` has the shape `(M, N)`. -Common binary element-wise operations include addition, subtraction, multiplication (Hadamard product), division, max, and min. +Common binary elementwise operations include addition, subtraction, multiplication (Hadamard product), division, max, and min. ## Algorithmic Strategy: Grid-Stride Loop with Broadcasting @@ -50,7 +50,7 @@ The implementation for both cases relies on the efficient **grid-stride loop**, - The core of the broadcasting logic lies in the `get_broadcast_coord` function. If an input tensor's dimension is 1, the coordinate for that dimension is always set to 0, effectively reusing the same element across the broadcast dimension. If the dimension matches the output, the coordinate is passed through. - This strategy ensures that memory accesses to the larger tensor remain coalesced, while accesses to the smaller, broadcasted tensor will naturally involve re-reading the same values, which is efficiently handled by the GPU's cache hierarchy. -Like the simple case, broadcasted element-wise operations are almost always memory-bandwidth-bound. +Like the simple case, broadcasted elementwise operations are almost always memory-bandwidth-bound. 
## Source Code Organization @@ -83,7 +83,7 @@ make -j ### Run the Example ```bash -# Run the simple element-wise example +# Run the simple elementwise example ./binary_elementwise_xdl 1 2 1 # Run the broadcasting example diff --git a/example/28_grouped_gemm_bias_e_permute/README.md b/example/28_grouped_gemm_bias_e_permute/README.md index 1fb4c70a66..8cb3ceff95 100644 --- a/example/28_grouped_gemm_bias_e_permute/README.md +++ b/example/28_grouped_gemm_bias_e_permute/README.md @@ -1,6 +1,6 @@ # Grouped GEMM with Bias, Elementwise Operation, and Permutation -This example demonstrates a highly complex and specialized fusion: a **Grouped GEMM** where each individual GEMM operation is fused with a bias addition, a second element-wise operation, and a final permutation of the output. This kernel is designed to accelerate layers that have a group-parallel structure, such as depthwise separable convolutions or multi-head attention, when they are part of a larger fused computational graph. +This example demonstrates a highly complex and specialized fusion: a **Grouped GEMM** where each individual GEMM operation is fused with a bias addition, a second elementwise operation, and a final permutation of the output. This kernel is designed to accelerate layers that have a group-parallel structure, such as depthwise separable convolutions or multi-head attention, when they are part of a larger fused computational graph. ## Mathematical Formulation @@ -12,7 +12,7 @@ This operation performs `G` independent fused GEMM operations in parallel, where 2. **Bias Addition Stage**: A bias vector `D_[g]` is broadcast and added. $C_{temp2[g]} = C_{temp1[g]} + D_{[g]}$ -3. **Elementwise Stage**: A second element-wise operation is performed with tensor `E_[g]`. +3. **Elementwise Stage**: A second elementwise operation is performed with tensor `E_[g]`. $C_{temp3[g]} = C_{temp2[g]} \odot E_{[g]}$ 4. **Permutation Stage**: The final result for the group is permuted. 
@@ -31,7 +31,7 @@ The implementation combines the scheduling strategy of Grouped GEMM with the mul - Executing a standard tiled GEMM for $A_{[g]} \times B_{[g]}$, accumulating the result in registers. - Executing the fused epilogue: - Load the bias `D_[g]` and add it. - - Load the element-wise tensor `E_[g]` and apply the operation. + - Load the elementwise tensor `E_[g]` and apply the operation. - Calculate the permuted destination coordinates and write the final result to `F_[g]`. This approach maximizes parallelism at two levels: the coarse-grained parallelism across the `G` groups, and the fine-grained data parallelism within each individual GEMM operation. @@ -77,7 +77,7 @@ make -j This highly specialized kernel is valuable for optimizing specific patterns in modern neural networks: -- **Multi-Head Attention (MHA)**: The computation for each head in MHA is independent. The entire MHA block can be viewed as a Grouped GEMM where the number of groups `G` is the number of attention heads. If the Q, K, or V projections involve fusions with bias, other element-wise ops, and permutations to prepare the data for the batched GEMM, this kernel could potentially fuse a large part of that logic. +- **Multi-Head Attention (MHA)**: The computation for each head in MHA is independent. The entire MHA block can be viewed as a Grouped GEMM where the number of groups `G` is the number of attention heads. If the Q, K, or V projections involve fusions with bias, other elementwise ops, and permutations to prepare the data for the batched GEMM, this kernel could potentially fuse a large part of that logic. - **Depthwise Separable Convolutions**: The depthwise part of this convolution is a Grouped GEMM with `G` equal to the number of channels. If this is followed by a fused activation function (e.g., a gated activation) and a permutation, this kernel could be a perfect match. 
- **Mixture-of-Experts (MoE) Models**: In MoE layers, an input is routed to one of several "expert" sub-networks. If these experts have identical structure, their execution can be formulated as a Grouped GEMM, where `G` is the number of experts. Any fusions within the expert network could be captured by this kernel. diff --git a/example/29_batched_gemm_bias_e_permute/README.md b/example/29_batched_gemm_bias_e_permute/README.md index 3b57f6ac59..7b340d9203 100644 --- a/example/29_batched_gemm_bias_e_permute/README.md +++ b/example/29_batched_gemm_bias_e_permute/README.md @@ -1,6 +1,6 @@ # Batched GEMM with Bias, Elementwise Operation, and Permutation -This example demonstrates a **Batched GEMM** where each individual GEMM operation is fused with a bias addition, a second element-wise operation, and a final permutation of the output. This kernel is designed to accelerate layers that have a batch-parallel structure, such as the dense layers in a Transformer's feed-forward network, when they are part of a larger fused computational graph. +This example demonstrates a **Batched GEMM** where each individual GEMM operation is fused with a bias addition, a second elementwise operation, and a final permutation of the output. This kernel is designed to accelerate layers that have a batch-parallel structure, such as the dense layers in a Transformer's feed-forward network, when they are part of a larger fused computational graph. ## Mathematical Formulation @@ -12,7 +12,7 @@ This operation performs `B` independent fused GEMM operations in parallel, where 2. **Bias Addition Stage**: A bias vector `D_[b]` is broadcast and added. $C_{temp2[b]} = C_{temp1[b]} + D_{[b]}$ -3. **Elementwise Stage**: A second element-wise operation is performed with tensor `E_[b]`. +3. **Elementwise Stage**: A second elementwise operation is performed with tensor `E_[b]`. $C_{temp3[b]} = C_{temp2[b]} \odot E_{[b]}$ 4. **Permutation Stage**: The final result for the batch item is permuted. 
@@ -35,7 +35,7 @@ The implementation combines the scheduling strategy of Batched GEMM with the mul - Executing a standard tiled GEMM for $A_{[b]} \times B_{[b]}$, accumulating the result in registers. - Executing the fused epilogue: - Load the bias `D_[b]` and add it. - - Load the element-wise tensor `E_[b]` and apply the operation. + - Load the elementwise tensor `E_[b]` and apply the operation. - Calculate the permuted destination coordinates and write the final result to `F_[b]`. This approach is extremely efficient when the batch size `B` is large enough to saturate the GPU's parallelism. @@ -85,7 +85,7 @@ A typical Transformer FFN block is: `FFN(X) = Linear_2(ReLU(Linear_1(X)))` - `Linear_1` is a GEMM. -- `ReLU` is an element-wise activation. +- `ReLU` is an elementwise activation. - `Linear_2` is another GEMM. -Sometimes, for performance reasons (e.g., to align with a subsequent layer's expected input layout), the output of the FFN needs to be permuted. This kernel could fuse the `Linear_2` GEMM with its bias, a subsequent element-wise operation (if any), and the final permutation, all while operating on a batch of input sequences. This avoids multiple kernel launches and saves significant memory bandwidth, leading to faster model execution. +Sometimes, for performance reasons (e.g., to align with a subsequent layer's expected input layout), the output of the FFN needs to be permuted. This kernel could fuse the `Linear_2` GEMM with its bias, a subsequent elementwise operation (if any), and the final permutation, all while operating on a batch of input sequences. This avoids multiple kernel launches and saves significant memory bandwidth, leading to faster model execution.
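
The fused computation that the `29_batched_gemm_bias_e_permute` README describes (batched GEMM, broadcast bias `D`, elementwise combination with `E`, permuted write to `F`) can be sketched on the host in plain Python. This is only an illustration of the math, not Composable Kernel code: the function name `batched_gemm_bias_e_permute_ref`, the use of multiplication for the elementwise stage, and the `(M, batch, N)` output layout are all assumptions made for the example.

```python
def batched_gemm_bias_e_permute_ref(A, B_mat, D, E):
    """Host-side reference (hypothetical helper, not a CK API).

    Shapes, as nested lists:
      A:     [batch][M][K]
      B_mat: [batch][K][N]
      D:     [batch][N]     bias, broadcast over the M rows
      E:     [batch][M][N]  second elementwise operand
    Returns F in the permuted layout [M][batch][N] (assumed permutation).
    """
    batch, M, K = len(A), len(A[0]), len(A[0][0])
    N = len(B_mat[0][0])
    # Allocate the output directly in the permuted layout.
    F = [[[0.0] * N for _ in range(batch)] for _ in range(M)]
    for b in range(batch):
        for m in range(M):
            for n in range(N):
                acc = 0.0
                for k in range(K):          # GEMM stage
                    acc += A[b][m][k] * B_mat[b][k][n]
                acc += D[b][n]              # bias stage
                acc *= E[b][m][n]           # elementwise stage (assumed: multiply)
                F[m][b][n] = acc            # permuted store, no intermediate tensor
    return F
```

The intermediate value stays in the local variable `acc`, which mirrors how the fused kernel keeps the accumulator in registers instead of writing each stage back to global memory.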
diff --git a/example/30_grouped_conv_fwd_multiple_d/README.md b/example/30_grouped_conv_fwd_multiple_d/README.md index 7a5d06e8d8..4ad3edd5db 100644 --- a/example/30_grouped_conv_fwd_multiple_d/README.md +++ b/example/30_grouped_conv_fwd_multiple_d/README.md @@ -1,6 +1,6 @@ # Grouped Convolution Forward with Multiple Elementwise Inputs -This example demonstrates a **Grouped Convolution Forward Pass** fused with an element-wise operation that takes multiple auxiliary input tensors (`D` tensors). This is a powerful fusion that combines the parallel structure of grouped convolutions with the ability to merge subsequent element-wise layers, such as custom activations or residual connections, into a single kernel. +This example demonstrates a **Grouped Convolution Forward Pass** fused with an elementwise operation that takes multiple auxiliary input tensors (`D` tensors). This is a powerful fusion that combines the parallel structure of grouped convolutions with the ability to merge subsequent elementwise layers, such as custom activations or residual connections, into a single kernel. ## Mathematical Formulation @@ -9,10 +9,10 @@ This operation performs `G` independent fused convolution operations in parallel 1. **Convolution Stage**: A standard N-dimensional forward convolution is performed for the group. $C_{out[g]} = \text{Conv}(\text{In}_{[g]}, \text{W}_{[g]})$ -2. **Elementwise Stage**: The result of the convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined element-wise function `f`. +2. **Elementwise Stage**: The result of the convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined elementwise function `f`. $E_{[g]} = f(C_{out[g]}, D_{0[g]}, D_{1[g]}, \dots)$ -The key optimization is that the intermediate convolution result, $C_{out[g]}$, is never written to global memory. 
It is computed and held in registers, then immediately consumed by the element-wise part of the kernel's epilogue before the final result `E` is stored. +The key optimization is that the intermediate convolution result, $C_{out[g]}$, is never written to global memory. It is computed and held in registers, then immediately consumed by the elementwise part of the kernel's epilogue before the final result `E` is stored. ## Algorithmic Strategy: Implicit Grouped GEMM with Fused Multi-D Epilogue @@ -24,9 +24,9 @@ The implementation combines three core concepts: the implicit GEMM transformatio - Calculating the base memory addresses for the group's input tensors: $\text{In}_{[g]}, \text{W}_{[g]}, D_{0[g]}, \dots, E_{[g]}$. - Performing a tiled GEMM, where tiles of the input `In` and weights `W` are read (with the `im2col` transformation happening on-the-fly) and the result is accumulated in registers. -3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the element-wise fusion: +3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the elementwise fusion: - Threads load the corresponding tiles from the auxiliary `D` tensors for the assigned group. - - The user-defined element-wise function `f` is applied in registers to the convolution result and the `D` tensor values. + - The user-defined elementwise function `f` is applied in registers to the convolution result and the `D` tensor values. - The final result `E` for the group is written to global memory. This strategy is highly efficient as it minimizes memory bandwidth by avoiding the materialization of the intermediate convolution output and maximizes parallelism by executing all groups concurrently. @@ -82,6 +82,6 @@ Following arguments (depending on number of spatial dims): This kernel is ideal for optimizing layers in modern CNNs that use grouped convolutions followed by complex activations or residual connections. 
-- **Fused Residual Connections**: A common pattern is `Conv(x) + x`. This can be implemented by passing the input `x` as a `D` tensor and defining the element-wise function as `f(conv_out, d0) = conv_out + d0`. If this is a grouped convolution, this kernel is a perfect fit. -- **Custom Gated Activations**: Some architectures use gated activations, such as `Conv_A(x) * sigmoid(Conv_B(x))`. While this kernel doesn't compute two convolutions, it can fuse one convolution with an element-wise multiplication against another tensor. For example, it could compute `Conv_A(x) * D0`, where `D0` is the pre-computed `sigmoid(Conv_B(x))`. -- **Depthwise Separable Convolutions**: These layers consist of a depthwise convolution (a grouped convolution with `G = C`) followed by a pointwise convolution (`1x1` conv). If there is a residual connection or other element-wise operation after the depthwise stage, this kernel can fuse it directly, improving the performance of this widely used building block. +- **Fused Residual Connections**: A common pattern is `Conv(x) + x`. This can be implemented by passing the input `x` as a `D` tensor and defining the elementwise function as `f(conv_out, d0) = conv_out + d0`. If this is a grouped convolution, this kernel is a perfect fit. +- **Custom Gated Activations**: Some architectures use gated activations, such as `Conv_A(x) * sigmoid(Conv_B(x))`. While this kernel doesn't compute two convolutions, it can fuse one convolution with an elementwise multiplication against another tensor. For example, it could compute `Conv_A(x) * D0`, where `D0` is the pre-computed `sigmoid(Conv_B(x))`. +- **Depthwise Separable Convolutions**: These layers consist of a depthwise convolution (a grouped convolution with `G = C`) followed by a pointwise convolution (`1x1` conv). If there is a residual connection or other elementwise operation after the depthwise stage, this kernel can fuse it directly, improving the performance of this widely used building block. 
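
The fused epilogue `E = f(C_out, D0, D1, ...)` described in the `30_grouped_conv_fwd_multiple_d` README can be sketched as a host-side reference in plain Python. This is a simplified illustration rather than Composable Kernel code: the helper name `grouped_conv1d_fwd_multiple_d_ref` is hypothetical, and the sketch assumes a 1D convolution with stride 1, no padding, and no dilation.

```python
def grouped_conv1d_fwd_multiple_d_ref(inp, weight, d_tensors, f):
    """Grouped 1D forward convolution with a fused multi-D epilogue.

    Shapes, as nested lists (illustrative layout, not CK's):
      inp:       [G][C][W]      input, C channels per group
      weight:    [G][K][C][X]   K output channels per group, filter width X
      d_tensors: list of tensors, each [G][K][Wo], with Wo = W - X + 1
      f:         callable f(conv_out, d0, d1, ...) applied per element
    Returns E with shape [G][K][Wo].
    """
    out = []
    for g in range(len(inp)):               # groups are independent
        C, W = len(inp[g]), len(inp[g][0])
        K, X = len(weight[g]), len(weight[g][0][0])
        Wo = W - X + 1
        e_g = [[0.0] * Wo for _ in range(K)]
        for k in range(K):
            for wo in range(Wo):
                acc = 0.0
                for c in range(C):
                    for x in range(X):
                        acc += inp[g][c][wo + x] * weight[g][k][c][x]
                # Fused epilogue: the convolution result is consumed
                # immediately and never stored as a separate tensor.
                e_g[k][wo] = f(acc, *(d[g][k][wo] for d in d_tensors))
        out.append(e_g)
    return out
```

With `f = lambda conv_out, d0: conv_out + d0` this corresponds to the fused addition pattern in the list above, and with `f = lambda conv_out, d0: conv_out * d0` to the gated pattern where `D0` holds a precomputed gate.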
diff --git a/example/37_batched_gemm_add_add_relu_gemm_add/README.md b/example/37_batched_gemm_add_add_relu_gemm_add/README.md index 490c06758e..3aeeb67321 100644 --- a/example/37_batched_gemm_add_add_relu_gemm_add/README.md +++ b/example/37_batched_gemm_add_add_relu_gemm_add/README.md @@ -1,6 +1,6 @@ # Fused Batched GEMM-Add-Add-ReLU-GEMM-Add -This example demonstrates an exceptionally deep and complex fusion, chaining two GEMMs with multiple element-wise additions and a ReLU activation. This pattern is designed to fuse a significant portion of a residual block, such as the feed-forward network (FFN) in a Transformer, into a single, highly optimized kernel. +This example demonstrates an exceptionally deep and complex fusion, chaining two GEMMs with multiple elementwise additions and a ReLU activation. This pattern is designed to fuse a significant portion of a residual block, such as the feed-forward network (FFN) in a Transformer, into a single, highly optimized kernel. ## Mathematical Formulation @@ -9,10 +9,10 @@ The operation computes a complex chain of operations, batched `B` times. For eac 1. **First GEMM (GEMM0)**: $C_{temp1[b]} = A_{[b]} \times B_{[b]}$ -2. **First Add (Add0)**: An element-wise addition with tensor `D0`. +2. **First Add (Add0)**: An elementwise addition with tensor `D0`. $C_{temp2[b]} = C_{temp1[b]} + D0_{[b]}$ -3. **Second Add (Add1)**: Another element-wise addition with tensor `D1`. +3. **Second Add (Add1)**: Another elementwise addition with tensor `D1`. $C_{temp3[b]} = C_{temp2[b]} + D1_{[b]}$ 4. **Activation (ReLU)**: A Rectified Linear Unit activation is applied. @@ -21,14 +21,14 @@ The operation computes a complex chain of operations, batched `B` times. For eac 5. **Second GEMM (GEMM1)**: The result is fed into a second GEMM. $E_{temp[b]} = C_{temp4[b]} \times C_{[b]}$ -6. **Third Add (Add2)**: A final element-wise addition with tensor `D2`. +6. **Third Add (Add2)**: A final elementwise addition with tensor `D2`. 
$E_{[b]} = E_{temp[b]} + D2_{[b]}$ The key optimization is that all intermediate tensors ($C_{temp1}$ through $E_{temp}$) are **never written to global memory**. They are produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory). ## Algorithmic Strategy: Deeply Fused Producer-Consumer Chain -This kernel represents a pinnacle of fusion capability. It chains two "producer-consumer" GEMMs together, with a series of element-wise operations fused into the epilogue of the first GEMM. +This kernel represents a pinnacle of fusion capability. It chains two "producer-consumer" GEMMs together, with a series of elementwise operations fused into the epilogue of the first GEMM. 1. **Batch Scheduling**: The `B` independent problems are distributed across the GPU's thread blocks. Each thread block is assigned to compute the full chain for one batch item `b`. @@ -51,7 +51,7 @@ This deep fusion avoids five separate kernel launches and the associated read/wr ## Source Code Organization - [`batched_gemm_add_add_relu_gemm_add_xdl.cpp`](./batched_gemm_add_add_relu_gemm_add_xdl.cpp): The main example file. It sets up the numerous input tensors (A, B, C, D0, D1, D2) and instantiates the highly specialized device-level operation. -- The device-level interface and underlying grid-wise kernel for this operation are extremely complex, templated on the multiple element-wise operations and managing the intricate data flow between registers, shared memory, and global memory. +- The device-level interface and underlying grid-wise kernel for this operation are extremely complex, templated on the multiple elementwise operations and managing the intricate data flow between registers, shared memory, and global memory. 
## Build and Run diff --git a/example/38_grouped_conv_bwd_data_multiple_d/README.md b/example/38_grouped_conv_bwd_data_multiple_d/README.md index 6eb336d2d2..71f9fd7f76 100644 --- a/example/38_grouped_conv_bwd_data_multiple_d/README.md +++ b/example/38_grouped_conv_bwd_data_multiple_d/README.md @@ -1,6 +1,6 @@ # Grouped Convolution Backward Data with Multiple Elementwise Inputs -This example demonstrates a **Grouped Convolution Backward Data Pass** fused with an element-wise operation that takes multiple auxiliary input tensors (`D` tensors). The backward data pass (also known as a transposed convolution or deconvolution) computes the gradient of the loss with respect to the convolution's *input* tensor. Fusing it with other operations is a powerful way to optimize the backward pass of a neural network. +This example demonstrates a **Grouped Convolution Backward Data Pass** fused with an elementwise operation that takes multiple auxiliary input tensors (`D` tensors). The backward data pass (also known as a transposed convolution or deconvolution) computes the gradient of the loss with respect to the convolution's *input* tensor. Fusing it with other operations is a powerful way to optimize the backward pass of a neural network. ## Mathematical Formulation @@ -10,7 +10,7 @@ The operation computes the gradient with respect to the input (`GradIn`) of a gr $GradIn_{temp[g]} = \text{ConvBwdData}(\text{GradOut}_{[g]}, \text{W}_{[g]})$ Where `GradOut` is the gradient from the subsequent layer and `W` is the weight tensor from the forward pass. -2. **Elementwise Stage**: The result of the backward convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined element-wise function `f`. +2. **Elementwise Stage**: The result of the backward convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined elementwise function `f`. 
$GradIn_{[g]} = f(GradIn_{temp[g]}, D_{0[g]}, D_{1[g]}, \dots)$ This fusion is particularly useful for operations like adding the gradient from a residual "skip" connection, which is a common pattern in modern network architectures. By fusing the addition, we avoid a separate kernel launch and a full read/write pass of the `GradIn` tensor. @@ -23,9 +23,9 @@ The implementation uses the implicit GEMM algorithm, but configured for the back 2. **Implicit GEMM for Backward Data**: The backward data convolution can be mathematically re-arranged to be equivalent to a forward convolution with transformed inputs and weights, which can then be solved with an implicit GEMM algorithm. Composable Kernel handles this transformation. A thread block executes the implicit GEMM for its assigned group, accumulating the `GradIn_temp` result in registers. -3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the element-wise fusion: +3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the elementwise fusion: - Threads load the corresponding tiles from the auxiliary `D` tensors for the assigned group. - - The user-defined element-wise function `f` is applied in registers to the computed gradient and the `D` tensor values. + - The user-defined elementwise function `f` is applied in registers to the computed gradient and the `D` tensor values. - The final result `GradIn` for the group is written to global memory. This strategy minimizes memory bandwidth by avoiding the materialization of the intermediate gradient tensor and maximizes parallelism by executing all groups concurrently. @@ -72,4 +72,4 @@ make -j Fusing operations into the backward pass is a critical optimization for training deep neural networks. - **Fused Residual Gradient**: In a residual block (`y = F(x) + x`), the gradient with respect to `x` is `dF/dx + dy/dx`. 
If `F` is a convolution, `dF/dx` is the output of the `ConvBwdData` operation. The `dy/dx` term (the gradient from the skip connection) can be passed as a `D` tensor and fused via an addition, computing the full gradient for `x` in a single kernel. -- **Fused Gradient Clipping/Scaling**: The `D` tensors and the element-wise function `f` could be used to apply gradient scaling or other custom gradient processing steps directly to the output of the backward convolution, before the result is written back to memory. +- **Fused Gradient Clipping/Scaling**: The `D` tensors and the elementwise function `f` could be used to apply gradient scaling or other custom gradient processing steps directly to the output of the backward convolution, before the result is written back to memory. diff --git a/example/42_groupnorm_fwd/README.md b/example/42_groupnorm_fwd/README.md index feba007e10..3e74a1ecfc 100644 --- a/example/42_groupnorm_fwd/README.md +++ b/example/42_groupnorm_fwd/README.md @@ -49,7 +49,7 @@ Composable Kernel encapsulates this two-pass logic into a single, efficient `Dev - [`groupnorm_fwd_xdl.cpp`](./groupnorm_fwd_xdl.cpp): The main example file. It sets up the input tensor, `gamma` and `beta` vectors, the number of groups, and instantiates the `DeviceGroupnormFwd` operation. - [`../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp): The high-level device interface for the GroupNorm forward pass. -- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an element-wise kernel to apply the normalization. +- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization. 
## Build and Run diff --git a/example/43_splitk_gemm_bias_e_permute/README.md b/example/43_splitk_gemm_bias_e_permute/README.md index 447638aa45..2f56ba01f8 100644 --- a/example/43_splitk_gemm_bias_e_permute/README.md +++ b/example/43_splitk_gemm_bias_e_permute/README.md @@ -1,6 +1,6 @@ # Split-K GEMM with Bias, Elementwise Operation, and Permutation -This example demonstrates a highly complex fusion: a **Split-K GEMM** where the final result is fused with a bias addition, a second element-wise operation, and a final permutation. This kernel combines the parallelism-enhancing Split-K strategy with a multi-stage epilogue, making it suitable for accelerating very large or "skinny" GEMMs that are part of a more complex computational graph. +This example demonstrates a highly complex fusion: a **Split-K GEMM** where the final result is fused with a bias addition, a second elementwise operation, and a final permutation. This kernel combines the parallelism-enhancing Split-K strategy with a multi-stage epilogue, making it suitable for accelerating very large or "skinny" GEMMs that are part of a more complex computational graph. ## Mathematical Formulation @@ -12,7 +12,7 @@ The operation first computes a GEMM using the Split-K algorithm and then applies 2. **Bias Addition Stage**: A bias vector `D` is broadcast and added. $C_{temp2} = C_{temp1} + D$ -3. **Elementwise Stage**: A second element-wise operation is performed with tensor `E`. +3. **Elementwise Stage**: A second elementwise operation is performed with tensor `E`. $C_{temp3} = C_{temp2} \odot E$ 4. **Permutation Stage**: The final result is permuted. @@ -38,7 +38,7 @@ Composable Kernel's implementation abstracts this complexity, providing a single ## Source Code Organization -- [`splitk_gemm_bias_e_permute_xdl.cpp`](./splitk_gemm_bias_e_permute_xdl.cpp): The main example file. 
It sets up the GEMM problem, the bias and element-wise tensors, the permutation, and instantiates the `DeviceSplitkGemmBiasEPermute` operation. +- [`splitk_gemm_bias_e_permute_xdl.cpp`](./splitk_gemm_bias_e_permute_xdl.cpp): The main example file. It sets up the GEMM problem, the bias and elementwise tensors, the permutation, and instantiates the `DeviceSplitkGemmBiasEPermute` operation. - The device-level interface and underlying kernels are highly specialized. They manage the Split-K parameter, the workspace allocation (if needed), and the two-stage execution process, combining the logic from `DeviceGemmSplitK` and `DeviceGemmBiasEPermute`. ## Build and Run diff --git a/example/45_elementwise_normalization/README.md b/example/45_elementwise_normalization/README.md index 45ec7ed89a..895ed84ad2 100644 --- a/example/45_elementwise_normalization/README.md +++ b/example/45_elementwise_normalization/README.md @@ -1,14 +1,14 @@ # Elementwise Normalization -This example demonstrates a fused **element-wise operation followed by normalization**. This pattern combines element-wise tensor arithmetic with a normalization operation in a single kernel, which is particularly useful for implementing custom normalization layers or fused activation-normalization blocks. +This example demonstrates a fused **elementwise operation followed by normalization**. This pattern combines elementwise tensor arithmetic with a normalization operation in a single kernel, which is particularly useful for implementing custom normalization layers or fused activation-normalization blocks. ## Mathematical Formulation -The operation performs an element-wise computation followed by a normalization operation. +The operation performs an elementwise computation followed by a normalization operation. -1. **Elementwise Stage**: An element-wise operation is applied to one or more input tensors. +1. **Elementwise Stage**: An elementwise operation is applied to one or more input tensors. 
$C_{temp} = f(A, B, \dots)$ - Where `f` is a user-defined element-wise function that operates on corresponding elements of the input tensors. + Where `f` is a user-defined elementwise function that operates on corresponding elements of the input tensors. 2. **Normalization Stage**: The result is then normalized. The normalization can be performed along specified dimensions. - **Compute Statistics**: For each normalization group, compute the mean and variance. @@ -19,31 +19,31 @@ The operation performs an element-wise computation followed by a normalization o - **Scale and Shift**: Apply learnable parameters. $D = \gamma \cdot \hat{C} + \beta$ -The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The element-wise computation feeds directly into the normalization calculation. +The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The elementwise computation feeds directly into the normalization calculation. ## Algorithmic Strategy: Fused Elementwise with Online Normalization -The implementation combines element-wise computation with an online normalization algorithm. +The implementation combines elementwise computation with an online normalization algorithm. 1. **Grid Scheduling**: The normalization groups are distributed among thread blocks. Each block handles one or more normalization groups. 2. **Fused Two-Pass Algorithm**: - **Pass 1 - Compute Elementwise and Moments**: - - Threads cooperatively load input tensors and apply the element-wise function `f`. - - The element-wise results are kept in registers/shared memory. - - **Welford's Algorithm**: Threads use Welford's online algorithm to compute the mean and variance of the element-wise results within their normalization group. + - Threads cooperatively load input tensors and apply the elementwise function `f`. + - The elementwise results are kept in registers/shared memory. 
+ - **Welford's Algorithm**: Threads use Welford's online algorithm to compute the mean and variance of the elementwise results within their normalization group. - **Intra-Block Reduction**: A parallel reduction in shared memory computes the final statistics for the group. - **Pass 2 - Normalize and Store**: - - Using the computed statistics, threads apply the normalization formula to their element-wise results. + - Using the computed statistics, threads apply the normalization formula to their elementwise results. - The final normalized result is written to the output tensor `D`. -This approach ensures that the element-wise computation is performed only once, and the results are immediately consumed by the normalization process without requiring additional memory bandwidth. +This approach ensures that the elementwise computation is performed only once, and the results are immediately consumed by the normalization process without requiring additional memory bandwidth. ## Source Code Organization -- [`elementwise_normalization_xdl.cpp`](./elementwise_normalization_xdl.cpp): The main example file. It sets up the input tensors, defines the element-wise operation and normalization parameters, and instantiates the `DeviceElementwiseNormalization` operation. -- [`../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp`](../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp): The high-level device interface for the fused element-wise normalization operation. -- The underlying grid-wise kernel implements the complex fusion of element-wise operations with the two-pass normalization algorithm. +- [`elementwise_normalization_xdl.cpp`](./elementwise_normalization_xdl.cpp): The main example file. It sets up the input tensors, defines the elementwise operation and normalization parameters, and instantiates the `DeviceElementwiseNormalization` operation. 
+- [`../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp`](../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp): The high-level device interface for the fused elementwise normalization operation. +- The underlying grid-wise kernel implements the complex fusion of elementwise operations with the two-pass normalization algorithm. ## Build and Run @@ -81,6 +81,6 @@ make -j This fused operation is valuable for implementing custom normalization layers and optimizing activation-normalization sequences. - **Custom Activation-Normalization Blocks**: Some architectures use non-standard activation functions followed by normalization. For example, a Swish activation followed by layer normalization can be fused into a single kernel using this pattern. -- **Residual Connection with Normalization**: In some variants of residual networks, the residual addition is immediately followed by normalization. This can be expressed as an element-wise addition (residual) followed by normalization. -- **Preprocessing Pipelines**: In data preprocessing, tensors might need element-wise transformations (e.g., color space conversion) followed by normalization (e.g., standardization). This kernel can fuse these operations. -- **Research Architectures**: Novel normalization techniques often involve custom element-wise operations before the normalization step. This kernel provides a flexible foundation for implementing such research ideas efficiently. +- **Residual Connection with Normalization**: In some variants of residual networks, the residual addition is immediately followed by normalization. This can be expressed as an elementwise addition (residual) followed by normalization. +- **Preprocessing Pipelines**: In data preprocessing, tensors might need elementwise transformations (e.g., color space conversion) followed by normalization (e.g., standardization). This kernel can fuse these operations. 
+- **Research Architectures**: Novel normalization techniques often involve custom elementwise operations before the normalization step. This kernel provides a flexible foundation for implementing such research ideas efficiently. diff --git a/example/62_convnd_activ/README.md b/example/62_convnd_activ/README.md index 588c853791..b6954bde72 100644 --- a/example/62_convnd_activ/README.md +++ b/example/62_convnd_activ/README.md @@ -1,6 +1,6 @@ # N-Dimensional Convolution with Activation -This example demonstrates an **N-dimensional convolution forward pass fused with an activation function**. This fusion pattern combines the convolution operation with element-wise activation functions in a single kernel, which is extremely common in convolutional neural networks and provides significant performance benefits. +This example demonstrates an **N-dimensional convolution forward pass fused with an activation function**. This fusion pattern combines the convolution operation with elementwise activation functions in a single kernel, which is extremely common in convolutional neural networks and provides significant performance benefits. ## Mathematical Formulation @@ -10,7 +10,7 @@ The operation performs an N-dimensional convolution followed immediately by an a $C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$ Where `In` is the input tensor, `W` is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional. -2. **Activation Function**: Apply an element-wise activation function to the convolution result. +2. **Activation Function**: Apply an elementwise activation function to the convolution result. 
$\text{Out} = \text{Activation}(C_{temp})$ Common activation functions include: - **ReLU**: $\text{ReLU}(x) = \max(0, x)$ diff --git a/example/65_gemm_multiply_multiply/README.md b/example/65_gemm_multiply_multiply/README.md index a35e961b41..d6d169e29d 100644 --- a/example/65_gemm_multiply_multiply/README.md +++ b/example/65_gemm_multiply_multiply/README.md @@ -1,10 +1,10 @@ # GEMM with Double Multiply Operations -This example demonstrates a **GEMM followed by two sequential element-wise multiplication operations**. This fusion pattern is useful for implementing layers that require matrix multiplication followed by multiple scaling or masking operations, such as certain attention mechanisms or gated neural network architectures. +This example demonstrates a **GEMM followed by two sequential elementwise multiplication operations**. This fusion pattern is useful for implementing layers that require matrix multiplication followed by multiple scaling or masking operations, such as certain attention mechanisms or gated neural network architectures. ## Mathematical Formulation -The operation performs a matrix multiplication followed by two sequential element-wise multiplications. +The operation performs a matrix multiplication followed by two sequential elementwise multiplications. 1. **GEMM Stage**: A standard matrix multiplication. $C_{temp1} = A \times B$ @@ -25,9 +25,9 @@ The implementation uses a tiled GEMM algorithm with a multi-stage fused epilogue 2. **Dual-Multiply Epilogue**: Before any data is written to global memory, the following sequence occurs for the tile of data held in registers: - **Load First Multiplicand**: Threads load the corresponding elements of tensor `D`. - - **First Multiplication**: The element-wise multiplication is performed in registers: `result *= D`. + - **First Multiplication**: The elementwise multiplication is performed in registers: `result *= D`. 
- **Load Second Multiplicand**: Threads load the corresponding elements of tensor `E`. - - **Second Multiplication**: The second element-wise multiplication is performed in registers: `result *= E`. + - **Second Multiplication**: The second elementwise multiplication is performed in registers: `result *= E`. - **Store Final Result**: The final result `F` is written to global memory. This deep fusion eliminates multiple kernel launches and the memory bandwidth required to write and re-read intermediate tensors. @@ -101,7 +101,7 @@ The performance benefits of this fusion depend on several factors: This pattern can be extended in several ways: - **More Multiplications**: Additional sequential multiplications can be added to the epilogue -- **Mixed Operations**: Combine multiplications with additions or other element-wise operations +- **Mixed Operations**: Combine multiplications with additions or other elementwise operations - **Conditional Operations**: Apply multiplications conditionally based on masks or thresholds - **Broadcasting**: Handle different broadcasting patterns for the multiplicand tensors diff --git a/example/66_complex_contraction_bilinear/README.md b/example/66_complex_contraction_bilinear/README.md index fc05c4aced..1b43fdaea5 100644 --- a/example/66_complex_contraction_bilinear/README.md +++ b/example/66_complex_contraction_bilinear/README.md @@ -20,7 +20,7 @@ Given complex tensors with real and imaginary components: $F = \text{BilinearOp}(C_{temp}, D, E, \ldots)$ The bilinear operations can include various combinations such as: -- $F = C_{temp} \odot D + E$ (element-wise multiply and add) +- $F = C_{temp} \odot D + E$ (elementwise multiply and add) - $F = \alpha \cdot C_{temp} + \beta \cdot (D \odot E)$ (scaled combinations) - More complex multi-term bilinear expressions diff --git a/example/ck_tile/17_grouped_gemm/README.md b/example/ck_tile/17_grouped_gemm/README.md index 3fac4263b1..30ad79d07f 100644 --- 
a/example/ck_tile/17_grouped_gemm/README.md +++ b/example/ck_tile/17_grouped_gemm/README.md @@ -21,13 +21,13 @@ Persistence mode is a GPU optimization where thread blocks remain active on the - **Usage**: `invoke_gemm` enables persistence #### Multi-D Operations -Multi-D operations extend the standard GEMM operation by supporting additional element-wise operations on the result tensor. This feature is particularly useful for workloads that require post-processing of the GEMM output. +Multi-D operations extend the standard GEMM operation by supporting additional elementwise operations on the result tensor. This feature is particularly useful for workloads that require post-processing of the GEMM output. - **Implementation**: Available in `grouped_gemm_multi_d.cpp` - **Operation**: E = C × D₀ × D₁ (where C = A × B is the standard GEMM result) - **Configuration**: Uses `GemmConfigV3`, `GemmConfigV4`, `GemmConfigMemory` template configuration with 2 D tensors - **Data Types**: Supports fp16 -- **Benefits**: Enables complex operations like scaling, activation functions, or other element-wise transformations in a single kernel call +- **Benefits**: Enables complex operations like scaling, activation functions, or other elementwise transformations in a single kernel call - **Build Target**: `make tile_example_grouped_gemm_multi_d -j` Multi-D operations supports both persistence and non-persistence modes.
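
Reviewer note on the Multi-D hunk above: the documented operation, E = C × D₀ × D₁ with C = A × B, can be checked against a small host-side reference. The following is a minimal sketch and not CK code. It assumes D₀ and D₁ have the same shape as C (no broadcasting) and that the "×" applied to D₀ and D₁ is elementwise, as the surrounding text suggests. The function name `gemm_multi_d_reference` is hypothetical.

```python
def gemm_multi_d_reference(a, b, d0, d1):
    """Host reference for E = (A x B) * D0 * D1 (elementwise multiplies).

    Hypothetical helper for illustration only; matrices are row-major
    lists of lists, and D0/D1 are assumed to match the shape of A x B.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B do not match")

    e = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # GEMM stage: accumulate one element of C = A x B.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Epilogue stage: both multiplies are applied to the accumulator
            # before the single store, so C itself is never materialized.
            e[i][j] = acc * d0[i][j] * d1[i][j]
    return e


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
d0 = [[1.0, 0.5], [2.0, 1.0]]
d1 = [[2.0, 2.0], [0.5, 1.0]]
e = gemm_multi_d_reference(a, b, d0, d1)
```

Here A × B is `[[19, 22], [43, 50]]`, so `e` should be `[[38.0, 22.0], [43.0, 50.0]]`. A device result from `grouped_gemm_multi_d.cpp` would need a tolerance-based comparison, since the README lists fp16 as the supported data type.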