* GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples * GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples Refine README files by removing outdated references to the Tile Engine * Updates based on PR feedback 1 * Updates based on PR feedback 2 * Updates based on PR feedback 3 * Updates based on PR feedback 4 * Updates based on PR feedback 5 * Updates based on PR feedback 6 * Updates based on PR feedback 7 * Updates based on PR feedback 8 * Content Modification of CK Tile Example * Modify the ck_tile gemm config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>
5.3 KiB
Fused Grouped-Convolution -> Convolution Forward
This example demonstrates a fused Grouped Convolution followed by a standard Convolution. This pattern is specifically designed to optimize Depthwise Separable Convolutions, a key building block in many modern, efficient convolutional neural networks like MobileNet.
A Depthwise Separable Convolution consists of two stages:
- Depthwise Convolution: A grouped convolution where the number of groups is equal to the number of input channels (
groups = in_channels). Each filter is applied to exactly one input channel. - Pointwise Convolution: A standard
1x1convolution that projects the channels from the depthwise stage into a new output channel space.
Fusing these two stages into a single kernel can provide significant performance benefits.
Mathematical Formulation
The operation computes a chain of two convolutions, where the first is a grouped convolution.
-
First Convolution (Conv0 - Grouped/Depthwise):
D_{temp} = \text{GroupedConv}(\text{In}, \text{W0})WhereInis the input tensor andW0are the weights for the grouped convolution. -
Second Convolution (Conv1 - Pointwise):
Out = \text{Conv}(\text{D}_{temp}, \text{W1})WhereD_tempis the output of the first stage andW1are the weights for the second convolution (typically1x1filters).
The critical optimization is that the intermediate tensor D_temp is never written to global memory. It is produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory), saving a massive amount of memory bandwidth.
Algorithmic Strategy: Fused Implicit GEMM-GEMM via Shared Memory
The implementation maps the two-stage convolution into a fused GEMM-GEMM problem, using shared memory as the communication buffer between the stages.
-
Grid Scheduling: The problem is parallelized across the thread blocks of the GPU. Each thread block is assigned to compute a tile of the final output tensor
Out. -
Fused Execution within a Thread Block: To compute its output tile, a thread block must perform both convolution stages for the corresponding input region.
- Compute Conv0 Tile: The thread block first computes a tile of the intermediate tensor,
D_{temp}, using the implicit GEMM algorithm for the grouped convolution. The result of this computation is stored directly into a designated region of shared memory (LDS). - Synchronization: A block-wide synchronization (
__syncthreads()) is performed. This is a critical step that ensures the entire tile ofD_{temp}is visible to all threads in the block before the second convolution begins. - Compute Conv1 Tile: The threads then immediately start computing the second convolution. They use the intermediate tile stored in shared memory as the input for this second stage, applying the implicit GEMM algorithm for the pointwise convolution. The result is accumulated in registers.
- Store Final Result: Once a tile of the final output
Outis computed, it is written to global memory.
- Compute Conv0 Tile: The thread block first computes a tile of the intermediate tensor,
This "producer-consumer" pattern within a thread block is highly efficient. It treats shared memory as a fast, programmable cache for the intermediate tensor, completely avoiding the slow round-trip to global HBM memory that would be required by two separate kernel calls.
Source Code Organization
grouped_conv_conv_fwd_xdl.cpp: The main example file. It sets up the input tensor and the two weight tensors (W0 for grouped, W1 for standard) and instantiates theDeviceGroupedConvConvFwdoperation.../../include/ck/tensor_operation/gpu/device/device_grouped_conv_conv_fwd.hpp: The high-level device interface for the fused convolution operation.- The underlying grid-wise kernel implements the complex fusion logic, managing the two implicit GEMM calculations and the data flow through shared memory.
Build and Run
Prerequisites
Ensure the Composable Kernel library is built and installed.
cd /path/to/composable_kernel/build
make -j install
Build the Example
cd /path/to/composable_kernel/example/41_grouped_conv_conv_fwd
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
# Run the example with default settings
./grouped_conv_conv_fwd_xdl
# Run with verification, data initialization, and timing
./grouped_conv_conv_fwd_xdl 1 2 1
Application to Efficient CNNs
As mentioned, this kernel is a direct, high-performance implementation of a Depthwise Separable Convolution. This architectural primitive is the foundation of many efficient CNNs, including:
- MobileNets (V1, V2, V3): Designed for high performance on mobile and edge devices.
- EfficientNets: A family of models that systematically scale model depth, width, and resolution to achieve high accuracy with fewer parameters and FLOPs.
- Xception: A model that takes the idea of separable convolutions to an extreme.
By providing a fused kernel for this common pattern, Composable Kernel allows developers to achieve significantly better performance for these models than would be possible by calling a library for the depthwise and pointwise stages separately.