ROCm/composable_kernel

Fork 0

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-04-20 14:59:17 +00:00

Files

Vidyasagar Ananthan 92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495 )

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

2025-10-16 03:10:57 -07:00

5.3 KiB

Raw Blame History

Fused Grouped-Convolution -> Convolution Forward

This example demonstrates a fused Grouped Convolution followed by a standard Convolution. This pattern is specifically designed to optimize Depthwise Separable Convolutions, a key building block in many modern, efficient convolutional neural networks like MobileNet.

A Depthwise Separable Convolution consists of two stages:

Depthwise Convolution: A grouped convolution where the number of groups is equal to the number of input channels (groups = in_channels). Each filter is applied to exactly one input channel.
Pointwise Convolution: A standard 1x1 convolution that projects the channels from the depthwise stage into a new output channel space.

Fusing these two stages into a single kernel can provide significant performance benefits.

Mathematical Formulation

The operation computes a chain of two convolutions, where the first is a grouped convolution.

First Convolution (Conv0 - Grouped/Depthwise): D_{temp} = \text{GroupedConv}(\text{In}, \text{W0}) Where In is the input tensor and W0 are the weights for the grouped convolution.
Second Convolution (Conv1 - Pointwise): Out = \text{Conv}(\text{D}_{temp}, \text{W1}) Where D_temp is the output of the first stage and W1 are the weights for the second convolution (typically 1x1 filters).

The critical optimization is that the intermediate tensor D_temp is never written to global memory. It is produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory), saving a massive amount of memory bandwidth.

Algorithmic Strategy: Fused Implicit GEMM-GEMM via Shared Memory

The implementation maps the two-stage convolution into a fused GEMM-GEMM problem, using shared memory as the communication buffer between the stages.

Grid Scheduling: The problem is parallelized across the thread blocks of the GPU. Each thread block is assigned to compute a tile of the final output tensor Out.
Fused Execution within a Thread Block: To compute its output tile, a thread block must perform both convolution stages for the corresponding input region.
- Compute Conv0 Tile: The thread block first computes a tile of the intermediate tensor, D_{temp}, using the implicit GEMM algorithm for the grouped convolution. The result of this computation is stored directly into a designated region of shared memory (LDS).
- Synchronization: A block-wide synchronization (__syncthreads()) is performed. This is a critical step that ensures the entire tile of D_{temp} is visible to all threads in the block before the second convolution begins.
- Compute Conv1 Tile: The threads then immediately start computing the second convolution. They use the intermediate tile stored in shared memory as the input for this second stage, applying the implicit GEMM algorithm for the pointwise convolution. The result is accumulated in registers.
- Store Final Result: Once a tile of the final output Out is computed, it is written to global memory.

This "producer-consumer" pattern within a thread block is highly efficient. It treats shared memory as a fast, programmable cache for the intermediate tensor, completely avoiding the slow round-trip to global HBM memory that would be required by two separate kernel calls.

Source Code Organization

grouped_conv_conv_fwd_xdl.cpp: The main example file. It sets up the input tensor and the two weight tensors (W0 for grouped, W1 for standard) and instantiates the DeviceGroupedConvConvFwd operation.
../../include/ck/tensor_operation/gpu/device/device_grouped_conv_conv_fwd.hpp: The high-level device interface for the fused convolution operation.
The underlying grid-wise kernel implements the complex fusion logic, managing the two implicit GEMM calculations and the data flow through shared memory.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/41_grouped_conv_conv_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./grouped_conv_conv_fwd_xdl

# Run with verification, data initialization, and timing
./grouped_conv_conv_fwd_xdl 1 2 1

Application to Efficient CNNs

As mentioned, this kernel is a direct, high-performance implementation of a Depthwise Separable Convolution. This architectural primitive is the foundation of many efficient CNNs, including:

MobileNets (V1, V2, V3): Designed for high performance on mobile and edge devices.
EfficientNets: A family of models that systematically scale model depth, width, and resolution to achieve high accuracy with fewer parameters and FLOPs.
Xception: A model that takes the idea of separable convolutions to an extreme.

By providing a fused kernel for this common pattern, Composable Kernel allows developers to achieve significantly better performance for these models than would be possible by calling a library for the depthwise and pointwise stages separately.

5.3 KiB Raw Blame History

Fused Grouped-Convolution -> Convolution Forward

Mathematical Formulation

Algorithmic Strategy: Fused Implicit GEMM-GEMM via Shared Memory

Source Code Organization

Build and Run

Prerequisites

Build the Example

Run the Example

Application to Efficient CNNs

5.3 KiB

Raw Blame History