# Split-K GEMM with Bias, Elementwise Operation, and Permutation
This example demonstrates a complex fusion: a Split-K GEMM whose result is fused with a bias addition, a second elementwise operation, and a final permutation. The kernel combines the parallelism-enhancing Split-K strategy with a multi-stage epilogue, making it suitable for accelerating very large or "skinny" GEMMs that are part of a larger computational graph.
## Mathematical Formulation
The operation first computes a GEMM using the Split-K algorithm and then applies a sequence of fused operations.
1. **Split-K GEMM Stage**: The matrix multiplication $C_{temp1} = A \times B$ is computed by splitting the $K$ dimension into $S$ chunks and summing the partial products:

   $$C_{temp1} = \sum_{s=0}^{S-1} (A_s \times B_s)$$

2. **Bias Addition Stage**: A bias vector $D$ is broadcast and added:

   $$C_{temp2} = C_{temp1} + D$$

3. **Elementwise Stage**: A second elementwise operation is performed with tensor $E$:

   $$C_{temp3} = C_{temp2} \odot E$$

4. **Permutation Stage**: The final result is permuted:

   $$F = \text{permute}(C_{temp3})$$
The key is that the reduction (summation) of the partial GEMM products is fused with the entire epilogue chain (Bias, E-wise, Permute).
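To make the formulation concrete, here is a small host-side sketch in plain C++ of the same math (this is an illustration, not the CK device code; function names like `splitk_gemm` and `epilogue` are hypothetical, and the general permutation is represented by a simple M×N → N×M transpose):

```cpp
#include <vector>

// Row-major flattened matrices: A is M x K, B is K x N, D is a length-N bias.
using Mat = std::vector<float>;

// Split-K GEMM: split the K dimension into S chunks and sum the partials.
Mat splitk_gemm(const Mat& A, const Mat& B, int M, int N, int K, int S) {
    Mat C(M * N, 0.0f);
    int chunk = K / S; // assume K divisible by S for this sketch
    for (int s = 0; s < S; ++s) {            // one partial GEMM per chunk
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                float partial = 0.0f;
                for (int k = s * chunk; k < (s + 1) * chunk; ++k)
                    partial += A[m * K + k] * B[k * N + n];
                C[m * N + n] += partial;     // reduction over the S partials
            }
    }
    return C;
}

// Epilogue: broadcast-add bias D, multiply elementwise by E, then permute
// (here: transpose M x N -> N x M as a stand-in for a general permutation).
Mat epilogue(const Mat& C, const Mat& D, const Mat& E, int M, int N) {
    Mat F(M * N);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float v = (C[m * N + n] + D[n]) * E[m * N + n];
            F[n * M + m] = v; // permuted write position
        }
    return F;
}
```

Note that the result is independent of $S$: splitting $K$ only changes how the sum is grouped, which is what makes the Split-K decomposition valid.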
## Algorithmic Strategy: Split-K with a Fused Reduction Epilogue
The implementation combines the Split-K algorithm with the multi-stage fused epilogue seen in previous examples.
1. **Splitting the K-Dimension**: The $K$ dimension is logically split into $S$ parts, creating $S$ parallel partial GEMM problems.
2. **Parallel Partial GEMMs**: The $S$ partial GEMMs are executed in parallel across the GPU's thread blocks. Each thread block is assigned a tile of a partial product $C_s$.
3. **Fused Reduction and Epilogue**: The method for reducing the partial sums and applying the epilogue is critical.
   - **Workspace Approach**: A common strategy is to use a temporary workspace in global memory.
     - **Stage 1 (Partial Products)**: Each of the $S$ parallel GEMMs computes its partial product $C_s$ and writes it to a unique slice of a temporary workspace tensor.
     - **Stage 2 (Reduce + Epilogue)**: A second, specialized kernel reads the $S$ partial products from the workspace, reduces (sums) them on the fly, and immediately applies the full Bias-E-Permute epilogue before writing the final result $F$ to memory.
   - **Atomic-Based Approach**: For some data types and operations, the reduction can be performed with atomic operations: the first block to reach an output element writes its partial result, and subsequent blocks read the intermediate value, add their contribution, and atomically write the new sum back. This approach is more complex and often less performant due to atomic contention.
Composable Kernel's implementation abstracts this complexity, providing a single device-level operation that manages the workspace, the two stages, and the complex epilogue.
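The two stages of the workspace approach can be sketched on the host as follows (an illustration only, assuming a toy bias-add epilogue; the function names `stage1_partial` and `stage2_reduce_epilogue` are hypothetical, not CK APIs):

```cpp
#include <vector>

using Buf = std::vector<float>;

// Stage 1: each of the S partial GEMMs writes its partial product C_s
// into its own M x N slice of the workspace (slice s starts at s*M*N).
void stage1_partial(const Buf& A, const Buf& B, Buf& workspace,
                    int M, int N, int K, int S) {
    int chunk = K / S; // assume divisibility for the sketch
    for (int s = 0; s < S; ++s)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (int k = s * chunk; k < (s + 1) * chunk; ++k)
                    acc += A[m * K + k] * B[k * N + n];
                workspace[s * M * N + m * N + n] = acc; // slice s
            }
}

// Stage 2: reduce the S workspace slices on the fly, then apply the
// epilogue (here just a broadcast bias add) before the final store.
void stage2_reduce_epilogue(const Buf& workspace, const Buf& D, Buf& F,
                            int M, int N, int S) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int s = 0; s < S; ++s)
                sum += workspace[s * M * N + m * N + n];
            F[m * N + n] = sum + D[n]; // fused reduce + epilogue
        }
}
```

The design point is that the partial products are read back exactly once, and the epilogue is applied during that single reduction pass, avoiding an extra round trip to global memory for each epilogue operation.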
## Source Code Organization
- `splitk_gemm_bias_e_permute_xdl.cpp`: The main example file. It sets up the GEMM problem, the bias and elementwise tensors, and the permutation, and instantiates the `DeviceSplitkGemmBiasEPermute` operation.
- The device-level interface and underlying kernels are highly specialized. They manage the Split-K parameter, the workspace allocation (if needed), and the two-stage execution process, combining the logic from `DeviceGemmSplitK` and `DeviceGemmBiasEPermute`.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/43_splitk_gemm_bias_e_permute
mkdir build && cd build
cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..
make -j
```
### Run the Example
```bash
# Run the example with default settings
./splitk_gemm_bias_e_permute_xdl

# Run with verification, data initialization, and timing
./splitk_gemm_bias_e_permute_xdl 1 2 1
```
## Applications
This highly specialized kernel is useful when a very large GEMM (that would benefit from Split-K) is immediately followed by a series of operations that can be fused.
- **Large Feed-Forward Networks**: In a Transformer with a very large hidden dimension, the GEMMs in the FFN block can become "skinny" (large K, smaller M/N). If the FFN is also fused with residual connections (bias/add) and layout permutations, this kernel can be a good fit, offering both the parallelism benefits of Split-K and the memory-bandwidth savings of the fused epilogue.
- **Final Classifier Layers**: The final layer of a large classification model is often a very large GEMM. If its output needs to be reshaped or post-processed, this kernel can fuse those operations directly into the Split-K GEMM.
This example showcases the extreme composability of the library, allowing for the creation of highly tailored, high-performance kernels that combine different algorithmic strategies (like Split-K) with deep fusion.