Files
composable_kernel/client_example/20_splitk_gemm
Aviral Goel c8563f2101 chore(copyright): update copyright header for test directory (#3252)
* chore(copyright): update copyright header for test directory

* chore(copyright): update copyright header for test directory

* chore(copyright): update copyright header for client_example directory

* chore(copyright): update copyright header for test directory
2025-11-20 20:36:57 -05:00
..

Client Example: Split-K GEMM

Theory

This client example demonstrates Split-K GEMM, a technique for parallelizing matrix multiplication along the K dimension. Split-K is used to improve parallelism and memory bandwidth utilization for large GEMM operations, especially when K is large.

Mathematical Formulation:

  • Standard GEMM: C = A \times B
  • Split-K: Partition the K dimension into K_s splits, compute partial results, then reduce:
    
    C = \sum_{s=1}^{K_s} (A_{[:, K_s]} \times B_{[K_s, :]})
    

Algorithmic Background:

  • Each split computes a partial GEMM over a chunk of K.
  • Partial results are reduced (summed) to produce the final output.
  • Useful for large K, limited workspace, or maximizing GPU occupancy.

How to Run

Prerequisites

Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.

Build and run

cd composable_kernel/client_example/20_splitk_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (FP16 compute, FP8 output)
./splitK_gemm_fp16_f8

Source Code Structure

Directory Layout

client_example/20_splitk_gemm/
├── splitK_gemm_fp16_f8.cpp         # Main client example: Split-K GEMM (FP16 compute, FP8 output)
├── CMakeLists.txt                  # Build configuration for the example

Key Functions

  • main() (in splitK_gemm_fp16_f8.cpp):
    Sets up input matrices, configures Split-K parameters, launches the Split-K GEMM kernel, and verifies the result.
  • Split-K kernel invocation:
    Uses the Composable Kernel device API to launch the Split-K GEMM operation.

Additional Details

  • Supports FP16 compute with FP8 output for memory efficiency.
  • Example parameters can be adjusted in the source for different workloads.


Back to Client Examples