mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-15 10:37:44 +00:00
67 lines
2.0 KiB
Markdown
67 lines
2.0 KiB
Markdown
# Client Example: Split-K GEMM
|
|
|
|
## Theory
|
|
|
|
This client example demonstrates **Split-K GEMM**, a technique for parallelizing matrix multiplication along the K dimension. Split-K is used to improve parallelism and memory bandwidth utilization for large GEMM operations, especially when K is large.
|
|
|
|
**Mathematical Formulation:**
|
|
- Standard GEMM: $C = A \times B$
|
|
- Split-K: Partition the K dimension into $K_s$ splits, compute partial results, then reduce:
|
|
$$
|
|
C = \sum_{s=1}^{K_s} (A_{[:, K_s]} \times B_{[K_s, :]})
|
|
$$
|
|
|
|
**Algorithmic Background:**
|
|
- Each split computes a partial GEMM over a chunk of K.
|
|
- Partial results are reduced (summed) to produce the final output.
|
|
- Useful for large K, limited workspace, or maximizing GPU occupancy.
|
|
|
|
## How to Run
|
|
|
|
### Prerequisites
|
|
|
|
Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
|
|
|
|
### Build and run
|
|
```bash
|
|
cd composable_kernel/client_example/20_splitk_gemm
|
|
mkdir build && cd build
|
|
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
|
|
make -j
|
|
|
|
# Example run (FP16 compute, FP8 output)
|
|
./splitK_gemm_fp16_f8
|
|
```
|
|
|
|
## Source Code Structure
|
|
|
|
### Directory Layout
|
|
```
|
|
client_example/20_splitk_gemm/
|
|
├── splitK_gemm_fp16_f8.cpp # Main client example: Split-K GEMM (FP16 compute, FP8 output)
|
|
├── CMakeLists.txt # Build configuration for the example
|
|
```
|
|
|
|
### Key Functions
|
|
|
|
- **main()** (in `splitK_gemm_fp16_f8.cpp`):
|
|
Sets up input matrices, configures Split-K parameters, launches the Split-K GEMM kernel, and verifies the result.
|
|
- **Split-K kernel invocation**:
|
|
Uses the Composable Kernel device API to launch the Split-K GEMM operation.
|
|
|
|
---
|
|
|
|
## Additional Details
|
|
|
|
- Supports FP16 compute with FP8 output for memory efficiency.
|
|
- Example parameters can be adjusted in the source for different workloads.
|
|
|
|
---
|
|
|
|
## Related Examples
|
|
|
|
- [35_splitK_gemm](../../example/35_splitK_gemm/README.md): Split-K GEMM in the main example directory
|
|
|
|
---
|
|
[Back to Client Examples](../README.md)
|