mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-03-21 23:57:39 +00:00
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Grouped GEMM
Theory
This example demonstrates grouped GEMM: performing multiple independent GEMM operations (with potentially different shapes) in a single kernel launch. Grouped GEMM is used in transformer models (e.g., multi-head attention), mixture-of-experts, and other architectures requiring heterogeneous batched matrix multiplications.
Mathematical Formulation:
For G groups, each with its own A_g, B_g, C_g:
$$
C_g = A_g \times B_g \quad \text{for} \quad g = 1, 2, \dots, G
$$

- $A_g$: [M_g, K_g] input matrix for group $g$
- $B_g$: [K_g, N_g] weight matrix for group $g$
- $C_g$: [M_g, N_g] output matrix for group $g$
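The formulation above can be illustrated with a minimal host-side reference, useful as a mental model of what the verification step compares against. This is a sketch under stated assumptions, not Composable Kernel code: the `GemmProblem` struct and `reference_grouped_gemm` function are hypothetical names, and row-major storage with `float` elements is assumed for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration; not part of the Composable Kernel API.
struct GemmProblem
{
    std::size_t M, N, K;
    std::vector<float> A; // row-major, M x K
    std::vector<float> B; // row-major, K x N
    std::vector<float> C; // row-major, M x N (output)
};

// Computes C_g = A_g * B_g independently for every group on the host.
// Each group may have its own M, N and K.
inline void reference_grouped_gemm(std::vector<GemmProblem>& groups)
{
    for(auto& g : groups)
    {
        assert(g.A.size() == g.M * g.K);
        assert(g.B.size() == g.K * g.N);
        g.C.assign(g.M * g.N, 0.0f);

        for(std::size_t m = 0; m < g.M; ++m)
        {
            for(std::size_t k = 0; k < g.K; ++k)
            {
                const float a = g.A[m * g.K + k];
                for(std::size_t n = 0; n < g.N; ++n)
                {
                    g.C[m * g.N + n] += a * g.B[k * g.N + n];
                }
            }
        }
    }
}
```

The groups share nothing but the loop that visits them, which is what makes it possible to fold them into a single kernel launch on the device.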
Algorithmic Background:
- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- Useful for variable-length sequences, multi-head attention, and expert routing.
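One common way to realize the block-to-group assignment described above is to give each group a contiguous range of flat block ids, one block per output tile, and let every block look up which range it falls into. The sketch below shows that bookkeeping on the host. It is an illustration under assumptions, not necessarily how the Composable Kernel implementation does it: `GroupShape`, `BlockRange`, `make_block_ranges`, `find_group`, and the tile-size parameters are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration; not the Composable Kernel API.
struct GroupShape
{
    std::size_t M, N; // output matrix dimensions of one group
};

struct BlockRange
{
    std::size_t begin, end; // half-open range [begin, end) of flat block ids
};

inline std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Assigns each group a contiguous range of block ids, one block per output tile.
inline std::vector<BlockRange> make_block_ranges(const std::vector<GroupShape>& groups,
                                                 std::size_t m_per_block,
                                                 std::size_t n_per_block)
{
    assert(m_per_block > 0 && n_per_block > 0);
    std::vector<BlockRange> ranges;
    std::size_t next = 0;
    for(const auto& g : groups)
    {
        const std::size_t tiles = ceil_div(g.M, m_per_block) * ceil_div(g.N, n_per_block);
        ranges.push_back({next, next + tiles});
        next += tiles;
    }
    return ranges;
}

// Returns the index of the group that owns block_id,
// or ranges.size() if block_id lies outside the grid.
inline std::size_t find_group(const std::vector<BlockRange>& ranges, std::size_t block_id)
{
    for(std::size_t g = 0; g < ranges.size(); ++g)
    {
        if(block_id >= ranges[g].begin && block_id < ranges[g].end)
        {
            return g;
        }
    }
    return ranges.size();
}
```

With this layout the total grid size is the `end` of the last range, and a block's tile index within its group is `block_id - ranges[g].begin`. A linear search is assumed here for clarity; a binary search over the range starts would serve when the group count is large.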
How to Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build and run
cd composable_kernel/example/15_grouped_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
Run example_grouped_gemm_xdl
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_grouped_gemm_xdl_fp16 0 1 5
Source Code Structure
Directory Layout
example/15_grouped_gemm/
└── grouped_gemm_xdl.cpp            # Main example: sets up, runs, and verifies grouped GEMM
include/ck/tensor_operation/gpu/device/
└── device_grouped_gemm_xdl.hpp     # Device-level grouped GEMM API
include/ck/tensor_operation/gpu/grid/
└── gridwise_grouped_gemm_xdl.hpp   # Grid-level grouped GEMM kernel
Key Classes and Functions
- DeviceGroupedGemmXdl (in device_grouped_gemm_xdl.hpp): Device API for grouped GEMM.
- gridwise_grouped_gemm_xdl (in gridwise_grouped_gemm_xdl.hpp): Implements the tiled/blocking grouped GEMM kernel.
This example demonstrates how Composable Kernel supports efficient heterogeneous batched matrix multiplication for advanced AI/ML workloads.