[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - **Linear**: tile-starter sequentially accumulates partials from contributing CTAs - **Tree**: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear|tree` and `--streamk_persistent=0|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): | Shape | 16x64 tiles | | 128x128 tiles | | |---|---|---|---|---| | | Split-K | StreamK | Split-K | StreamK | | 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x | | 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x | | 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x | | 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x | | 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x | | 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x | | 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x | | 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x | StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Builder
This directory contains the experimental builder feature for composable_kernel.
- Status: In development (October 2025 - March 2026)
Overview
The builder provides a high-level, semantically-clear interface for constructing composable kernel operations, with an initial focus on convolution kernels for MIOpen. It leverages modern C++20 features (such as POD structs as non-type template parameters, concepts, and designated initializers) to simplify kernel instantiation and improve developer experience.
This project is a prototype for a more general builder pattern for all of composable_kernel (CK) and CK Tile, but is currently limited to formalizing the interface between MIOpen and CK.
Design Direction
The builder's primary goal is transparent dispatch across two backend implementations: old CK (template-heavy device operations) and CK Tile (modern tile-based API). MIOpen, the consumer library, should construct kernels through the builder without needing to know which backend provides the implementation.
Current state: The builder dispatches correctly, but each kernel variant (forward XDL, forward WMMA, backward weight XDL V3, etc.) has its own factory and its own algorithm descriptor shape. The result is 16+ per-variant facades rather than one unified facade. Unification across three axes — CK vs CK Tile backend, MFMA vs WMMA instruction set, and forward vs backward direction — is the central design challenge.
Three principles guide the design toward that unification:
-
Unified vocabulary through reflection. The reflection system (
reflect/) extracts kernel traits from both backends into a commonConvTraitsrepresentation. This shared vocabulary is the mechanism for discovering what algorithm parameters are truly variant-specific versus what can be expressed once and mapped to multiple backends. -
Expert overrides. Power users can pin to a specific backend or device operation when needed, bypassing automatic dispatch.
-
Versioned API evolution. The builder uses semantic version strings (
"0.0.0","0.1.0") to manage API changes predictably. TheConvBuildertemplate defaults to the latest version but accepts explicit version pinning.
Design descriptions
- CK Builder design description
- CK Builder factory design
- CK Builder testing design
- CK Builder reflection design
Directory Structure
include/ck_tile/builder/Core builder headers and public API.include/ck_tile/builder/reflectReflection mechanism.include/ck_tile/builder/factoryCompile-time dispatch from builder descriptors to our existing specialized convolution kernel implementations.test/Unit tests and example usage of the builder pattern.CMakeLists.txtCMake configuration for building the experimental builder and its tests.
CMake Configuration
To enable the experimental builder, configure your build with:
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx942" \
-D CK_EXPERIMENTAL_BUILDER=ON \
-D CMAKE_CXX_STANDARD=20 \
-G Ninja \
..
Note: The tests for WMMA builders are only built when CK_USE_WMMA is enabled. Add e.g.
gfx1121 or any of the other gfx11/gfx12 architectures to the GPU targets. Alternatively,
one can add flag -D CK_USE_WMMA=ON to build the tests. For the end-to-end tests that use
the instances from builder, one needs an actual Navi card.
Building and Testing
The builder test suite is organized into two main categories:
Smoke Tests (Fast Unit Tests)
Quick unit tests that verify the builder's internal logic without compiling GPU kernels. These complete in under 1 second total and are suitable for frequent execution during development.
ninja smoke-builder
Regression Tests (Integration Tests)
Integration tests that compile actual GPU kernels to verify that the builder generates valid, compilable code. These are more expensive than smoke tests (can take minutes to compile) but cover more functionality.
ninja regression-builder
Running All Tests
To build and run the complete test suite:
ninja check-builder
Building Individual Tests
To build and run a specific test:
ninja test_ckb_conv_builder && bin/test_ckb_conv_builder
Test Organization
- Smoke tests: Fast feedback during active development
- Regression tests: Thorough validation before submitting changes
- Factory tests: Expensive tests that build all MIOpen kernels (included in regression tests)
When adding new tests, please follow the convention where the CMake build target starts with a prefix test_ckb. This allows filtering of CK Builder tests from the full CK repository test suite.