composable_kernel/experimental/builder/include/ck_tile/builder/reflect
Johannes Graner 58475d3f45 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.
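
The distinction can be sketched host-side. Below is a minimal, hedged sketch assuming a flat iteration space; `streamk_iter_range` is an illustrative helper, not the actual CK Tile partitioner API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Illustrative only: StreamK treats the problem as one flat sequence of
// K-iterations over all output tiles and hands each workgroup a contiguous
// [begin, end) slice, so no workgroup processes more than
// ceil(total_iters / num_wgs) iterations regardless of the tile count.
std::pair<uint32_t, uint32_t> streamk_iter_range(uint32_t wg_id,
                                                 uint32_t num_wgs,
                                                 uint32_t total_iters)
{
    const uint32_t base  = total_iters / num_wgs;  // guaranteed share per workgroup
    const uint32_t extra = total_iters % num_wgs;  // first `extra` workgroups take one more
    const uint32_t begin = wg_id * base + std::min(wg_id, extra);
    const uint32_t count = base + (wg_id < extra ? 1u : 0u);
    return {begin, begin + count};
}
```

For example, 5 tiles with 7 K-iterations each (35 total) spread over 8 workgroups yields slices of 4 or 5 iterations each, whereas split-K with a fixed `k_batch` leaves some workgroups idle when the tile count does not fill the GPU.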

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: the tile's starting CTA sequentially accumulates partials from
all contributing CTAs (O(n) depth)
- **Tree**: pairwise binary-tree reduction (O(log n) depth, faster with many
contributors)
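
The two shapes can be illustrated with scalar partials standing in for partial C-tiles. This is a hedged sketch of the reduction orders only; the kernels themselves reduce through workspace memory with flag-based cross-CTA synchronization:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Linear: the tile-starting CTA walks contributors one by one -> O(n) steps.
float reduce_linear(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for (float p : partials)
        acc += p;  // one dependent step per contributor
    return acc;
}

// Tree: contributors pair up each round, halving the count -> O(log n) depth.
float reduce_tree(std::vector<float> partials)
{
    while (partials.size() > 1) {
        std::vector<float> next;
        for (std::size_t i = 0; i + 1 < partials.size(); i += 2)
            next.push_back(partials[i] + partials[i + 1]);  // pairwise combine
        if (partials.size() % 2 == 1)
            next.push_back(partials.back());  // odd element carries over
        partials = std::move(next);
    }
    return partials.empty() ? 0.0f : partials.front();
}
```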

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | Split-K (16x64) | StreamK (16x64) | Split-K (128x128) | StreamK (128x128) |
|---|---|---|---|---|
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 09:18:14 +00:00

Convolution Reflection Directory

This directory contains tools for "reflecting" on convolution kernel instances. It allows developers to inspect the compile-time configuration of a kernel and generate detailed, human-readable descriptions.

See the main builder documentation for an overview.

Design Overview

The reflection system works by extracting properties from a convolution kernel type and formatting them into a string. This is useful for debugging, performance tuning, and generating documentation.

  1. Trait Extraction: The ConvTraits template (in conv_traits.hpp) is specialized for each kernel instance. It extracts low-level details such as tile sizes, data layouts, and pipeline versions from the kernel's type definition. The template is shared across XDL and WMMA kernels, for both forward and backward weight convolution; std::optional marks parameters that only some kernels define.

  2. Description Generation: The describe<Instance>() function (in conv_description.hpp) uses ConvTraits to populate a ConvDescription object (an implementation of Description).

  3. Formatting: The ConvDescription class (which implements Description) contains methods like brief() and detailed() that format the extracted properties into well-structured strings for display.
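
The three steps can be sketched in miniature. Every name below (DummyInstance, ConvTraitsSketch, ConvDescriptionSketch, describe_sketch) is a simplified stand-in, not the real API; the actual ConvTraits and ConvDescription carry many more properties:

```cpp
#include <cassert>
#include <optional>
#include <string>

struct DummyInstance {  // stands in for a kernel instance type
    static constexpr int kMPerBlock = 128;
    static constexpr int kNPerBlock = 64;
};

struct ConvTraitsSketch {        // step 1: extracted compile-time properties
    int m_per_block;
    int n_per_block;
    std::optional<int> k_batch;  // only present for some kernels
};

struct ConvDescriptionSketch {   // steps 2-3: populated object with formatters
    ConvTraitsSketch traits;
    std::string brief() const {
        return "tile " + std::to_string(traits.m_per_block) + "x" +
               std::to_string(traits.n_per_block);
    }
    std::string detailed() const {
        std::string s = brief();
        if (traits.k_batch)
            s += ", k_batch=" + std::to_string(*traits.k_batch);
        return s;
    }
};

template <typename Instance>
ConvDescriptionSketch describe_sketch()
{
    // The real describe<Instance>() reads these via ConvTraits<Instance>.
    return {ConvTraitsSketch{Instance::kMPerBlock, Instance::kNPerBlock, std::nullopt}};
}
```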

Key Files

  • description.hpp: The generalized Description base class with no implementation.

  • conv_description.hpp: The main entry point. Contains the ConvDescription struct and the describe() factory function.

  • conv_traits.hpp: Home of the ConvTraits template, which is the core of the property extraction mechanism.

  • tree_formatter.hpp: A tree-building utility that generates indented, tree-like output for the detailed() description.
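
As a rough illustration of the indented output such a utility produces, here is a hedged sketch; the Node type and format_tree name are hypothetical and do not reflect the actual tree_formatter.hpp interface:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node type for an indented property tree.
struct Node {
    std::string label;
    std::vector<Node> children;
};

// Recursively emit each label indented two spaces per nesting level.
std::string format_tree(const Node& n, int depth = 0)
{
    std::string out(static_cast<std::size_t>(2 * depth), ' ');
    out += n.label + "\n";
    for (const auto& child : n.children)
        out += format_tree(child, depth + 1);
    return out;
}
```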

Usage

To get a description of a convolution kernel instance, use the describe function and call one of its formatting methods:

#include "ck_tile/builder/reflect/conv_description.hpp"

// Assume MyConvFwdInstance is a type alias for a specific kernel instance
using MyConvFwdInstance = /* ... some kernel type ... */;

// Describe the instance
const auto description = ck_tile::reflect::conv::describe<MyConvFwdInstance>();

// Print the detailed description
std::cout << description.detailed() << std::endl;

Appendix: Current Limitations

Supported Instance Types

The reflection system (ckr::describe) currently supports the following convolution instance types:

  • Standard XDL Forward Convolution (DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle)
  • Large Tensor XDL Forward Convolution (DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor)
  • V3 XDL Forward Convolution (DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3)
  • WMMA Forward Convolution (DeviceGroupedConvFwdMultipleD_Wmma_CShuffle)
  • XDL Backward Weight Convolution (DeviceGroupedConvBwdWeight_Xdl_CShuffle)
  • V3 XDL Backward Weight Convolution (DeviceGroupedConvBwdWeight_Xdl_CShuffleV3)
  • XDL Multiple D Backward Weight Convolution (DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle)
  • Two Stage XDL Backward Weight Convolution (DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle)
  • V3 Two Stage WMMA Backward Weight Convolution (DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3)
  • Wmma Backward Weight Convolution (DeviceGroupedConvBwdWeight_Wmma_CShuffle)
  • V3 Wmma Backward Weight Convolution (DeviceGroupedConvBwdWeight_Wmma_CShuffleV3)
  • V3 Wmma Multiple D Backward Weight Convolution (DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3)

These variants all share similar template parameter structures and are compatible with the current ConvTraits implementation.

CK Tile Instance Types

The reflection system also provides InstanceTraits specializations for CK Tile kernel instances:

  • Tile Forward Convolution (GroupedConvolutionForwardKernel)
  • Tile Backward Weight Convolution (GroupedConvolutionBackwardWeightKernel)
  • Tile Backward Data Convolution (GroupedConvolutionBackwardDataKernel)
  • Reference Convolution (reference implementation)

Unsupported Instance Types

  • DL (non-XDLops) Forward (DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK) has InstanceTraits but uses a different internal parameter structure (K0PerBlock, K1, M1PerThread instead of the standard block/warp parameters). It can use GetInstanceString() through a base-class pointer but cannot use describe().

Reflection Coverage: ConvTraits Bridge

The reflection system operates at two levels:

  1. InstanceTraits (compile-time): Extracts raw template parameters from a kernel type. Specializations exist for both old CK and CK Tile instances.

  2. ConvTraits (runtime): A unified, type-erased data structure representing kernel configuration in convolution-specific terms. Populated by instance_to_conv_traits<Instance>() specializations.

ConvTraits captures the common ground shared by both backends: spatial dimensions, tensor layouts, data types, elementwise operations, tile dimensions, pipeline version/scheduler, and memory access patterns. Within old CK, ConvTraits already unifies across the MFMA/WMMA instruction set boundary — XDL and WMMA forward instances both produce the same ConvTraits representation, demonstrating that instruction-set differences can be abstracted at this level.

Currently, instance_to_conv_traits() specializations exist only for old CK instances (forward XDL, XDL V3, WMMA, large tensor, and 8 backward weight variants). CK Tile instances have InstanceTraits but lack instance_to_conv_traits() specializations — there is no bridge from CK Tile's InstanceTraits to the unified ConvTraits representation.

This is the critical gap in the reflection system. Today the builder has 16+ per-variant factories, each with its own algorithm descriptor shape. ConvTraits is the mechanism for discovering which parameters are genuinely variant-specific versus which can be expressed in a single unified algorithm descriptor. Closing the CK Tile bridge means writing instance_to_conv_traits() specializations for the CK Tile kernel types that map their InstanceTraits fields to the ConvTraits struct. Once this bridge exists, both backends produce the same ConvTraits output — making it possible to define a single algorithm descriptor format that the dispatcher decomposes into variant-specific parameters internally.
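
What closing the bridge would look like can be sketched in deliberately simplified form. Every name below (CkTileFwdStub, InstanceTraitsSketch, ConvTraitsSketch, instance_to_conv_traits_sketch) is an illustrative stand-in; the real InstanceTraits and ConvTraits carry far more fields:

```cpp
#include <cassert>
#include <string>

struct CkTileFwdStub {};  // stands in for GroupedConvolutionForwardKernel

template <typename Instance>
struct InstanceTraitsSketch;  // primary template, specialized per kernel type

template <>
struct InstanceTraitsSketch<CkTileFwdStub> {  // raw template parameters
    static constexpr int m_per_block = 256;
    static constexpr int n_per_block = 128;
};

struct ConvTraitsSketch {  // unified, type-erased representation
    int tile_m;
    int tile_n;
    std::string backend;
};

// The bridge: map InstanceTraits fields onto the shared ConvTraits shape, so
// CK Tile instances yield the same representation as old CK instances.
template <typename Instance>
ConvTraitsSketch instance_to_conv_traits_sketch()
{
    using T = InstanceTraitsSketch<Instance>;
    return {T::m_per_block, T::n_per_block, "ck_tile"};
}
```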

Future Work

The priorities for the reflection system are:

  1. CK Tile ConvTraits bridge. Write instance_to_conv_traits() specializations for GroupedConvolutionForwardKernel, GroupedConvolutionBackwardWeightKernel, and GroupedConvolutionBackwardDataKernel. This is the prerequisite for unified algorithm descriptors.

  2. DL variant support. The DL forward kernel needs a specialized ConvTraits mapping due to its different internal parameter structure.

  3. Generalization beyond convolution. ConvTraits is designed to evolve toward a more general KernelTraits covering GEMM, flash attention, and other operations.