mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Christopher Millette 144854dba1 [rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650)

[CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)

## Problem

Each `static_for<0, N, 1>` instantiates its lambda N times (one per
`number<I>` type). When nested, intermediate lambdas capture the outer
loop variable (a different type per iteration), creating unique closure
types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32
= 52 IR functions, of which 20 are intermediate closures that get
inlined away but still cost frontend compile time.
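
The counts above can be reproduced with a small worked calculation. This is only an illustrative sketch: `instantiations_per_level` and `sum_all` are hypothetical helpers written for this example, not part of ck_tile. The sketch assumes that the lambda at nesting level k is instantiated once for every combination of the loop indices at levels 0 through k.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: number of lambda instantiations at each nesting level.
// Level k is instantiated once per combination of the indices of levels 0..k,
// i.e. the running product of the extents.
template <std::size_t Depth>
constexpr std::array<std::size_t, Depth>
instantiations_per_level(const std::array<std::size_t, Depth>& extents)
{
    std::array<std::size_t, Depth> counts{};
    std::size_t running = 1;
    for(std::size_t level = 0; level < Depth; ++level)
    {
        running *= extents[level];
        counts[level] = running;
    }
    return counts;
}

// Hypothetical helper: sum of all per-level counts.
template <std::size_t Depth>
constexpr std::size_t sum_all(const std::array<std::size_t, Depth>& counts)
{
    std::size_t total = 0;
    for(std::size_t i = 0; i < Depth; ++i)
        total += counts[i];
    return total;
}

// M = 4, N = 4, K = 2 as in the text.
constexpr std::array<std::size_t, 3> nest_extents{4, 4, 2};
constexpr auto nest_counts = instantiations_per_level(nest_extents);

static_assert(nest_counts[0] == 4);   // outer lambda
static_assert(nest_counts[1] == 16);  // middle lambda
static_assert(nest_counts[2] == 32);  // innermost body
static_assert(sum_all(nest_counts) == 52);
// Everything except the innermost level is an intermediate closure.
static_assert(sum_all(nest_counts) - nest_counts[2] == 20);
```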

ck_tile's `static_ford` was supposed to eliminate these intermediates
(as old CK's PR #5031 did successfully), but it used a **recursive**
`static_ford_impl` that recreated the same closure pattern plus added
`reorder_old_to_new`/`reorder_new_to_old` overhead.

Additionally, the sequence utility layer (`sequence_sort`,
`is_valid_sequence_map`) used recursive template metaprogramming that
generated O(N log N) intermediate types for every permutation validation
— called on every `reorder_new_to_old`/`reorder_old_to_new` invocation.

## Changes

### 1. Replace `sequence_sort` with constexpr insertion sort
Replace recursive merge sort (`sequence_sort_impl` +
`sorted_sequence_merge_impl`, O(N log N) intermediate type
instantiations) with constexpr insertion sort using `static_array`. O(1)
template depth, same `::type` and `::sorted2unsorted_map` API.
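
A minimal sketch of the idea, not the actual ck_tile code: it uses `std::array` as a stand-in for `static_array`, and the names `sort_result` and `insertion_sort` are made up for this example. The whole sort runs inside one constexpr function, so no per-step helper types are instantiated.

```cpp
#include <array>
#include <cstddef>

// Hypothetical result type: sorted values plus, for each sorted slot,
// the position the value had in the unsorted input.
template <std::size_t N>
struct sort_result
{
    std::array<int, N> values;
    std::array<std::size_t, N> sorted2unsorted;
};

// Stable constexpr insertion sort (ascending). The loops are ordinary
// constexpr evaluation steps rather than recursive template instantiations.
template <std::size_t N>
constexpr sort_result<N> insertion_sort(const std::array<int, N>& input)
{
    sort_result<N> r{input, {}};
    for(std::size_t i = 0; i < N; ++i)
        r.sorted2unsorted[i] = i;

    for(std::size_t i = 1; i < N; ++i)
    {
        const int v           = r.values[i];
        const std::size_t src = r.sorted2unsorted[i];
        std::size_t j         = i;
        // Shift larger elements one slot to the right.
        while(j > 0 && r.values[j - 1] > v)
        {
            r.values[j]          = r.values[j - 1];
            r.sorted2unsorted[j] = r.sorted2unsorted[j - 1];
            --j;
        }
        r.values[j]          = v;
        r.sorted2unsorted[j] = src;
    }
    return r;
}

constexpr auto sorted_demo = insertion_sort(std::array<int, 4>{30, 10, 40, 20});
static_assert(sorted_demo.values[0] == 10 && sorted_demo.values[3] == 40);
static_assert(sorted_demo.sorted2unsorted[0] == 1); // 10 came from input slot 1
```

In a real sequence utility the resulting arrays would then be converted back into a `sequence<...>` type, which is how the `::type` and `::sorted2unsorted_map` members could stay unchanged.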

### 2. Replace `is_valid_sequence_map` with constexpr check
Replace sort-based permutation validation (which instantiated the full
`sequence_sort` chain) with a constexpr "seen array" loop. O(N)
constexpr steps instead of O(N log N) template instantiations.
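
The "seen array" check can be sketched as follows. This is an illustration under assumptions: `is_valid_permutation` is a hypothetical name, and the real check operates on ck_tile sequence types rather than `std::array`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: a map of N entries is a valid permutation if every
// value lies in [0, N) and no value appears twice. One pass, O(N) steps.
template <std::size_t N>
constexpr bool is_valid_permutation(const std::array<int, N>& map)
{
    std::array<bool, N> seen{}; // all false initially
    for(std::size_t i = 0; i < N; ++i)
    {
        if(map[i] < 0)
            return false; // negative index
        const std::size_t v = static_cast<std::size_t>(map[i]);
        if(v >= N)
            return false; // out of range
        if(seen[v])
            return false; // duplicate
        seen[v] = true;
    }
    return true;
}

static_assert(is_valid_permutation(std::array<int, 3>{2, 0, 1}));
static_assert(!is_valid_permutation(std::array<int, 3>{0, 0, 1})); // duplicate
static_assert(!is_valid_permutation(std::array<int, 3>{0, 1, 3})); // out of range
```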

### 3. Replace recursive `static_ford` with flat-loop `index_decomposer`
Replace `static_ford_impl` (recursive `static_for` nesting +
`pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat
`index_decomposer` using pre-computed strides. Add `decompose_reordered`
alias that folds reordering into decomposition, and `inverse_perm`
helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map`
→ `sequence_sort` chain.
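
A rough sketch of flat decomposition with pre-computed strides follows. The function names mirror the terms used above, but the signatures, the use of `std::array`, and the ordering convention are assumptions made for this example: `order[k]` is taken to name the dimension iterated at loop level k (outermost first), and the returned multi-index is in the original dimension order.

```cpp
#include <array>
#include <cstddef>

// Stride of each loop level in the flat index space (innermost stride is 1).
template <std::size_t D>
constexpr std::array<std::size_t, D>
loop_strides(const std::array<std::size_t, D>& lengths,
             const std::array<std::size_t, D>& order)
{
    std::array<std::size_t, D> strides{};
    std::size_t running = 1;
    for(std::size_t k = D; k-- > 0;)
    {
        strides[k] = running;
        running *= lengths[order[k]];
    }
    return strides;
}

// Map a flat index to a multi-index, folding the reordering into the
// decomposition so that no separate reorder step is needed afterwards.
template <std::size_t D>
constexpr std::array<std::size_t, D>
decompose_reordered(std::size_t linear,
                    const std::array<std::size_t, D>& lengths,
                    const std::array<std::size_t, D>& order)
{
    const auto strides = loop_strides(lengths, order);
    std::array<std::size_t, D> idx{};
    for(std::size_t k = 0; k < D; ++k)
    {
        const std::size_t dim = order[k];
        idx[dim]              = (linear / strides[k]) % lengths[dim];
    }
    return idx;
}

// Inverse permutation by direct scatter. Assumes perm is already valid,
// so no sort-based validation is triggered.
template <std::size_t D>
constexpr std::array<std::size_t, D>
inverse_perm(const std::array<std::size_t, D>& perm)
{
    std::array<std::size_t, D> inv{};
    for(std::size_t i = 0; i < D; ++i)
        inv[perm[i]] = i;
    return inv;
}

// 2 x 3 space, dimension 1 outermost: flat index 4 maps to (0, 2).
constexpr std::array<std::size_t, 2> demo_lengths{2, 3};
constexpr std::array<std::size_t, 2> demo_order{1, 0};
constexpr auto demo_idx = decompose_reordered(4, demo_lengths, demo_order);
static_assert(demo_idx[0] == 0 && demo_idx[1] == 2);
```

In a type-level implementation the strides would be computed once per `static_ford` instantiation instead of once per call, which is what "pre-computed" refers to here.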

### 4. Eliminate internal lambda via `ford_applier`
The flat-loop approach still used `static_for` with a lambda, creating
M×N internal lambda instantiations per call site. Replace with
`ford_applier` struct that calls `f(decompose<I>{})` directly via fold
expression — zero intermediate closures:

```cpp
// Before: 2×M×N function instantiations
static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); });

// After: M×N function instantiations (50% reduction)
ford_applier<Decomposer, make_index_sequence<M*N>>{}(f);
```

Also unified identity and non-identity order paths into a single
template with `constexpr if`.
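
For readers who want to see the fold-expression pattern in isolation, here is a self-contained sketch. It is not the ck_tile implementation: `decomposer_2d` and its nested `point` template are hypothetical stand-ins for the real decomposer, and `std::index_sequence` replaces the library's own sequence type.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical decomposer: maps a flat index I to a (row, col) pair of an
// M x N space, exposed as compile-time constants.
template <std::size_t M, std::size_t N>
struct decomposer_2d
{
    template <std::size_t I>
    struct point
    {
        static_assert(I < M * N, "flat index out of range");
        static constexpr std::size_t row = I / N;
        static constexpr std::size_t col = I % N;
    };
};

template <typename Decomposer, typename Indices>
struct ford_applier;

template <typename Decomposer, std::size_t... Is>
struct ford_applier<Decomposer, std::index_sequence<Is...>>
{
    template <typename F>
    constexpr void operator()(F&& f) const
    {
        // Comma fold: f is called once per flat index, in increasing order,
        // without an intermediate wrapper lambda.
        (f(typename Decomposer::template point<Is>{}), ...);
    }
};

// Records the visiting order as decimal digits: 0,1,2,3,4,5 -> 12345.
constexpr std::size_t visit_order_digits()
{
    std::size_t acc = 0;
    ford_applier<decomposer_2d<2, 3>, std::make_index_sequence<6>>{}([&acc](auto p) {
        using P = decltype(p);
        acc     = acc * 10 + (P::row * 3 + P::col);
    });
    return acc;
}

static_assert(visit_order_digits() == 12345);
```

The user functor is still instantiated once per point, since each `point<I>` is a distinct type. What the sketch removes is the extra layer of per-index wrapper closures.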

### 5. Fix const-qualified sequence handling
Fix `is_valid_sequence_map` to handle const-qualified sequence types via
`remove_cvref_t` in callers (`tensor_adaptor.hpp`,
`tile_distribution_encoding.hpp`).

## Results (this PR only, without flattening)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5)

| Target | Base (s) | Treat (s) | Delta | Delta (%) | Wins | Significant? |
|--------|----------|-----------|-------|-----------|------|--------------|
| **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES** (W+=1, p<0.05) |
| universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4) |

Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5]
Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6]
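
The W+ values in the table can be recomputed from the per-trial diffs. The sketch below is an illustration only and says nothing about how the PR's numbers were actually produced. `twice_w_plus` is a hypothetical helper that returns 2×W+ so that average ranks for tied magnitudes stay in integer arithmetic. Zero differences are dropped, which is one common convention.

```cpp
#include <array>
#include <cstddef>

constexpr int abs_int(int v) { return v < 0 ? -v : v; }

// Hypothetical helper: twice the Wilcoxon signed-rank statistic W+,
// i.e. twice the sum of the ranks of |d| over the positive differences.
template <std::size_t N>
constexpr int twice_w_plus(const std::array<int, N>& diffs)
{
    int twice_w = 0;
    for(std::size_t i = 0; i < N; ++i)
    {
        if(diffs[i] <= 0)
            continue; // only positive differences contribute to W+
        int smaller = 0;
        int equal   = 0; // counts diffs[i] itself as well
        for(std::size_t j = 0; j < N; ++j)
        {
            if(diffs[j] == 0)
                continue; // zero differences are not ranked
            if(abs_int(diffs[j]) < abs_int(diffs[i]))
                ++smaller;
            else if(abs_int(diffs[j]) == abs_int(diffs[i]))
                ++equal;
        }
        // Average rank = smaller + (equal + 1) / 2, so twice that is:
        twice_w += 2 * smaller + equal + 1;
    }
    return twice_w;
}

constexpr std::array<int, 7> flatmm_diffs{-6, -20, -9, -8, -8, 4, -5};
constexpr std::array<int, 7> universal_gemm_diffs{-2, -6, 4, -3, -2, -11, -6};

static_assert(twice_w_plus(flatmm_diffs) == 2);         // W+ = 1
static_assert(twice_w_plus(universal_gemm_diffs) == 8); // W+ = 4
```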

### IR Function Counts (device trace, gfx942)

| Target | Metric | Before | After | Delta | Delta (%) |
|--------|--------|--------|-------|-------|-----------|
| **universal_gemm** | InstantiateFunction | 117,715 | 109,165 | **-8,550** | **-7.3%** |
| **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** | **-6.0%** |
| **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** | **-5.8%** |
| **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** | **-5.4%** |

Note: The `ford_applier` (commit 3) has minimal additional effect in
this PR since ck_tile code does not yet use `static_ford` extensively.
Its impact compounds when the follow-up flattening PR #5939 converts 124
`static_for` nests to `static_ford`. Combined results with #5939: flatmm
**-7.5%** wall time (p<0.01), CodeGen **-10.5%**.

### ASM Equivalence
7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 +
gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] `test_ck_tile_static_ford`: 13 behavioral tests
(identity/non-identity orders, 1D-4D, unit dimensions, edge cases)
- [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map,
is_valid_sequence_map edge cases, sequence_unique_sort map round-trip)
- [x] ASM equivalence verified (980K lines)
- [x] Wilcoxon timing verified (7 trials, flatmm p<0.05)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

2026-04-02 21:25:56 +00:00

add_rmsnorm2d_rdquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

atomic_add_op

Shuffle fix for gfx950 (#3491 )

2026-01-13 09:21:29 -08:00

batched_gemm

[rocm-libraries] ROCm/rocm-libraries#4335 (commit 06976b3)

2026-02-07 00:15:34 +00:00

batched_transpose

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

container

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

core

[rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650)

2026-04-02 21:25:56 +00:00

data_type

[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)

2026-03-19 09:19:06 +00:00

elementwise

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

epilogue

[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)

2026-03-19 09:19:06 +00:00

flatmm

[rocm-libraries] ROCm/rocm-libraries#5082 (commit 9313659)

2026-03-11 22:47:59 +00:00

fmha

[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7)

2026-04-02 03:17:45 +00:00

gemm

[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6)

2026-03-26 01:41:35 +00:00

gemm_block_scale

[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a)

2026-03-16 08:31:56 +00:00

gemm_multi_abd

[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 )

2026-01-04 03:28:14 -08:00

gemm_multi_d

[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 )

2026-01-04 03:28:14 -08:00

gemm_mx

[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)

2026-03-27 20:37:23 +00:00

gemm_persistent_async_input

Add persistent async input scheduler for GEMM kernels (#3520 )

2026-01-20 10:37:09 -08:00

gemm_streamk

[rocm-libraries] ROCm/rocm-libraries#5722 (commit 55febd2)

2026-04-02 21:07:13 +00:00

gemm_tile_engine

[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)

2026-04-01 07:32:36 +00:00

gemm_weight_preshuffle

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

grouped_conv

[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)

2026-03-27 09:18:14 +00:00

grouped_gemm

[rocm-libraries] ROCm/rocm-libraries#5050 (commit 033dad7)

2026-03-12 13:29:14 +00:00

grouped_gemm_abquant

[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 )

2026-01-28 19:45:09 -08:00

grouped_gemm_multi_d

[rocm-libraries] ROCm/rocm-libraries#4335 (commit 06976b3)

2026-02-07 00:15:34 +00:00

grouped_gemm_preshuffle

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

grouped_gemm_quant

[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)

2026-03-03 15:40:50 +00:00

image_to_column

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

layernorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

memory_copy

Mx fp6 flatmm (#3601 )

2026-02-02 16:04:40 +08:00

moe_smoothquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

moe_sorting

Update unsigned long literals and format specifiers to work correctly in Windows (#3483 )

2026-01-02 22:16:41 -07:00

permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

pooling

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

pooling_tile_engine

[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)

2026-04-01 07:32:36 +00:00

reduce

[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)

2026-03-03 15:40:50 +00:00

rmsnorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

slice_tile

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

smoothquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

topk_softmax

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

utility

[rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650)

2026-04-02 21:25:56 +00:00

warp_gemm

chore: update copyright header for misc files (#3402 )

2025-12-11 08:25:29 -08:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#5722 (commit 55febd2)

2026-04-02 21:07:13 +00:00

README.md

[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)

2026-03-03 15:40:50 +00:00

README.md

CK Tile Testing Guide

This document describes the test organization and available test targets for CK Tile operations.

Overview

CK Tile tests are organized with multiple levels of granularity to support different development workflows:

1. Global test labels - Run tests across all operations
2. Operation-specific umbrella targets - Run all tests for a specific operation
3. Individual test executables - Run specific tests

Global Test Labels

These targets run tests across all CK operations (not just CK Tile):

`ninja smoke`

Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).

ninja smoke

`ninja regression`

Run slower, more comprehensive regression tests.

ninja regression

`ninja check`

Run ALL available tests in the entire codebase.

ninja check

Operation-Specific Umbrella Targets

These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.

GEMM Operations

`ck_tile_gemm_tests`

Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)

ninja ck_tile_gemm_tests

Test executables included:

- `test_ck_tile_gemm_pipeline_mem`
- `test_ck_tile_gemm_pipeline_compv3`
- `test_ck_tile_gemm_pipeline_compv4`
- `test_ck_tile_gemm_pipeline_persistent`
- `test_ck_tile_gemm_pipeline_compv6`
- `test_ck_tile_gemm_pipeline_comp_async` (gfx95 only)
- `test_ck_tile_gemm_pipeline_*_wmma` variants (gfx11/gfx12 only)

`ck_tile_gemm_block_scale_tests`

Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)

ninja ck_tile_gemm_block_scale_tests

Test executables included: 29 test executables covering:

- AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
- ABQuant tests (base, padding, preshuffle)
- BQuant tests (1D/2D variants, transpose)
- BQuant with PreshuffleB (decode/prefill, 1D/2D)
- BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
- RowColQuant and TensorQuant tests

`ck_tile_gemm_streamk_tests`

Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)

ninja ck_tile_gemm_streamk_tests

Test executables included:

- `test_ck_tile_streamk_tile_partitioner`
- `test_ck_tile_streamk_reduction`
- `test_ck_tile_streamk_smoke`
- `test_ck_tile_streamk_extended`

`ck_tile_grouped_gemm_quant_tests`

Run all grouped GEMM quantization tests

ninja ck_tile_grouped_gemm_quant_tests

Test executables included:

- `test_ck_tile_grouped_gemm_quant_rowcol`
- `test_ck_tile_grouped_gemm_quant_tensor`
- `test_ck_tile_grouped_gemm_quant_aquant`
- `test_ck_tile_grouped_gemm_quant_bquant`
- `test_ck_tile_grouped_gemm_quant_bquant_preshuffleb`

Other Operations

`ck_tile_fmha_tests`

Run all FMHA (Flash Multi-Head Attention) tests

ninja ck_tile_fmha_tests

Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32

`ck_tile_reduce_tests`

Run all reduce operation tests

ninja ck_tile_reduce_tests

Test executables included:

- `test_ck_tile_reduce2d`
- `test_ck_tile_multi_reduce2d_threadwise`
- `test_ck_tile_multi_reduce2d_multiblock`

Individual Test Executables

You can also build and run individual test executables:

Build a specific test

ninja test_ck_tile_gemm_pipeline_mem

Run a specific test directly

./build/bin/test_ck_tile_gemm_pipeline_mem

Run a specific test through ctest

ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure