Files
composable_kernel/rocm_ck
John Shumway d65ad35b23 [rocm-libraries] ROCm/rocm-libraries#7180 (commit 54aed1e)
[CK] Add rocm_ck spec factories: GemmSpec, makeSpec() (#7180)

## What this PR does

This is the third PR in the rocm_ck schema stack:

1. **#7150** — Foundation types (DataType, Layout, Args, Ops)
2. **#7163** — Schema engine (Signature, resolve(), ArchProperties)
3. **#7180 (this)** — Spec factories (GemmSpec, makeSpec())

`makeSpec()` is the bridge between user intent and kernel instantiation.
It takes a **Signature** (WHAT to compute — operator graph, dtypes,
layouts) and a **GemmAlgorithm** (HOW to compute it — tile sizes,
pipeline, partitioning) and produces a validated `GemmSpec` — a
structural type ready to use as a non-type template parameter.

The key property: **every constraint is enforced at compile time.** An
invalid GEMM configuration is a compile error, not a runtime crash or
silent corruption. The 33 compile-fail tests are the executable
specification of what's allowed.

## What's interesting

**Physical tensor table.** Not every tensor in a compute graph needs
device memory. The intermediate result of `C = A * B` in a fused
GEMM+Add+ReLU lives only in registers. `makeSpec()` walks the operator
chain and determines which tensors are physical (need Args slots) and
which are intermediate. The output is a fixed-layout table: `[lhs, rhs,
output, D0?, D1?, scale?]`.

**Epilogue composition.** Instead of a combinatorial explosion of named
patterns (GemmAdd, GemmAddRelu, GemmMulSilu, ...), the epilogue is a
composable chain of ops. `{GemmOp, AddOp, ReluOp}` produces
`epilogue_ops = {Add, Relu}` with the bias tensor automatically slotted
as D0. Two consecutive AddOps fold into a single Add with two D tensors
via CK Tile's parameter pack.

**Signature/Algorithm split.** The same Signature can pair with multiple
GemmAlgorithms to produce different tuning variants without changing the
mathematical result. This is the foundation for the dispatcher — one
operation description, many tile configurations.

## New types

| Type | Role |
|------|------|
| `GemmSpec` | Validated NTTP kernel descriptor — physical tensors, tile
geometry, epilogue chain |
| `GemmAlgorithm` | User-facing tuning input — tile sizes, pipeline,
partitioning, padding |
| `EpilogueOp` | NTTP-compatible projection of the Op variant for
epilogue chains |
| `Dim3` | M x N x K triple for tile geometry |

## Test coverage

- **69 unit tests** — happy paths, layouts, dtypes, quantization,
epilogue chains, algorithm variants
- **33 compile-fail tests** — one per constraint (tile divisibility,
INT8 rules, pipeline restrictions, etc.)
- **6 schema compatibility baselines** — frozen specs that break if the
schema changes

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-26 12:07:31 +02:00
..

rocm_ck

A C++20 constexpr API for configuring and distributing CK Tile GPU kernels across multiple architectures.

Status: Early development. Foundation types (DataType, Layout, Args, operators, FixedString, PhysicalTensor, ResolvedTensor), schema engine (Signature, resolve(), ArchProperties), and spec factories (GemmSpec, GemmAlgorithm, makeSpec(), validate) are in place. The device bridge is under active development.

Why rocm_ck exists

CK Tile kernels are C++ templates. A GEMM kernel's tile size, pipeline strategy, data types, and epilogue are all template parameters — fixed at compile time. This is excellent for performance (zero-overhead abstraction, full inlining), but it creates a problem for multi-architecture distribution: the host program must be compiled separately from device code, and the host compiler must never see CK Tile headers.

rocm_ck solves this by introducing a host-device boundary built on constexpr data rather than template parameters:

  1. On the host side, kernel configurations are plain C++20 structs (Signature, Algorithm, GemmSpec). These are constexpr data — they describe what to compute and how, without instantiating any templates. Host code reasons about kernels using values, not types.

  2. On the device side, a thin bridge layer lowers these constexpr descriptions into CK Tile template instantiations. Each GemmSpec maps to exactly one ck_tile::GemmPipeline<...> specialization.

  3. At the boundary, pre-compiled kernels are packaged into kpack archives — self-describing, compressed, multi-architecture bundles. The host loads kernels at runtime by matching a GemmSpec against the kpack table of contents. No recompilation, no template instantiation on the host.

This separation is what makes CK Tile viable in TheRock's multi-arch build system, where a single host binary must work with device code compiled for many GPU targets (e.g. gfx90a, gfx942, gfx1151).

The constexpr schema model

Traditional GPU kernel libraries select kernels through template parameters or runtime enums. rocm_ck uses a third approach: constexpr structs that are validated at compile time and lowered to templates on the device side.

A kernel configuration has two axes:

  • Signaturewhat the kernel computes: a directed graph of operators (GemmOp, AddOp, ReluOp, ...) connecting named tensor slots. Data types, layouts, and batch dimensions are part of the signature.

  • Algorithmhow the kernel computes it: tile geometry, pipeline strategy, warp layout, padding, and scheduling. These are tuning parameters that don't change the mathematical result.

The Signature and Algorithm are plain aggregate structs with designated initializers — no constructors, no inheritance, no runtime polymorphism. Validation happens in consteval functions: invalid configurations (unsupported tile size, incompatible data types, missing tensor slots) fail at compile time with actionable error messages.

Here is a preview of the API direction (not yet implemented):

// Host side — pure constexpr, any C++20 compiler, no CK headers
constexpr Signature sig = {
    .dtype = DataType::FP16,
    .ops = {
        GemmOp{.lhs = "A", .rhs = "B", .out = "C"},
        AddOp{.lhs = "C", .rhs = "bias", .out = "D"},
        ReluOp{.in = "D", .out = "E"},
    },
};

// Device side — make_kernel lowers to a CK Tile template instantiation.
// Compiled separately per architecture, packaged into .kpack archives.

Directory layout

rocm_ck/
├── CMakeLists.txt        # INTERFACE library, C++20, ck_tile_headers target
├── include/rocm_ck/      # Public headers — host-safe, no CK/HIP deps
├── src/                  # (planned) Device bridge, kpack loading
└── tests/
    ├── CMakeLists.txt    # Test tiers: ROCM_CK_SMOKE, ROCM_CK_KERNEL
    ├── unit/             # Fast host-only tests (< 1s, no GPU)
    ├── compile_fail/     # Static assertion tests — verify invalid configs fail at compile time
    └── kernel/           # (planned) GPU kernel tests

Build

rocm_ck is a CK feature, gated by CK_ENABLE_ROCM_CK:

cd composablekernel
cmake -B build -S . -G Ninja \
    -DCK_ENABLE_ROCM_CK=ON \
    -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++

ninja -C build smoke-rocm-ck    # host-only smoke tests
ninja -C build check-rocm-ck    # all rocm_ck tests
ctest --test-dir build -L ROCM_CK_SMOKE --output-on-failure

Default CK builds (CK_ENABLE_ROCM_CK=OFF) are unaffected.