Files
composable_kernel/rocm_ck/README.md
John Shumway d65ad35b23 [rocm-libraries] ROCm/rocm-libraries#7180 (commit 54aed1e)
[CK] Add rocm_ck spec factories: GemmSpec, makeSpec() (#7180)

## What this PR does

This is the third PR in the rocm_ck schema stack:

1. **#7150** — Foundation types (DataType, Layout, Args, Ops)
2. **#7163** — Schema engine (Signature, resolve(), ArchProperties)
3. **#7180 (this)** — Spec factories (GemmSpec, makeSpec())

`makeSpec()` is the bridge between user intent and kernel instantiation.
It takes a **Signature** (WHAT to compute — operator graph, dtypes,
layouts) and a **GemmAlgorithm** (HOW to compute it — tile sizes,
pipeline, partitioning) and produces a validated `GemmSpec` — a
structural type ready to use as a non-type template parameter.

The key property: **every constraint is enforced at compile time.** An
invalid GEMM configuration is a compile error, not a runtime crash or
silent corruption. The 33 compile-fail tests are the executable
specification of what's allowed.

## What's interesting

**Physical tensor table.** Not every tensor in a compute graph needs
device memory. The intermediate result of `C = A * B` in a fused
GEMM+Add+ReLU lives only in registers. `makeSpec()` walks the operator
chain and determines which tensors are physical (need Args slots) and
which are intermediate. The output is a fixed-layout table: `[lhs, rhs,
output, D0?, D1?, scale?]`.

**Epilogue composition.** Instead of a combinatorial explosion of named
patterns (GemmAdd, GemmAddRelu, GemmMulSilu, ...), the epilogue is a
composable chain of ops. `{GemmOp, AddOp, ReluOp}` produces
`epilogue_ops = {Add, Relu}` with the bias tensor automatically slotted
as D0. Two consecutive AddOps fold into a single Add with two D tensors
via CK Tile's parameter pack.

**Signature/Algorithm split.** The same Signature can pair with multiple
GemmAlgorithms to produce different tuning variants without changing the
mathematical result. This is the foundation for the dispatcher — one
operation description, many tile configurations.

## New types

| Type | Role |
|------|------|
| `GemmSpec` | Validated NTTP kernel descriptor — physical tensors, tile
geometry, epilogue chain |
| `GemmAlgorithm` | User-facing tuning input — tile sizes, pipeline,
partitioning, padding |
| `EpilogueOp` | NTTP-compatible projection of the Op variant for
epilogue chains |
| `Dim3` | M x N x K triple for tile geometry |

## Test coverage

- **69 unit tests** — happy paths, layouts, dtypes, quantization,
epilogue chains, algorithm variants
- **33 compile-fail tests** — one per constraint (tile divisibility,
INT8 rules, pipeline restrictions, etc.)
- **6 schema compatibility baselines** — frozen specs that break if the
schema changes

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-26 12:07:31 +02:00

115 lines
4.7 KiB
Markdown

# rocm_ck
A C++20 constexpr API for configuring and distributing
[CK Tile](../include/ck_tile/) GPU kernels across multiple architectures.
> **Status**: Early development. Foundation types (DataType, Layout, Args,
> operators, FixedString, PhysicalTensor, ResolvedTensor), schema engine
> (Signature, resolve(), ArchProperties), and spec factories (GemmSpec,
> GemmAlgorithm, makeSpec(), validate) are in place. The device bridge
> is under active development.
## Why rocm_ck exists
CK Tile kernels are C++ templates. A GEMM kernel's tile size, pipeline
strategy, data types, and epilogue are all template parameters — fixed at
compile time. This is excellent for performance (zero-overhead abstraction,
full inlining), but it creates a problem for multi-architecture distribution:
the host program must be compiled separately from device code, and the host
compiler must never see CK Tile headers.
rocm_ck solves this by introducing a **host-device boundary** built on
constexpr data rather than template parameters:
1. **On the host side**, kernel configurations are plain C++20 structs
(`Signature`, `Algorithm`, `GemmSpec`). These are constexpr data —
they describe *what* to compute and *how*, without instantiating any
templates. Host code reasons about kernels using values, not types.
2. **On the device side**, a thin bridge layer lowers these constexpr
descriptions into CK Tile template instantiations. Each `GemmSpec`
maps to exactly one `ck_tile::GemmPipeline<...>` specialization.
3. **At the boundary**, pre-compiled kernels are packaged into
[kpack archives](https://github.com/ROCm/TheRock/blob/main/docs/rfcs/RFC0008-Multi-Arch-Packaging.md) —
self-describing, compressed, multi-architecture bundles. The host loads kernels at runtime
by matching a `GemmSpec` against the kpack table of contents. No
recompilation, no template instantiation on the host.
This separation is what makes CK Tile viable in
[TheRock](https://github.com/ROCm/TheRock)'s multi-arch build system,
where a single host binary must work with device code compiled for
many GPU targets (e.g. gfx90a, gfx942, gfx1151).
## The constexpr schema model
Traditional GPU kernel libraries select kernels through template
parameters or runtime enums. rocm_ck uses a third approach: **constexpr
structs that are validated at compile time and lowered to templates on
the device side.**
A kernel configuration has two axes:
- **Signature** — *what* the kernel computes: a directed graph of
operators (`GemmOp`, `AddOp`, `ReluOp`, ...) connecting named tensor
slots. Data types, layouts, and batch dimensions are part of the
signature.
- **Algorithm** — *how* the kernel computes it: tile geometry, pipeline
strategy, warp layout, padding, and scheduling. These are tuning
parameters that don't change the mathematical result.
The `Signature` and `Algorithm` are plain aggregate structs with
designated initializers — no constructors, no inheritance, no runtime
polymorphism. Validation happens in `consteval` functions: invalid
configurations (unsupported tile size, incompatible data types, missing
tensor slots) fail at compile time with actionable error messages.
Here is a preview of the API direction (not yet implemented):
```cpp
// Host side — pure constexpr, any C++20 compiler, no CK headers
constexpr Signature sig = {
.dtype = DataType::FP16,
.ops = {
GemmOp{.lhs = "A", .rhs = "B", .out = "C"},
AddOp{.lhs = "C", .rhs = "bias", .out = "D"},
ReluOp{.in = "D", .out = "E"},
},
};
// Device side — make_kernel lowers to a CK Tile template instantiation.
// Compiled separately per architecture, packaged into .kpack archives.
```
## Directory layout
```text
rocm_ck/
├── CMakeLists.txt # INTERFACE library, C++20, ck_tile_headers target
├── include/rocm_ck/ # Public headers — host-safe, no CK/HIP deps
├── src/ # (planned) Device bridge, kpack loading
└── tests/
├── CMakeLists.txt # Test tiers: ROCM_CK_SMOKE, ROCM_CK_KERNEL
├── unit/ # Fast host-only tests (< 1s, no GPU)
├── compile_fail/ # Static assertion tests — verify invalid configs fail at compile time
└── kernel/ # (planned) GPU kernel tests
```
## Build
rocm_ck is a CK feature, gated by `CK_ENABLE_ROCM_CK`:
```bash
cd composablekernel
cmake -B build -S . -G Ninja \
-DCK_ENABLE_ROCM_CK=ON \
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
ninja -C build smoke-rocm-ck # host-only smoke tests
ninja -C build check-rocm-ck # all rocm_ck tests
ctest --test-dir build -L ROCM_CK_SMOKE --output-on-failure
```
Default CK builds (`CK_ENABLE_ROCM_CK=OFF`) are unaffected.