mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-28 10:47:00 +00:00
[CK] Add rocm_ck spec factories: GemmSpec, makeSpec() (#7180) ## What this PR does This is the third PR in the rocm_ck schema stack: 1. **#7150** — Foundation types (DataType, Layout, Args, Ops) 2. **#7163** — Schema engine (Signature, resolve(), ArchProperties) 3. **#7180 (this)** — Spec factories (GemmSpec, makeSpec()) `makeSpec()` is the bridge between user intent and kernel instantiation. It takes a **Signature** (WHAT to compute — operator graph, dtypes, layouts) and a **GemmAlgorithm** (HOW to compute it — tile sizes, pipeline, partitioning) and produces a validated `GemmSpec` — a structural type ready to use as a non-type template parameter. The key property: **every constraint is enforced at compile time.** An invalid GEMM configuration is a compile error, not a runtime crash or silent corruption. The 33 compile-fail tests are the executable specification of what's allowed. ## What's interesting **Physical tensor table.** Not every tensor in a compute graph needs device memory. The intermediate result of `C = A * B` in a fused GEMM+Add+ReLU lives only in registers. `makeSpec()` walks the operator chain and determines which tensors are physical (need Args slots) and which are intermediate. The output is a fixed-layout table: `[lhs, rhs, output, D0?, D1?, scale?]`. **Epilogue composition.** Instead of a combinatorial explosion of named patterns (GemmAdd, GemmAddRelu, GemmMulSilu, ...), the epilogue is a composable chain of ops. `{GemmOp, AddOp, ReluOp}` produces `epilogue_ops = {Add, Relu}` with the bias tensor automatically slotted as D0. Two consecutive AddOps fold into a single Add with two D tensors via CK Tile's parameter pack. **Signature/Algorithm split.** The same Signature can pair with multiple GemmAlgorithms to produce different tuning variants without changing the mathematical result. This is the foundation for the dispatcher — one operation description, many tile configurations. ## New types | Type | Role | |------|------| | `GemmSpec` | Validated NTTP kernel descriptor — physical tensors, tile geometry, epilogue chain | | `GemmAlgorithm` | User-facing tuning input — tile sizes, pipeline, partitioning, padding | | `EpilogueOp` | NTTP-compatible projection of the Op variant for epilogue chains | | `Dim3` | M x N x K triple for tile geometry | ## Test coverage - **69 unit tests** — happy paths, layouts, dtypes, quantization, epilogue chains, algorithm variants - **33 compile-fail tests** — one per constraint (tile divisibility, INT8 rules, pipeline restrictions, etc.) - **6 schema compatibility baselines** — frozen specs that break if the schema changes --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
115 lines
4.7 KiB
Markdown
115 lines
4.7 KiB
Markdown
# rocm_ck
|
|
|
|
A C++20 constexpr API for configuring and distributing
|
|
[CK Tile](../include/ck_tile/) GPU kernels across multiple architectures.
|
|
|
|
> **Status**: Early development. Foundation types (DataType, Layout, Args,
|
|
> operators, FixedString, PhysicalTensor, ResolvedTensor), schema engine
|
|
> (Signature, resolve(), ArchProperties), and spec factories (GemmSpec,
|
|
> GemmAlgorithm, makeSpec(), validate) are in place. The device bridge
|
|
> is under active development.
|
|
|
|
## Why rocm_ck exists
|
|
|
|
CK Tile kernels are C++ templates. A GEMM kernel's tile size, pipeline
|
|
strategy, data types, and epilogue are all template parameters — fixed at
|
|
compile time. This is excellent for performance (zero-overhead abstraction,
|
|
full inlining), but it creates a problem for multi-architecture distribution:
|
|
the host program must be compiled separately from device code, and the host
|
|
compiler must never see CK Tile headers.
|
|
|
|
rocm_ck solves this by introducing a **host-device boundary** built on
|
|
constexpr data rather than template parameters:
|
|
|
|
1. **On the host side**, kernel configurations are plain C++20 structs
|
|
(`Signature`, `Algorithm`, `GemmSpec`). These are constexpr data —
|
|
they describe *what* to compute and *how*, without instantiating any
|
|
templates. Host code reasons about kernels using values, not types.
|
|
|
|
2. **On the device side**, a thin bridge layer lowers these constexpr
|
|
descriptions into CK Tile template instantiations. Each `GemmSpec`
|
|
maps to exactly one `ck_tile::GemmPipeline<...>` specialization.
|
|
|
|
3. **At the boundary**, pre-compiled kernels are packaged into
|
|
[kpack archives](https://github.com/ROCm/TheRock/blob/main/docs/rfcs/RFC0008-Multi-Arch-Packaging.md) —
|
|
self-describing, compressed, multi-architecture bundles. The host loads kernels at runtime
|
|
by matching a `GemmSpec` against the kpack table of contents. No
|
|
recompilation, no template instantiation on the host.
|
|
|
|
This separation is what makes CK Tile viable in
|
|
[TheRock](https://github.com/ROCm/TheRock)'s multi-arch build system,
|
|
where a single host binary must work with device code compiled for
|
|
many GPU targets (e.g. gfx90a, gfx942, gfx1151).
|
|
|
|
## The constexpr schema model
|
|
|
|
Traditional GPU kernel libraries select kernels through template
|
|
parameters or runtime enums. rocm_ck uses a third approach: **constexpr
|
|
structs that are validated at compile time and lowered to templates on
|
|
the device side.**
|
|
|
|
A kernel configuration has two axes:
|
|
|
|
- **Signature** — *what* the kernel computes: a directed graph of
|
|
operators (`GemmOp`, `AddOp`, `ReluOp`, ...) connecting named tensor
|
|
slots. Data types, layouts, and batch dimensions are part of the
|
|
signature.
|
|
|
|
- **Algorithm** — *how* the kernel computes it: tile geometry, pipeline
|
|
strategy, warp layout, padding, and scheduling. These are tuning
|
|
parameters that don't change the mathematical result.
|
|
|
|
The `Signature` and `Algorithm` are plain aggregate structs with
|
|
designated initializers — no constructors, no inheritance, no runtime
|
|
polymorphism. Validation happens in `consteval` functions: invalid
|
|
configurations (unsupported tile size, incompatible data types, missing
|
|
tensor slots) fail at compile time with actionable error messages.
|
|
|
|
Here is a preview of the API direction (not yet implemented):
|
|
|
|
```cpp
|
|
// Host side — pure constexpr, any C++20 compiler, no CK headers
|
|
constexpr Signature sig = {
|
|
.dtype = DataType::FP16,
|
|
.ops = {
|
|
GemmOp{.lhs = "A", .rhs = "B", .out = "C"},
|
|
AddOp{.lhs = "C", .rhs = "bias", .out = "D"},
|
|
ReluOp{.in = "D", .out = "E"},
|
|
},
|
|
};
|
|
|
|
// Device side — make_kernel lowers to a CK Tile template instantiation.
|
|
// Compiled separately per architecture, packaged into .kpack archives.
|
|
```
|
|
|
|
## Directory layout
|
|
|
|
```text
|
|
rocm_ck/
|
|
├── CMakeLists.txt # INTERFACE library, C++20, ck_tile_headers target
|
|
├── include/rocm_ck/ # Public headers — host-safe, no CK/HIP deps
|
|
├── src/ # (planned) Device bridge, kpack loading
|
|
└── tests/
|
|
├── CMakeLists.txt # Test tiers: ROCM_CK_SMOKE, ROCM_CK_KERNEL
|
|
├── unit/ # Fast host-only tests (< 1s, no GPU)
|
|
├── compile_fail/ # Static assertion tests — verify invalid configs fail at compile time
|
|
└── kernel/ # (planned) GPU kernel tests
|
|
```
|
|
|
|
## Build
|
|
|
|
rocm_ck is a CK feature, gated by `CK_ENABLE_ROCM_CK`:
|
|
|
|
```bash
|
|
cd composablekernel
|
|
cmake -B build -S . -G Ninja \
|
|
-DCK_ENABLE_ROCM_CK=ON \
|
|
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
|
|
|
|
ninja -C build smoke-rocm-ck # host-only smoke tests
|
|
ninja -C build check-rocm-ck # all rocm_ck tests
|
|
ctest --test-dir build -L ROCM_CK_SMOKE --output-on-failure
|
|
```
|
|
|
|
Default CK builds (`CK_ENABLE_ROCM_CK=OFF`) are unaffected.
|