[CK] Add rocm_ck schema engine: Signature, resolve(), ArchProperties (#7179)

## Summary

A `Signature` is a directed compute graph: tensors are nodes, operators are edges. Shared names between operator outputs and inputs form the graph structure. `resolve()` walks this graph at compile time (`consteval`), inferring dtype, rank, and layout for every tensor. Invalid configs become compiler errors, not runtime crashes.

**Key design decisions:**

- **Operators teach the system about tensors.** `GemmOp` implies rank 2 and Row/Col/Row layout. `AddOp` and `ReluOp` propagate from connected slots. The dtype cascade fills in the rest: per-tensor → signature-wide → error.
- **Adding a new op is zero lines in the resolution engine** if it's structurally binary (`lhs/rhs/out`) or unary (`in/out`); C++20 concepts handle dispatch automatically. Only ops with special semantics need explicit branches.
- **TargetSet is a compile-time bitset over GPU targets.** The wave tile validation table is the single source of truth for valid instruction shapes, traced from CK Tile's WarpGemmDispatcher. FP8 tiles are available on gfx942+ via IterateK composition, not gfx950-only.

**Reading order:** signature.hpp (data model) → arch_properties.hpp (TargetSet, wave tiles) → resolve.hpp (resolution engine).

3 new headers, 3 unit tests (including diamond DAG coverage), 3 compile-fail tests. Introduces tests/compile_fail/ infrastructure.

**Stack**: PR 2 of 3 porting the rocm_ck constexpr schema from experimental to production.

1. Foundation types: DataType, Layout, Args, Ops (#7114)
2. **This PR**: Schema engine (graph resolution)
3. Spec factories: GemmSpec, makeSpec() (#7180)

Note: We also removed `FmhaBwdOp` for clarity, since that was introduced early and doesn't have tests set up.
**Depends on**: #7114

## Test plan

- [x] `ctest --test-dir build --output-on-failure`: unit tests + compile-fail tests pass
- [x] Compile-fail tests correctly reject: mixed CDNA+RDNA TargetSet, conflicting layouts, empty quantization scale names
rocm_ck
A C++20 constexpr API for configuring and distributing CK Tile GPU kernels across multiple architectures.
Status: Early development. Foundation types are in place (DataType, Layout, Args, operators, FixedString, PhysicalTensor, ResolvedTensor). The schema engine (Signature, resolve(), Algorithm) and device bridge are under active development.
Why rocm_ck exists
CK Tile kernels are C++ templates. A GEMM kernel's tile size, pipeline strategy, data types, and epilogue are all template parameters — fixed at compile time. This is excellent for performance (zero-overhead abstraction, full inlining), but it creates a problem for multi-architecture distribution: the host program must be compiled separately from device code, and the host compiler must never see CK Tile headers.
rocm_ck solves this by introducing a host-device boundary built on constexpr data rather than template parameters:
- On the host side, kernel configurations are plain C++20 structs (`Signature`, `Algorithm`, `GemmSpec`). These are constexpr data: they describe what to compute and how, without instantiating any templates. Host code reasons about kernels using values, not types.
- On the device side, a thin bridge layer lowers these constexpr descriptions into CK Tile template instantiations. Each `GemmSpec` maps to exactly one `ck_tile::GemmPipeline<...>` specialization.
- At the boundary, pre-compiled kernels are packaged into kpack archives: self-describing, compressed, multi-architecture bundles. The host loads kernels at runtime by matching a `GemmSpec` against the kpack table of contents. No recompilation, no template instantiation on the host.
This separation is what makes CK Tile viable in TheRock's multi-arch build system, where a single host binary must work with device code compiled for many GPU targets (e.g. gfx90a, gfx942, gfx1151).
The constexpr schema model
Traditional GPU kernel libraries select kernels through template parameters or runtime enums. rocm_ck uses a third approach: constexpr structs that are validated at compile time and lowered to templates on the device side.
A kernel configuration has two axes:
- `Signature` (what the kernel computes): a directed graph of operators (`GemmOp`, `AddOp`, `ReluOp`, ...) connecting named tensor slots. Data types, layouts, and batch dimensions are part of the signature.
- `Algorithm` (how the kernel computes it): tile geometry, pipeline strategy, warp layout, padding, and scheduling. These are tuning parameters that don't change the mathematical result.
The Signature and Algorithm are plain aggregate structs with
designated initializers — no constructors, no inheritance, no runtime
polymorphism. Validation happens in consteval functions: invalid
configurations (unsupported tile size, incompatible data types, missing
tensor slots) fail at compile time with actionable error messages.
Here is a preview of the API direction (not yet implemented):
// Host side: pure constexpr, any C++20 compiler, no CK headers
constexpr Signature sig = {
    .dtype = DataType::FP16,
    .ops = {
        GemmOp{.lhs = "A", .rhs = "B", .out = "C"},
        AddOp{.lhs = "C", .rhs = "bias", .out = "D"},
        ReluOp{.in = "D", .out = "E"},
    },
};
// Device side: make_kernel lowers to a CK Tile template instantiation.
// Compiled separately per architecture, packaged into .kpack archives.
Directory layout
rocm_ck/
├── CMakeLists.txt # INTERFACE library, C++20, ck_tile_headers target
├── include/rocm_ck/ # Public headers — host-safe, no CK/HIP deps
├── src/ # (planned) Device bridge, kpack loading
└── tests/
├── CMakeLists.txt # Test tiers: ROCM_CK_SMOKE, ROCM_CK_KERNEL
├── unit/ # Fast host-only tests (< 1s, no GPU)
├── compile_fail/ # Static assertion tests — verify invalid configs fail at compile time
└── kernel/ # (planned) GPU kernel tests
Build
rocm_ck is a CK feature, gated by CK_ENABLE_ROCM_CK:
cd composablekernel
cmake -B build -S . -G Ninja \
-DCK_ENABLE_ROCM_CK=ON \
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
ninja -C build smoke-rocm-ck # host-only smoke tests
ninja -C build check-rocm-ck # all rocm_ck tests
ctest --test-dir build -L ROCM_CK_SMOKE --output-on-failure
Default CK builds (CK_ENABLE_ROCM_CK=OFF) are unaffected.