[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-05 14:11:29 +00:00

[CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)

## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).
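
To make the scale granularities in the table concrete, here is a minimal host-side sketch of symmetric int8 quantization with one scale per group of tokens. It is not the kernel's code: the function names, the max-abs scale rule, and the row-major `[tokens][head_dim]` layout are assumptions made for illustration, and the mapping of scales to warps and threads inside a tile is not modeled.

```python
# Sketch only: symmetric int8 quantization with one scale per token group.
# Names and the max-abs scale rule are assumptions, not taken from the kernel.

INT8_MAX = 127


def quantize_per_group(rows, tokens_per_scale):
    """Quantize a [tokens][head_dim] matrix, sharing one scale per token group.

    tokens_per_scale=len(rows) models per-tensor scaling for a single head;
    128 / 32 / 4 model the per-block / per-warp / per-thread options listed for Q.
    """
    quantized, scales = [], []
    for start in range(0, len(rows), tokens_per_scale):
        group = rows[start:start + tokens_per_scale]
        max_abs = max((abs(x) for row in group for x in row), default=0.0)
        # Avoid dividing by zero for an all-zero group.
        scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        for row in group:
            quantized.append(
                [max(-INT8_MAX, min(INT8_MAX, round(x / scale))) for x in row]
            )
    return quantized, scales


def dequantize_per_group(quantized, scales, tokens_per_scale):
    """Invert quantize_per_group up to rounding error."""
    return [
        [q * scales[i // tokens_per_scale] for q in row]
        for i, row in enumerate(quantized)
    ]
```

With 256 tokens, `tokens_per_scale=128` yields 2 scales, `32` yields 8, and `4` yields 64. Finer granularity stores more scales but bounds the rounding error of each group by that group's own dynamic range rather than the whole tensor's.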

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`
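
The two causal alignments listed above differ only when the query and key sequence lengths are unequal. The sketch below uses the convention common to attention implementations (top-left anchors the diagonal at the first key, bottom-right anchors it at the last key); that this kernel follows exactly this convention is an assumption, the function names are hypothetical, and the generic windowed mask is not covered.

```python
# Sketch only: visibility predicate for the two causal mask alignments.
# The alignment convention is the commonly used one and is assumed here.


def causal_visible(i, j, seqlen_q, seqlen_k, align="top_left"):
    """Return True if query row i may attend to key column j."""
    if align == "top_left":
        offset = 0
    elif align == "bottom_right":
        # Shift the diagonal so the last query row sees every key.
        offset = seqlen_k - seqlen_q
    else:
        raise ValueError(f"unknown alignment: {align}")
    return j <= i + offset


def mask_matrix(seqlen_q, seqlen_k, align):
    """Build a 0/1 matrix of shape [seqlen_q][seqlen_k] for inspection."""
    return [
        [1 if causal_visible(i, j, seqlen_q, seqlen_k, align) else 0
         for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]
```

For `seqlen_q=2, seqlen_k=4`, the top-left mask is `[[1, 0, 0, 0], [1, 1, 0, 0]]` and the bottom-right mask is `[[1, 1, 1, 0], [1, 1, 1, 1]]`. When both lengths are equal the two alignments coincide.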

### Test commands

```bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 \
    -kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 \
    -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
```

`-qscale` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
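
To run the smoke test across every scale granularity, the commands above can be generated rather than typed by hand. The helper below is a hypothetical convenience script, not part of the commit. It uses only the flags and the two `-prec` strings shown above, and it assumes that every precision/granularity pair is accepted by the binary, which the commit message does not state.

```python
# Sketch only: build smoke-test command lines for a precision/qscale sweep.
# The helper names are hypothetical; flags and values come from the commands above.
import itertools

BINARY = "./build/bin/tile_example_sageattn_fwd"
QSCALE = {1: "per-tensor", 2: "per-block", 3: "per-warp", 4: "per-thread"}
PRECISIONS = ["fp8bf16", "i8fp8bf16"]  # only the strings shown in the commit message


def build_command(prec, qscale, batch=16, heads=8, seqlen=1024, hdim=128):
    """Return the argument list for one smoke-test invocation."""
    if qscale not in QSCALE:
        raise ValueError(f"qscale must be one of {sorted(QSCALE)}")
    return [
        BINARY, "-v=1", f"-b={batch}", f"-h={heads}", f"-s={seqlen}",
        f"-d={hdim}", "-kname=1", f"-prec={prec}", f"-qscale={qscale}", "-init=3",
    ]


def sweep():
    """All precision/granularity pairs (assumed valid; not confirmed by the source)."""
    return [build_command(p, q) for p, q in itertools.product(PRECISIONS, QSCALE)]


if __name__ == "__main__":
    for cmd in sweep():
        print(" ".join(cmd))
```

Each list can be passed directly to `subprocess.run(cmd, check=True)` on a machine where the example has been built.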

Author: ltqin
Date: 2026-04-30 18:33:36 +00:00
Committed by: assistant-librarian[bot]
Parent: e8d64ad5c6
Commit: de0a61e5c2

30 changed files with 7809 additions and 0 deletions

									
										1

example/ck_tile/CMakeLists.txt
									
												View File
												
				@@ -31,6 +31,7 @@ add_subdirectory(38_block_scale_gemm)

				add_subdirectory(40_streamk_gemm)

				add_subdirectory(41_batched_contraction)

				add_subdirectory(42_mx_gemm)

				add_subdirectory(49_sageattention)

				add_subdirectory(50_sparse_attn)

				add_subdirectory(51_tile_distr_enc_reg_map)

				if(BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS)
