Commit Graph

1097 Commits

Author SHA1 Message Date
Qianfeng Zhang
fb89a013b7 Combine minus with scale_s 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
23e80a5964 Move silu calculation to gemm1 iteration and try to interleave gemm_1 and silu 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
f40d68c1a9 Update in using masking for the case where kMasking is false and kPadSeqLenK is true 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
95b9a277ac Fix in generate_instances.py and re-generated the instances 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
e5fb03a4aa Back to use exp() instead of exp2() since exp() in ck_tile using fast __builtin_amdgcn_exp2f() 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
266e7bc8e9 Use kN0=64 to save vgprs 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8f7a97fe02 Fix the script name 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
bace12feac Fix in GetTileRangeAlongX 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
7a7c17802a Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
e731437af1 Change gemm0 to iterate along kN0 so that BlockGemm can overlap with maksing and siLu 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8da21d9cde Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
065776d42d Remove un-needed __builtin_amdgcn_sched_barrier(0) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
63acd4638b Use shared ring Lds buffers for K/V to avoid over-lapping between first-K/last-V or last-K/first-V 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
58090fe730 Tiny codes simplification in pipeline 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
ec14e9df3e Remove one line of __builtin_amdgcn_sched_barrier(0) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
c0609d49cd Fix the integer overflow in total_flops calculation 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
1efb2a8f38 Add scripts for comparing with triton 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
70b4aa310f Use exp2() to calculate exp() for better performance 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
45019fd5fd Remove the comparing of row/col to max_uih_len in masking 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
ad10a2dd53 Use kM0=128 kN0=64 to completely remove the vgprs spilling 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8b2948b31e Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
fafb375122 Use packed cast_tile for fp16 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
6686c7af44 Update to partially reduce the register spilling 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
459c5565d4 Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier() 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8a6c2591b0 Update the in pipeline codes 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
d360c61200 Fix in calculation of total_flops and update benchmark scripts 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
251136cca7 Add output of estimated TFLOPS 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
644ea27e0e Update to the scripts and error thresholds 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
2a71304bbb Tune the input initialization to avoid over-flow in silu 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
9c2dbf8d64 Add benchmark_hstu_attention.sh 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
cdb0704377 Add several verification test cases 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
beb6fa8cc1 Fix in kernel and forward dispatch for jagged mode 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
24822a4898 Fix in hstu-attention pipeline (which makes some testing cases passed) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
50b0af257c Fixes and updates 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
72774b718b Change in HstBlockMasking and kernel/reference codes for using masking 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
74a0ec4609 Fix and change in example 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
450494945f Add hstu attention kernel implementation, instances and interfaces (building succeeded) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
e6b6323b67 fix the jagged mode tensor access in reference_hstu_attention 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
a19f73c305 Initial reference implementation of hstu attention 2026-06-23 09:17:23 +00:00
Enrico Degregori
2733e75900 [rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)
[CK Tile] Async support pipeline V3

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)

## Technical Details

- Add `Async` bool to `Problem` struct to enable async pipeline in
existing one
- Add `static_move_ys` to load transpose. This generates offset in
assembly instructions saving registers
- Add `is_valid` to `async_get_vectorized_elements`. Before hard coded
to true. It allows to support padding
- Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr`
(wider use of lds load transpose on gfx950)
- Integrate async support in existing V3 pipeline (avoid pipelines
duplication)
- Create policy to support both async and default cases. This could be
used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows to
optimize ds_read instruction width for cases when one matrix is
transposed and the other is not. Before in such cases, `ds_read_b64` was
used instead of `ds_read_b128`
- Add test for V3 async. Currently only supporting cases with A and B
having the same type

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 06:57:14 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems

- The JSON files are verbose
- The JSON files get easily out of sync with the CK Builder .config
files from which they were generated from.
- The JSON file based configuration make it hard to list explicitly the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This is PR introduces six different rule sets for the CK Tile Dispatcher
code generation

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimensions corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This mainly defined by the data type and hardware as in general, we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations compose
approximately 25% of all configuration that the `full` rule set creates.
The additional configurations are valid, but they many not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule set
for now to avoid building an excessive amount configurations by default.
The `full` rule set can be taken into use by specifying CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing bahaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
damien-lejeune
5bebfd460f [rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06)
Add tile size for FMHA batch prefill bf16 for MI308X

## Motivation

Adding a tile size adapted to MI308X, for the FMHA Batch Prefill BF16
input type case

## Technical Details

N/A

## Test Plan

Benchmarking from the Aiter side with:

```
python3 op_tests/test_batch_prefill.py  -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile
```

## Test Result

We see an improvement with the new tile size on MI308X (both with PLT
mode OFF and ON)

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-17 06:22:26 +00:00
Sami Remes
c1f7104852 [rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01)
[CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K,
 unsupported-shape guard (#6663)

Three independent MX GEMM correctness bugs reported against
example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one
host-side atomic-add accumulation bug in the example's repeat loop.

- Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop
required num_loop > PrefetchStages, which let num_loop == 3 enter a hot
loop that produced 5 gemm accumulations instead of 3 (K == 3*K_Tile,
e.g. K=768, deterministically wrong). Require num_loop >= 4 instead:
pre-pipeline + TailNumber::Three already totals exactly 3.

- Kernel (gemm_mx_kernel.hpp): split-K was silently broken because
GridSize did not thread k_batch into blockIdx.z and the scale tile
windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore
read the wrong packed scales. Fix:
* GridSize returns dim3(grid_x, 1, k_batch) (persistent and
non-persistent).
* MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a
packed-scale K offset (also apply pad_tensor_view so OOB scale loads
return zero, matching A/B padding).
* operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset
(matches Underlying::SplitKBatchOffset's K1-aligned formula), and
dispatches the epilogue with memory_operation_enum::atomic_add for
k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard
as UniversalGemmKernel.
* MakeCBlockWindows templated on DstInMemOp; unconditionally applies
pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are
handled correctly.

- Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp):
add IsSupportedArgument and a static_assert for configurations that
produce silent wrong results:
* static_assert(!kPadK) -- the MX comp-async pipeline uses
async_load_tile whose OOB check is per-vector-start, so a vector
straddling the K pad boundary reads garbage. Until the async path learns
per-element pad masking, reject kPadK at compile time.
* Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when
kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always
(CShuffleEpilogue cannot safely run with a single partial tile); K %
(KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of
WarpTile_K * k_batch so every split lands on a packed-scale boundary.
  * All error paths log under CK_TILE_LOGGING with actionable messages.

- Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp):
* Call Kernel::IsSupportedArgument up front and throw a clear
runtime_error for rejected shapes (was silently launching an unsupported
kernel).
* Switch to launch_kernel_time_mask with a clear_gemm_output preprocess
that zeroes C between iterations when k_batch > 1 (mirrors
universal_gemm_invoker). Without this the default -warmup=50 -repeat=100
accumulated 150 atomic_adds into C after the kernel-side split-K fix.

Tests (test/ck_tile/gemm_mx/):
- Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true).
- test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression),
SplitK (k_batch=2,4 across full_k/partial_k paths),
TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding}
covering trailing partial tiles along M, N, or both.
- Run(...) now takes k_batch.
- packScalesMNxK: guard against OOB (mn, k) reads from src and
initialise e8m0 bytes to the zero exponent (0x00) instead of the
default-constructed NaN (0xFF), so padded lanes don't poison the packed
int32_t shared with in-range lanes.
- test_mx_gemm_instance.hpp: call IsSupportedArgument before launch.

Verification on gfx950, ROCm 7.2.0:
- ctest -R test_ck_tile_mx_gemm -> 100% (2/2).
- Example sweep over the original bug-report shapes: all K-aligned
shapes now validate correct (including 4096^3 sk=2 and the K=768 cases);
all K=128 shapes cleanly rejected with the new error message instead of
producing silent wrong results.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 08:28:55 +00:00
damien-lejeune
aab1d219f5 [rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1)
Add tile shape for FMHA batch prefill on MI308X (on fp8,
 hdim=256) (#8350)

## Motivation

Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X

## Technical Details

Appending the tile shape to the existing factory such that it can be
picked up by Aiter

## Test Plan

Ran the performance test on both MI300X and MI308X

## Test Result

MI300X performance seems unaffected by this change. MI308X does improve.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-15 07:00:35 +00:00
ltqin
0954a8f3fa [rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc)
[CK_TILE] Add graph capture support for FMHA backward(new
 branch) (#8262)

## Motivation
Add HIP graph capture support for FMHA backward operations. The original
implementation only supported normal execution mode and would cause
use-after-free crashes when used with graph capture replay.
When FMHA backward is captured into a HIP graph:
- First replay: host callback executes and deletes the closure (as
designed for normal mode)
- Subsequent replays: use-after-free crash because the closure was
already freed
This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work
correctly in both normal execution and graph capture modes.
2026-06-14 03:11:53 +00:00
Chao
320a813d67 [rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45)
[CK_TILE] Add host-side Pack-GQA optimization for FMHA
 forward (#6533)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[CK_TILE] Add host-side Pack-GQA optimization for FMHA forward

## Motivation

Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q
tensor
from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio *
seqlen_q, d]`
by adjusting strides, so grouped Q-heads sharing the same KV data are
processed
in a single tile. Zero kernel changes — runner-only.

Phase 1: non-causal attention with GQA ratio packing.
Phase 2: extends to dropout and split-kv paths, fixes stride edge cases.

## Technical Details

Modified files (2):
- `example/ck_tile/01_fmha/example_fmha_fwd.cpp` — Pack-GQA flag
plumbing
- `example/ck_tile/01_fmha/fmha_fwd_runner.hpp` — Q tensor reshape
logic,
  stride adjustment for GQA ratio packing

New files (1):
- `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh` — 53 test cases
covering
  non-causal, dropout, split-kv, various GQA ratios

## Dependencies

None — this PR is standalone.

## Test Plan

- GPU validation on MI300X (gfx942, ROCm 6.4.1):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048
-d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test
cases:
- Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32
-h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64
-h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32
-h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1
-repeat=3`

## Test Result

Benchmark results (MI350X, gfx950, ROCm 7.0):

| Config | Without Pack | With Pack | Improvement |
|--------|-------------|-----------|-------------|
| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 | 690.05 TFlops (0.199
ms) | 695.61 TFlops (0.198 ms) | +0.8% |
| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 | 706.25 TFlops (0.389
ms) | 729.35 TFlops (0.377 ms) | +3.3% |
| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 | 305.20 GB/s (1.763
ms) | 1813.41 GB/s (0.297 ms) | **+5.9x** |
| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 | 591.70 GB/s
(0.909 ms) | 1820.65 GB/s (0.295 ms) | **+3.1x** |
| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 | 695.16 TFlops | 702.72 TFlops
| no regression |

Benchmark results (MI300X, gfx942, ROCm 6.4.1):

No regression on MI300X. Pack-GQA is a runner-only optimization (zero
kernel changes), performance impact is within noise on MI300X.

| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 336.52 TFlops | 0.408 | -1.7% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 322.52 TFlops | 0.426 |
-0.7% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 349.85 TFlops | 0.786 |
+0.5% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.29 TFlops |
1.442 | +1.2% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 697.32 GB/s | 1.541 |
+0.8% |

All validation tests pass (`valid:y`) on both MI300X and MI350X.

Additional validation:
- 53 parameterized test cases pass (23 phase 1 + 30 phase 2)
- GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1
- No regression on MHA (ratio=1) workloads
- fp16 and bf16 validated
2026-06-10 01:56:44 +00:00
Emily Martins
674f7cdc0e [rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6)
[CK] Remove Stream-K from old CK

## Motivation

Since Stream-K has a CK Tile implementation, we no longer need Stream-K
in old CK. Hence, this PR removes Stream-K from old CK.

## Technical Details

All Stream-K artifacts in old CK have been removed including examples,
tests, kernels, and CK profiler artifacts.

## Test Plan

Ran a CI run on the branch before publishing PR.

## Test Result

All tests passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-08 16:47:26 +00:00
Bartłomiej Kocot
2c363870d9 [rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8)
[Ck][CK Tile] Global Load/Store for Large Tensors support
 (#6744)

## Motivation

Create solution to support large tensors in the entire ck tile.

## Technical Details

- add possiblity to use global load
- int64 indexing

## Test Plan

conv fwd tests

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-913
2026-06-06 10:14:17 +00:00
Yung-sheng Tu
e826b2eb7e [rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR adds two specialisations related to `tf32_t`.

## Technical Details

This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 12:27:41 +00:00
Sami Remes
ad4e2e7624 [rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320)
[CK_TILE] [QuantGEMM] Fix SplitK tail handling and other
 improvements (#7199)

This pull request introduces improved and more robust split-K support
for quantized GEMM. The main changes add runtime validation, utility
functions for split-K batch calculations, pointer offset handling for
split-K in grouped kernels, and enhanced support for various tensor
layouts. The changes also improve error handling and provide more
flexibility for runtime tail handling in split-K pipelines.

**Split-K Support and Validation Enhancements:**

* Added runtime validation to ensure `k_batch` is a positive integer and
that split-K configurations do not produce empty final batches or
mismatched pipeline tails, with detailed error messages and logging for
misconfiguration.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Introduced utility functions `get_splitk_batch_k_read` and
`get_splitk_last_batch_k` to compute per-batch K read sizes and handle
split rounding, ensuring correct and consistent split-K batch
partitioning.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R206-R234)
[[2]](diffhunk://#diff-635b89bdffa96b2b42f1632520cde36701d7d631e864185591f6b32f7645cf47L104-R107)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L388-R417)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no
split-K) for safer default behavior.

**Pointer Offsets and Grouped Kernel Handling:**

* Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to
all input pointers, mirroring the behavior of non-grouped kernels and
ensuring correctness for split-K launches.
* Modified AQ tensor view handling to correctly reflect the remaining
K-groups from the split-K batch's offset position, improving accuracy
for split-K in grouped kernels.

**Pipeline and Layout Flexibility:**

* Added support for runtime selection of split-K tail handling via a new
template parameter `RuntimeSplitKTail_`, with new helper methods to
dispatch GEMM pipelines accordingly.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629)
[[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641)
* Improved handling for tensor layout cases, including preshuffled B and
both row-major and column-major AQ layouts, ensuring correct pointer
arithmetic and compatibility checks.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
2026-06-05 11:41:49 +00:00