984 Commits

Author SHA1 Message Date
Hosang
859acb5ae7 [rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6)
[CK_TILE] Add LLC-aware FMHA head grouping and head-major
 scheduling on RDNA (#5018)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:19:23 +00:00
Enrico Degregori
eb033ef208 [rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a)
[CK Tile] Eight Waves pipeline GEMM

## Motivation

Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM

## Technical Details

Summary:
 - Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
   - Use this block struct in ABQuant for encodings
 - Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
 - Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves

So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts

## Test Plan

Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).

Note: K padding test is disabled for this pipeline because it's not
implemented yet

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 08:31:56 +00:00
Bartłomiej Kocot
b8108662da [rocm-libraries] ROCm/rocm-libraries#5387 (commit 0c259bd)
[CK][CK Tile] Grouped Convolution Backward Weight set of
 fixes (#5387)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Grouped Convolution Backward Weight split k fixes for CK tile kernels

## Technical Details

- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

Pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 16:19:50 +00:00
Yi DING
574c1c121a [rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21)
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic
 Mode (#5174)

## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 06:14:31 +00:00
Michał Kulikowski
2c3f9bfa52 [rocm-libraries] ROCm/rocm-libraries#5348 (commit 7b18234)
[CK][Examples] Adding parameters for a couple of CK examples:
 -gemm_add_add_mean_meansquare_xdl_fp16 -gemm_dl_quantization_int8
 -gemm_xdl_bias_relu_quantization_int8 -gemm_xdl_quantization_int8

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-03-12 08:48:36 +00:00
Aviral Goel
1a4aa7fd89 [rocm-libraries] ROCm/rocm-libraries#5082 (commit 9313659)
ck_tile: add gtest unit tests for MX flatmm (gfx950)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add correctness unit tests for the MX-format flatmm kernel
(`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/`
- Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6,
FP8×FP4, FP4×FP8
- Tests cover all four kernel dispatch paths (the `has_hot_loop` ×
`tail_num` product):
  - `has_hot_loop=false, tail=ODD` (K=256, num_loop=1)
  - `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2)
  - `has_hot_loop=true, tail=ODD` (K=768, num_loop=3)
  - `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4)
- Remove unsupported `-split_k` CLI option from
`tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with
K-splitting and the option silently produced wrong results

## Changes

**New files (`test/ck_tile/flatmm/`):**
- `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT
library, links into 5 per-dtype test executables; forwards
`-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON
- `test_mx_flatmm_base.hpp` — base test fixture with
`run_test_with_validation(M, N, K, kbatch=1)`
- `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test
class and type aliases
- `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype
`TYPED_TEST_SUITE` files

**Modified files:**
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved
`preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by
both the example and the tests
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc`
— removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split`
formula, updated call sites after `preShuffleWeight` move
- `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)`
2026-03-11 22:47:59 +00:00
Anton Gorenko
2312eef6c3 [rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on
 gfx950 (#4368)

## Motivation

Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline

## Technical Details

The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 10:00:52 +00:00
Sami Remes
8f27f65d44 [rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb)
[CK_TILE] MX GEMM non-preshuffled RCR layout

## Motivation

Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 20:12:43 +00:00
Hosang
c800f88911 [rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523)
[CK_TILE] Update gfx11 FMHA forward kernel configs

## Motivation
Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded
seqlen_q/k) cases.
This tuning is based on heuristic search and improves performance in
most tested shapes.
Performance should be evaluated on top of
[`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018)
(required baseline).

## Technical Details

  - Updated gfx11 codegen heuristic choices for tile size and occupancy.
   - Updated gfx11 pipeline selection:
- Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently
slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad
cases are dispatched to the faster kernel path.`
- Kept gfx12 unchanged: with PSSK support from
[`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957),
existing gfx12 config is already sufficient.
  - Tuning rationale:
    - In some cases, higher `kBlockPerCu` lowers register pressure.
- On RDNA, this generally aligns with better performance when
`waves_per_eu >= 6`.

## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested this on gfx1100 and gfx1151
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24
-d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result
- TFLOPs by sequence length target: `gfx1100` layout: `bhsd`
- mode: batch / VGPR usage: 225 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 74.10 | 71.97 | 0.97x
4096 | 66.26 | 77.79 | 1.17x
8192 | 68.18 | 75.88 | 1.11x
12288 | 68.47 | 80.44 | 1.17x
16384 | 59.54 | 79.66 | 1.34x
20480 | 55.78 | 77.91 | 1.40x
24576 | 55.08 | 77.47 | 1.41x
27280 | 47.45 | 77.16 | 1.63x
- mode: group / VGPR usage: 256 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 71.47 | 70.6 | 0.99x
4096 | 64.74 | 77.06 | 1.19x
8192 | 64.68 | 75.47 | 1.17x
12288 | 66.43 | 79.95 | 1.20x
16384 | 56.02 | 79.73 | 1.42x
20480 | 50.21 | 78.15 | 1.56x
24576 | 47.29 | 77.53 | 1.64x
27280 | 46.13 | 77.04 | 1.67x

- TFLOPs by sequence length target: `gfx1151` layout: `bshd`
- mode: batch / VGPR usage: 225 vs 223

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 26.85 | 29.17 | 1.09x
4096 | 24.75 | 26.01 | 1.05x
8192 | 25.24 | 25.50 | 1.01x
12288 | 25.18 | 25.00 | 0.99x
16384 | 24.79 | 25.91 | 1.05x
20480 | 25.56 | 25.24 | 0.99x
24576 | 25.13 | 26.20 | 1.04x
27280 | 10.78 | 26.35 | 2.44x
- mode: group / VGPR usage: 256 vs 229

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 27.44 | 26.71 | 0.97x
4096 | 21.89 | 23.09 | 1.05x
8192 | 22.85 | 24.49 | 1.07x
12288 | 24.33 | 24.42 | 1.00x
16384 | 20.05 | 24.98 | 1.24x
20480 | 14.70 | 25.15 | 1.71x
24576 | 11.30 | 26.31 | 2.33x
27280 | 10.10 | 26.32 | 2.61x

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 16:47:43 +00:00
kensclin
8c216604d4 [rocm-libraries] ROCm/rocm-libraries#5218 (commit 60156cf)
[CK] Fix the issue of the aiter to call eightwarps pipeline.
 (#5218)

## Motivation

Fix the failure of the aiter to call eightwarp.
Changed Async to the name eightwarps.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

Pass

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-09 18:13:07 +00:00
Christopher Millette
e0d11b969b [rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26)
[CK] Replace tuple value construction with tuple_element_t
 type extraction [1A] (#5030)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Rationale
CK's device operation instance registration uses
`add_device_operation_instances` at ~1,850
call sites to register GPU kernel configurations. The existing
implementation constructs
`std::tuple` values just to extract their types via `decltype`, then
copy-constructs each
instance into `make_unique`. This is wasteful — only the types matter,
not the values — and
forces the compiler to instantiate the full `std::tuple` constructor and
`std::get` machinery
at every call site.

### What changed
- Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with
`std::tuple_element_t<i.value, TupleType>`, which extracts the type
directly without constructing any values
- Replace copy-from-default `make_unique<T>(value)` with direct default
construction `make_unique<T>()` — all CK device operation instances are
stateless structs with configuration encoded in template parameters
- Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to
enforce this contract at compile time with a clear error message
- Add Doxygen documentation for this high-traffic public API

### Value
- Eliminates unnecessary template instantiation of `std::tuple`
constructors and `std::get` across ~1,850 call sites
- Establishes a cleaner, more intention-revealing pattern for type-only
tuple usage
- The `static_assert` prevents silent breakage if a
non-default-constructible type is ever added
- No runtime behavior change — zero risk

### Files changed (9)
- `add_device_operation_instance.hpp`: Core pattern change
- 3 example files, 3 reduce instance headers, 1 convolution header, 1
profiler header

## Test plan
- [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce,
softmax, convolution)
- [ ] `static_assert` provides compile-time validation stronger than
runtime tests
- [ ] No runtime behavior change — stateless struct default construction
is identical to copy-from-default
- [ ] Compatible with both `std::tuple` and `ck::type_list` containers

🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:28:22 +00:00
Ville Pietilä
ae4e632c7d [rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)
[CK_TILE] Add CK Tile bwd weight profiler
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.

## Technical Details

- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
2026-03-04 21:50:29 +00:00
Anton Gorenko
782f0a9eed [rocm-libraries] ROCm/rocm-libraries#4957 (commit d2e4eb8)
[CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12
 (#4957)

## Motivation

Build pipelines with seqlen padding only to support vectorized loads in
the hdim dimension.
The existing pipelines have either all dims padded or all dims not
padded.
These pipelines can be used in ComfyUI for slightly better performance.

## Technical Details

Also a fix included for correct FLOPS calculation in
`tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t
capacity (signed int32).

## Test Plan

The existing test cases will use the new pipelines when parameters allow
(seqlens - padded, hdims - not padded):
```
ninja test_ck_tile_fmha_fwd

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16

bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12
```

## Test Result

All tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 04:50:44 +00:00
Yi DING
b09112bbad [rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface

## Motivation
Reduce memory usage; Be prepared to implement optimizations of reducing
nsplits in deterministic cases.

## Technical Details
This PR introduces a new launcher interface for the FMHA backward
operation, replacing direct function calls with a more structured
approach. The launcher encapsulates kernel dispatch logic and provides
access to computed metadata like the number of dQ acc splits.

**Changes:**
- Added `fmha_bwd_launcher` class that wraps kernel execution and
exposes `dq_acc_splits`
- Moved `fmha_bwd_traits` construction earlier in the execution flow to
support launcher initialization
- Refactored code generation to produce both legacy API and new launcher
constructor

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 01:21:07 +00:00
Brock Hargreaves
2a16d53cce [rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)
[CK] Address a bunch of errors associated with targeting
 gfx1200 on Windows (#5045)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Still addressing errors that are blocking the merge of TheRock PR:
https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382

## Technical Details

1. There are multiple fmha python scripts that are writing native paths
which are confusing cmake. I addressed one of these in an earlier PR
https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing
more that are exposed with gfx1200 target:

```
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```

2. In the following compiler error we see gemm_prec_str<ADataType,
BDataType> being passed as a function to concat(...), instead of being
evaluated with the parenthesis operator(), i.e.,
gemm_prec_str<ADataType, BDataType>(). There are multiples instances of
this, I wonder what non-msvc compilers do here:

```
[composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast]
[composable_kernel]   119 |     ((oss << sep << rest), ...);
[composable_kernel]       |                     ^~~~
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here
[composable_kernel]   248 |         return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName());
[composable_kernel]       |                ^
```

There are plenty of other places where we use gemm_prec_str with the
operator(), so I'm pretty sure these were just typos...but I'd like some
eyes on it.

3. There are 2 tests that fail to build on Windows, which I've excluded
from the build but will open bug tickets for:
    1.  gemm_weight_preshuffle
    2.  grouped_gemm_preshuffle

Here's a sample of the compiler error for these tests:

```
[composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe  -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations]
[composable_kernel]   110 |         const char* vp = std::getenv(name);
[composable_kernel]       |                               ^
[composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here
[composable_kernel]  1183 |     _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s)
[composable_kernel]       |                    ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE'
[composable_kernel]   368 |         #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT(    \
[composable_kernel]       |                                                       ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT'
[composable_kernel]   358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text))
[composable_kernel]       |                                               ^
[composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation)
[composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918)
[composable_kernel] Target: x86_64-pc-windows-msvc
[composable_kernel] Thread model: posix
[composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin
[composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s).
[composable_kernel] ninja: build stopped: subcommand failed.
[composable_kernel FAILED WITH CODE 1 in 238 seconds]
ninja: build stopped: subcommand failed.
```

## Test Plan

Wait for internal CI and make sure build compiles locally.

## Test Result

Waiting on CI

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 21:55:14 +00:00
Kiefer van Teutem
b042e1805a [rocm-libraries] ROCm/rocm-libraries#4804 (commit 832dd0e)
Add Tile Distribution Encoding Register Mapping debug utility
 for MFMA / WMMA unification work. (#4804)

## Motivation

This PR adds a small utility that allows you to use Tile Distribution
Encodings to directly map matrix elements to register locations and vice
versa. It can also print forward and backward layout mappings similar to
the Matrix Calculator utility. The utility is not meant for index
calculations in actual kernels, but rather as a debugging tool and
probably for automated verification of the policy structs in the new
WMMA / MFMA unification design.

## Technical Details

Tile Distribution Encodings are a core part of CK Tile which can define
the relationship between register and intrinsic matrix fragment
elements. They allow for any mapping based on unmerge and merge
transformations. Also, they allow for a special "Repeat" dimensions
which acts like an additional matrix dimension and allows for
replication of certain matrix elements. The new mapping utility can deal
with all aspects.

## Test Plan

Since this is a debug utility there is nothing to directly test, but
there is an example file that defines four different Tile Distribution
Encodings and prints their forward and backward mappings, along with
some extra parameters.

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 16:42:42 +00:00
Linjun-AMD
78ae3835a6 [rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66)
[CK] Fix gptoss sink

## Motivation

This PR removes conditional logic for handling infinity values in the
sink mechanism across multiple FMHA pipeline implementations, defaulting
sink_size to 0 and adding a constraint in the kernel selection logic.

## Technical Details

Changes:

Removed __builtin_isinf_sign(sink_v) checks and conditional
initialization of LSE accumulators across 7 pipeline files
Added default initialization (= 0) for sink_size in 4 argument structs
Added F_sink == "f" constraint to kernel compatibility checking

## Test Plan

Local test

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 01:54:46 +00:00
Andriy Roshchenko
b661eab573 [rocm-libraries] ROCm/rocm-libraries#4821 (commit 9456e0f)
[CK TILE] Refactor MX FLATMM example

Refactor the MX FLATMM example to support more pipelines
across different architectures. This work facilitates the NPI team
roadmap.
2026-02-27 23:21:39 +00:00
Aviral Goel
c8a8449eec [rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)
[CK] Add split-K support for ABQuantGrouped in
 block_scale_gemm (#4816)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Changes

### Split-K support in `gemm_quant_kernel.hpp`

- **`SplitKBatchOffset`**: Added `aq_group_offset` and
`aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B)
to track each split-K batch's position within the AQ scale tensor. For
`ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided
by `AQuantGroupSize::kK`.

- **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter
(defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group
dimension reflects only the remaining K-groups from the split-K offset,
consistent with how `MakeBQBlockWindow` handles the BQ tensor.

- **`RunGemm`**: Threads the `aq_k_split_offset` through to
`MakeAQBlockWindow` when in split-K mode.

### Constraints in `IsSupportedArgument()`

Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped:

1. **Mode check** — split-K is only allowed for `BQuantGrouped` (no
preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant
mode with `k_batch > 1` returns `false`.

2. **B quant group alignment** — `KRead` (per-batch K slice) must be
divisible by `BQuantGroupSize::kK`. Each batch must operate on complete
B quantization groups; a partial group would require splitting a scale
value across batches.

3. **A quant group alignment** (new, ABQuantGrouped only) — `KRead` must
also be divisible by `AQuantGroupSize::kK` for the same reason applied
to the AQ scale tensor.

4. **Minimum 2 K-tile iterations per batch** (new) — The
software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead,
so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When
`KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch
reads into the next batch's memory region and produces incorrect
results. Configurations where `K == k_batch * KPerBlock` are therefore
rejected.

### Example update (`run_gemm_quant_example.inc`)

Updated the comment above the `IsSupportedArgument` call to document
that split-K is now supported for both `BQuantGrouped` (no preshuffle)
and `ABQuantGrouped` (no `APreshuffleQuant`).

## Unit Tests

Two new test files covering decode and prefill tile shapes across a
range of `k_batch` values (2–8), data types (FP8, BF8), and quantization
group sizes (1×1×128 and 1×128×128 for B):

- `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile
shape (M=16, N=64, K_tile=256)
- `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile
shape (M=128, N=128, K_tile=128)

Each test calls `run_test_with_validation` which runs the kernel and
checks correctness against a CPU reference. Configurations excluded from
tests are annotated with comments explaining which constraint they
violate (typically the `per_batch_num_loop >= 2` requirement).

## Prerequisites

This PR depends on #4429, which must be merged before this can be
merged.
2026-02-26 23:57:17 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
Bartłomiej Kocot
eede24de0d [rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7)
[CK] Small improvements for grouped conv backward weight
 (#4872)

## Motivation

Improvements for CK Tile convolution builder run function and atol/rtol
calculations.

## Technical Details

- Add preprocessing function for wrw when k_batch is larger than 1 for
builder run function
- Divide num acums by number of groups to get real number of accums

## Test Plan

CI wrw tests

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-783
2026-02-25 20:11:01 +00:00
Brock Hargreaves
c90a363589 [rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd)
[CK] Use as_posix() instead of str() for paths in
 fmha_fwd_appendkv.py (#4812)

## Motivation

This is causing a failing PR for Windows:
https://github.com/ROCm/TheRock/pull/3382
```

[composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```

## Technical Details

The file:
[fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78)
writes a bunch of paths to a text file which is later parsed by cmake.
When passing a pathlib.Path to str(), str() converts to a native path,
in this case / to \\ on Windows which confuses cmake. In this case we
need to write paths with forward slashes and then pass those onward to
cmake.

## Test Plan

1. Ensure this doesn't impact existing CI.
2. Ensure compilation of Windows pass locally.

## Test Result

1. Passes existing CI
2. This fixes the compilation error locally.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:56 +00:00
Brock Hargreaves
abf13bdec1 [rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)
[CK] Fix windows build issues

## Motivation

Full build on Windows is currently broken due to compiler errors, this
PR should help fix that. This is also holding up the following PR in the
TheRock: https://github.com/ROCm/TheRock/pull/3382

## Technical Details

1. I don't see a good reason to be nesting a windows include inside the
ck_tile namespace. It was causing compiler errors too: Windows.h comes
with min and max, which was conflicting with ck_tile::min and
ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this
inclusion in the future.
2. The TRUE/FALSE macros are already used by Windows.h, which causes an
error. So I've opted for True/False. You can see this pattern in other
rocm-libraries.
3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN
context, from \<cmath\> on Windows. We'll be able to use
std::numbers::v_pi\<float\> when we have C++20 support.
4. There was a missing \<chrono\> include.

## Test Plan

Test locally and make sure this doesn't impact existing CI.

## Test Result

Compiles locally and passes existing ci.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:13 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for
 rdna4 (#4425)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Add support for grouped gemm multi ABD fixed NK. MR

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Enrico Degregori
4c626aeaa6 [rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant
 (#4267)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Supported types combinations using BQuant=e8m0:
 - A=bf16
 - B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
 - add support for scaling instructions in `DequantPack8`
 - mx pipeline:
   - extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
 - block gemm:
   - mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
 - warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-24 17:57:02 +00:00
Emily Martins
fc3180120e [rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum

## Motivation

Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.

## Technical Details

Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic        = 0u,
    Reduction     = 1u,
    TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic = 0u,
    Linear = 1u,
    Tree   = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan

No new functionality was added, so no new tests were added; I just
validated existing tests and examples.

## Test Result

All tests passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 06:41:15 +00:00
Anton Gorenko
0d92fffedb [rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11

## Motivation

Add support of gfx11 architectures (RDNA3) to FMHA.

## Technical Details

Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).

Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.

Additional changes:
* FMHA tests are now build and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage resulting in incorrect
dispatching logic, sometimes tests report that there are no instance
available for current parameters.
* Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when
they are not requested (i.e. every time), likely has no effect on
performance but makes disassembly a bit clearer.

## Test Plan

```
ninja test_ck_tile_fmha

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```

## Test Result

All tests must pass (some tests may be skipped).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-21 01:15:57 +00:00
Illia Silin
1915cdfcc2 [rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in
 GridwiseGemm_wmma_cshuffle_v3" (#4762)

Reverts ROCm/rocm-libraries#4638
unfortunately, this PR interfered with the PR#4299 and caused build
errors for gfx11:

In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:

/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-02-20 22:41:34 +00:00
linqunAMD
29781f2ac4 [rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3
 (#4638)

1. Add GetEstimateVgprCount to estimate the VGPR usage in
GridwiseGemm_wmma_cshuffle_v3
2. Add IsValidCompilationParameter to disable kernel which use too many
vgprs.
- Currently, the threashold is AvailableVgprCount * 1.25
3. Modify examples to avoid test is disabled on gfx11

It is port from internal repo
PR[#192](https://github.com/ROCm/composable_kernel/issues/192)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-20 15:57:18 +00:00
Márton Bidlek
7b97e197ef [rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using for WMMA.

The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.

At this stage, the focus is functional correctness and compatibility,
not performance tuning.

## Technical Details

- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-19 08:13:46 +00:00
Thomas Ning
5cb8109535 [rocm-libraries] ROCm/rocm-libraries#4640 (commit 37b8c81)
Fix the Composable Kernel CI and versions incompatibility
 (#4640)

## Motivation

This PR has 4 patches:
1. Fix the CI error of grouped gemm.
2. Fix the incompatibility of old linux version.
3. Fix the potential errors of flatmm.
4. Address the previous comments of abquant eight warps pipeline
solution.
2026-02-18 15:00:26 +00:00
Cong Ma
d06f35027a [rocm-libraries] ROCm/rocm-libraries#4354 (commit d41f08a)
[CK TILE] fix numerical errors of preshuffle_b

This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.

# The numerical issue

## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```

# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.

**Key changes include:**

### Data layout and shuffling logic

* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)

### Test infrastructure enhancements

* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-11 07:05:46 +00:00
Erwin Terpstra
b41bfece83 [rocm-libraries] ROCm/rocm-libraries#4268 (commit d2fca53)
[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant
 pipeline (#4268)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
2026-02-10 13:59:03 +00:00
Yi DING
d5acfd8d52 [rocm-libraries] ROCm/rocm-libraries#4451 (commit 091bf0f)
[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation

## Motivation
This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers
to compile across multiple GPU architectures by introducing compile-time
availability gating and a new attribute tag mechanism for kernel
symbol/attribute specialization.

## Technical Details
- Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline
and propagate availability handling into `QuantGemmKernel`.
- Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling
per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and
unique symbols.
- Update the blockscale GEMM quant example to pass kernel attributes and
adjust gfx950 gating.

## Test Plan
- CI
- Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950'
-GNinja && ninja tile_example_gemm_quant`
- Local test with ROCm/aiter#1954
## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-10 12:42:19 +00:00
Chinmay Dattanand Kuchinad
0cafa68b6f [rocm-libraries] ROCm/rocm-libraries#4292 (commit b7f1367)
Enable group mode (varlen) kernel generation for PyTorch
 integration (#4292)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

This PR enables group mode (variable-length attention) kernel generation
for PyTorch's CK SDPA backend.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

The change is minimal (single line deletion) but enables a significant
feature: variable-length attention support for ROCm users via PyTorch's
torch.nn.attention.varlen API.
2026-02-09 20:59:55 +00:00
kensclin
5b3e527c88 [rocm-libraries] ROCm/rocm-libraries#4280 (commit b7de1e1)
[CK_TILE] Add blockscale GEMM support for EightWarps on
 gfx950 (#4280)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

gemm blockscale eightwarps support

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-09 03:55:52 +00:00
jakpiase
731afe535a [rocm-libraries] ROCm/rocm-libraries#4357 (commit ff3e982)
[CK_TILE] Add support and tests for V6 pipeline in conv fwd
 (#4357)

Added support for conv v6 pipeline in ck tile's convolution forward
kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline
and should be faster than other pipelines for some cases. This PR also
adds tests inside profiler that's currently inside experimental
directory, so now we should be able to detect regressions easier.
2026-02-08 19:57:53 +00:00
Jeff Huang
7b18f5fed2 [rocm-libraries] ROCm/rocm-libraries#4263 (commit f34aec2)
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement per-page K/V quantization for paged attention:
  - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
  - Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with
computations

## Proposed changes

Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-04 23:26:20 +00:00
Illia Silin
569640dc70 Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)" (#3705)
This reverts commit 301eb5cf08.
2026-02-03 09:52:14 -08:00
Aviral Goel
3e77721755 feat: add split_k support for block scale gemm bquant mode. (#3653)
* WIP: add splitk to bquant

* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types

* chore: remove temporary test script

* fix: incorrect tile window length for splitted bq tensor window

* chore: improve comments

* test: add unit tests to cover bquant splitk functionality

* fix: conflict resolution by renaming variables
2026-02-02 14:41:53 -08:00
Zoltán Lakatos
301eb5cf08 Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)
* device struct implementation

* added xdl grouped multi abd fixed nk testing

* wmma implementation fixed

* avoid unnecessary device mem allocation and code cleanups

* cleanup instances definitions

* wmma examples added

* code cleanups

* fix clang format

* typo and compilation fixes related to reference gemm

* fix compilation error due to std::remove_cvref_t

* added missing hip_check_error includes

* correction to example instances

* review commentes addressed

* removed split-k from testing

* code formatting

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 13:58:11 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distrubution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* reslove comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
jiangyon.ren
4d2f8c111e [CK_TILE][FMHA] Add sparse attention VSA (#3341)
* add sparse attention VSA

* fix the pre-commit

* Add jenga test and pre-commit

* add bf16 for vsa

* add jenga support bf16

* remove lse arg

* split kernel code to block & kernel

* fix the pre-commit

* fix the pre-commit

* fix the copyrights

* fix the copyright

* fix the copyright & rename block to pipeline

* fix the copyright and pipeline

* remove lse & dropout & add fmt

* fix the jenga&VSA code review

* remove the useless code & resolved the comments

* remove useless code

* remove useless code

* Clean up code

* Remove more unused code

* Re-format .hpp

* Refactor codegen scripts

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2026-01-31 00:59:47 +08:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
vivienfanghuagood
f3d8b7210f Extend CK fmha_batch_prefill kernel coverage to head_dim=256 (#3328)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-30 11:18:20 +08:00
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improviments to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and seperated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
ltqin
654bec3362 Fix block scale init value (#3666)
* Make blockscale descale range adaptive to data type max value

* format
2026-01-28 12:37:15 -08:00