92 Commits

Author SHA1 Message Date
Kiefer van Teutem
2089713f94 [rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Remove=20unification=20Flag=20structs=20in=20favor=20of=20new?=
 =?UTF-8?q?=20WarpGemmParams=20(#8227)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Recently, the way flags are sent down to the intrinsics was changed in
CK Tile. At the point where the WarpGemm is invoked, an arbitrary number
of template parameters can be passed, and these are passed down all the
way to the lowest level intrinsics wrappers. Here
`WarpGemmParamsParser<>` is used to extract flags for the intrinsics.

In this MR we adapt the the unification framework (amdgcn_mma struct and
MmaPipelines) to work in the same way. By doing this, there is no longer
a point in our custom intrinsic Flag structs, so these are removed.

Unrelated but I also tried removing the MmaPipeline flags because they
arn't used for anything except CTranspose, which is already available.
This also make test_amdgcn_mma_pipeline completely redundant so removed
that as well.
2026-06-26 12:00:58 +00:00
Kiefer van Teutem
137f2a9a10 [rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05)
[CK TILE] Initial integration of MFMA / WMMA unification
 framework into CK Tile (#7407) (locked behind flag)

Note: Everything works but this is still a draft MR because I want to do
some more cleanup and maybe do some testing for MX fp6. Also please
don't trigger copilot, I will do this once I feel it is clean enough,
otherwise I'll get a bunch of comments about stuff I already know.

## Motivation
The point of this MR is to finally use our unification MmaPipelines to
replace the existing WarpGemms in CK tile and make sure everything
works. I focused on gfx908 and gfx950 for now, dense and scale
intrinsics, fp16, fp8, and fp4. I managed to get CK tests / examples
working for all of these scenarios, so the basic implementation should
be correct. I expect some more tweaks will be required to get full
support, some of which I already anticipated in the section "New
issues".

## Big switch: USE_NEW_UNIFIED_FRAMEWORK
When USE_NEW_UNIFIED_FRAMEWORK is 1, we replace all WarpGemms with
MmaPipelines from the new unified framework. This means
WarpGemmDispatcher will use the UnificationDispatcher instead of the
regular Dispatcher. Furthermore, named WarpGemms like
WarpGemmMfmaF32F32F32M16N16K4 will also get rerouted to the
UnificationDispatcher. The latter is necessary because some pipelines
bypass the WarpGemmDispatcher in favor of directly using named
WarpGemms.

For now the switch is turned on for easier testing, so don't expect the
CI to pass. When off, this MR should not affect any of the CK tile tests
at all so I *would* expect the CI to pass.

## Simplification of MmaPipelineBase
I found that the structure of MmaPipelineBase was a bit complex and I
was able to reduce it a lot. The only thing an Mma Pipeline does
(currently) is provide a wrapper around amdgcn structs that allows k
iteration and sparse compression. We don't allow M and N composition for
now for simplicity and since this is not expected from WarpGemms in CK
Tile currently.

## Re-interpretation of tile distribution encodings for packed datatypes
Tile distributions for packed types are expected to describe
mathematical elements, not datatype elements! This distinction is why
the gfx950 fp4 CK_tile tests were not working. Updated the
interpretation in amdgcn_mma, tile distribution calculator, and layout
test, along with comments. Tested on all architectures.

## getCMakeCompilerTarget() for configuration time target architecture
This is a workaround because there are a lot of cases in CK Tile where
the host code inspects Device constructions like WarpGemm, and we need
to get the version that *will* be used on the device. This is a big
kludge and we need to figure out a better solution. Also this util will
always pick the *first* cmakelists target arch, so there will be issues
when compiling for multiple target architectures. Ideally, the host code
should not touch the WarpGemms at all, and there would be no issue. This
has been a point of friction in CK for a long time. We can discuss this
with Chris Millette.

## Tests
I was able to verify that the following CK Tile tests and examples work
with the new unified framework:
tile_tutorial_mfma_16x16x16 (gfx9, fp16, uses transpose)
tile_example_gemm_basic (gfx9, fp16)
test_ck_tile_mx_gemm_async (gfx950, microscaling fp8 and fp4)

Within the tile tutorial I was also able to use
WarpGemmMfmaF16F16F32M16N16K32TransposedCDistribution instead of
WarpGemmMfmaF16F16F32M16N16K16TransposedCDistribution to verify that
basic K iteration also works.

A little while ago I also verified that the performance did not change
in a measurable way, and the compile did not change *much* but did see
some swings up to 20% each way (faster or slower). We will need some
broader and more accurate tests for this going forward.

## Moving forward
To confidently be able to replace the existing Dispatcher and WarpGemm
framework with our own, we need to make sure that all existing tests and
examples work on all platforms. Furthermore, we should pay attention to
performance and compile time of all these tests. Performance should
definitely not change, as all we're doing is refactoring the support
structure around the intrinsics, which should melt away during
compilation.

## New issues
(I will make new issues with descriptions for these but here is a short
list (incomplete):

Test RDNA CK Tile pipelines
Test Sparse Ck Tile pipeline (does not exist but we can make one)
Remove MmaOp flags from unification framework and update it to work with
new WarpGemmParamsParser instead.
Add Swizzle support and test in CK Tile pipelines.
Test Scale + transpose Ck Tile pipelines.
Coherent strategy for attrnumaccess for dense, scale, default, packed,
wmma, gfx1250, etc in CK tile. It's messy now.
Dispatcher should not be determining scale-ness of intrinsics based on
MNK sizes.
Try adding back the MN composition in MmaPipelines
Why is test_amdgcn_wavewise_mma only compiled for CDNA?
Investigate NOP and AGPR flags
Maybe get rid of WmmaTag in dispatcher.
Find a coherent strategy for dealing with host vs device compile passes,
and the host sneaking a peak at WarpGemm internals. Related to
getCMakeCompilerTarget().

## TODO before merge
Some changes exist just for ease of testing, and will be reverted before
merging:
- gemm_basic.cpp has a lot of datatypes disabled because otherwise
compile time is huge for testing
- USE_NEW_UNIFIED_FRAMEWORK is set to 1 for easier testing
2026-06-24 13:35:25 +00:00
Enrico Degregori
55e30feac6 [rocm-libraries] ROCm/rocm-libraries#8637 (commit a1a7f5f)
[CK] Fix compilation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-20 02:08:58 +00:00
Enrico Degregori
2733e75900 [rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)
[CK Tile] Async support pipeline V3

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)

## Technical Details

- Add `Async` bool to `Problem` struct to enable async pipeline in
existing one
- Add `static_move_ys` to load transpose. This generates offset in
assembly instructions saving registers
- Add `is_valid` to `async_get_vectorized_elements`. Before hard coded
to true. It allows to support padding
- Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr`
(wider use of lds load transpose on gfx950)
- Integrate async support in existing V3 pipeline (avoid pipelines
duplication)
- Create policy to support both async and default cases. This could be
used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows to
optimize ds_read instruction width for cases when one matrix is
transposed and the other is not. Before in such cases, `ds_read_b64` was
used instead of `ds_read_b128`
- Add test for V3 async. Currently only supporting cases with A and B
having the same type

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 06:57:14 +00:00
Sami Remes
a3a12b8945 [rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)
[CK_TILE] Enable full transpose layout support for MX GEMM
 pipeline (#5813)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Enable full transpose layout support for MX GEMM pipeline (32x32x64
MFMA)

### Summary

This PR enables all four matrix layout combinations (Row/Col, Row/Row,
Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp
tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously,
only the canonical `A=RowMajor, B=ColumnMajor` layout was supported.

### Changes

**Kernel-side transpose support:**

- **`warp_gemm_attribute_mfma.hpp`**: Introduce `kSplitFactor` logic in
`get_warp_dstr_encoding` to split the K-dimension distribution encoding
when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This
satisfies the `TransposeTileDistributionTraits` suffix validation
required by `load_tile_transpose`. The distribution encoding now also
receives the `DataType` template parameter to compute the split factor
based on packed element size.

- **`gemm_pipeline_ag_bg_cr_comp_async.hpp`**: Uncomment and enable the
`InputTileDistributionTraits` logic to properly transform LDS load tile
distributions for transposed reads. Add `static_assert`s to catch
misconfigurations where a layout requires transpose loads but the warp
tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr`
limits).

- **`load_tile_transpose.hpp`**: Fix `DataVec` sizing for packed types
(`pk_fp4_t`) — divide `vecLoadSize` by `PackedSize` to prevent buffer
overflow when each physical element contains multiple logical values.

- **`warp_gemm_attribute_mfma_impl.hpp`**: Set `kDefaultScale` to
`0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()`
overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring
correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.

- **`warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`**: Add generic
`WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher
specialization to support arbitrary MX data type combinations (fp4, fp6,
fp8) with the 32x32x64 MFMA, consolidating the existing type-specific
aliases.

- **`gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`**: Simplify
`wg_attr_num_access` determination — `Double` for fp8, `Single`
otherwise.

**Reference implementation fix:**

- **`reference_gemm.hpp`**: Fix nibble selection for packed 4-bit types
(`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and
`reference_gemm_abquant`. The previous logic used `k % 2` or
`index[K_DIM] & 1` to select which nibble to extract, which assumed K
was always the fast (contiguous) memory dimension. This is only true for
`A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the
flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its
parity to correctly select the nibble regardless of layout.

**Test infrastructure:**

- **`test_mx_gemm_config.hpp`**: Add `MxGemmConfig32` base and
`MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64
warp tile.
- **`test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`**: Add `Config32`
test suites covering all four layout combinations. Restrict `Config16`
(16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds
`ds_read_tr` limits.
- **`test_mx_gemm_util.hpp`**: Fix scale tensor layout — scales are
always row-major `[M, K/32]` and column-major `[K/32, N]`, independent
of A/B data layout.

### Test plan

- [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)
2026-06-18 17:05:09 +00:00
Enrico Degregori
1762eaeaec [rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb)
[CK Tile] EightWaves pipeline int8 support

## Motivation

EightWaves pipeline currently is supporting only FP types

## Technical Details

 - Enable 16x16x64 int8 instruction for gfx950 in dispatcher
 - Enable int8 in EightWaves pipeline
 - Add tests
 - Fix bug in `warp_gemm_attribute_mfma_impl.hpp`

## Test Plan

Tests have been added for int8 GEMM using EightWaves pipeline

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 12:59:59 +00:00
Aviral Goel
c43b550206 [rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0)
[GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to
 MX GEMM TDM pipeline (#8202)

Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM
pipeline, building on the scale16 warp-gemm layer already in develop.

- **warp gemm:** add the 32x32x128 f8f6f4 scale16 traits and alias (2x2
grid of 16x16x128 scale16 intrinsic calls with per-subtile
`SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's
`IsScale16` path.
- **default policy:** select the warp gemm via the dispatcher with
`IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each
pick the matching scale16 path; guard WarpTile M/N to 16 or 32;
scale-tile distribution for the scale16 layout.
- **pipeline V1/V2:** thread `Problem::ScaleBlockSize` through the
scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize`
for the kernel.
- **block gemm:** extract int64 (scale16) / int32 (scale32) scales by
width.
- **kernel:** scale16 descriptor order; reject unsupported
`BlockScaleSize`.

Test coverage for this path is in the stacked follow-up PR.
2026-06-17 16:41:00 +00:00
Yung-sheng Tu
e826b2eb7e [rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR adds two specialisations related to `tf32_t`.

## Technical Details

This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 12:27:41 +00:00
Aviral Goel
267ca67001 [rocm-libraries] ROCm/rocm-libraries#8028 (commit c1cb112)
[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 via
 fused-downconvert override (#8028)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Adds `__builtin_amdgcn_wmma_bf16f32_16x16x32_bf16` (fp32 accumulate →
bf16 output) to the CK Tile WMMA warp-gemm path. **API only** — the unit
test is split into a stacked PR (#8035) so this API change can be
reviewed in isolation.

## Changes (4 files)

- **16-bit trait:** `wmma_intrinsic_downconvert` (calls the bf16f32
builtin — fp32 C in, bf16 C out) plus `COutDataType = bf16_t` /
`COutVecType`.
- **`WarpGemmAttributeWmmaImpl` / `WarpGemmAttributeWmma`:**
`mac_downconvert(c_fp32, a, b)` (kTransC-aware) returning the bf16
C-output vector.
- **`WarpGemmImpl`:** `mac_downconvert` tail handler producing a bf16
C-output tile from the fp32 accumulator tile, reusing
`CWarpDstrEncoding` (output layout identical to the f32 C tile).

Verified on gfx1250 (via the stacked test PR #8035): the test passes;
the existing WMMA warp-gemm test is unaffected (additive change only).
2026-06-05 05:01:31 +00:00
Copilot
4fcd73a98e [rocm-libraries] ROCm/rocm-libraries#7974 (commit 9df2c76)
composablekernel: remove stray *.hpp.bk backup artifacts
 (#7974)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four `*.hpp.bk` files were accidentally committed to
`projects/composablekernel/`, likely as leftovers from a prior merge or
conflict resolution. Each is an older snapshot of its `.hpp` counterpart
— the canonical `.hpp` files are newer and contain the correct current
content.

## Deleted files

| File | vs. `.hpp` counterpart |
|---|---|
| `ck_tile/core/tensor/tile_window.hpp.bk` | Older version: uses legacy
`bool isL1Cache`/`PrefetchL1` template params; missing
`DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp`
include |
| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` | Older version:
missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines)
for gfx1250 architecture |
| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` | Older version:
missing `WmmaTag`, `IsScale16` template param, and several newer
dispatcher specializations |
|
`ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk`
| Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses
`static_ford` (since refactored to nested `static_for`) |

## Verification

- No other `.bk` files exist in `projects/composablekernel/`.
- No build scripts, CMake files, includes, or documentation reference
these `.bk` files.
- No `.hpp` files were modified.
2026-06-04 03:06:43 +00:00
Aviral Goel
e01603bc31 [rocm-libraries] ROCm/rocm-libraries#7725 (commit eef7e12)
[GFX1250][CK_TILE] Add scale16 warp gemm unit tests

## Summary
- Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp
gemm layers for gfx1250
- Add comprehensive wave-level unit tests for scale16 warp gemm
(16x16x128 and 32x32x128 tile sizes)
- Test all fp8/bf8 type combinations and TransposeC variants
- Fix WarpGemm wrapper for non-uniform scale16 configurations

Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init).
Pipeline enablement follows in the next PR.
2026-06-03 22:05:29 +00:00
Aviral Goel
99ab4c4ef7 [rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58)
[CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32)
 (#7830)

Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250,
enabling bf16-input bf16-output WMMA at the warp GEMM level.

- Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16,
16,16,32>
- Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias
- Add Dispatcher entry for bf16->bf16 16x16x32
- Add warp_gemm test with reference GEMM validation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 13:54:16 +00:00
Tianyuan Wu
22a99f97e8 [rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93)
[CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile

## Motivation
This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and
improves warp GEMM unit test coverage/structure for gfx1250-specific
cases.

## Technical Details

- Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added
IsScale16 plumbing to warp GEMM dispatcher path
- Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250
coverage to verify all F4 WMMA paths

## Test Plan
Run ./test_ck_tile_wg_32x16x128_fp4.

## Test Result
```
./test_ck_tile_wg_32x16x128_fp4
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1751 ms total)
[  PASSED  ] 3 tests.
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-30 01:28:48 +00:00
Andriy Roshchenko
d5c9215064 [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC

## Motivation

Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.

## Technical Details

The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.

## Test Plan

Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.

To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.

## Test Result

The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
2026-05-29 17:02:45 +00:00
Anton Gorenko
66d6714376 [rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve precision of mxfp4 without performance penalties.

## Technical Details

Since performance of scale MFMAs is the same when neither A nor B is
fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the
second GEMM, while types of Q, K, V stay the same.
This allows to improve overall precision significantly because fp6 has
32 non-negative values used for P quantization compared to just 8 values
for fp4.

It was found that there is a compiler bug with
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561) but a workaround seems to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 06:55:17 -07:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
JiaLuo-CAN
5ff7497fa7 [rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4)
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537)

<Migrate from Internal repo PR>
test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8
cases:
std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant,
False, True, False>,
std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant,
False, True, False>

GFX1250 would fail with incorrect results, GFX950 would fail when
compiling BF8+FP8 and give incorrect results for FP8+BF8.
The issue is due to the wrong ComputeDataType selection.
The fix is to consider original ADataType and BDataType even when
ComputeDataType is not void. For compiling error on gfx950, the bf8,
fp8, 16x16x32 warp Gemm is added.
2026-05-21 08:36:23 -07:00
Illia Silin
717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
2026-05-15 06:46:51 -07:00
ltqin
501e7ef12a [rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)
[CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)

## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`

### Test commands

\`\`\`bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
\`\`\`

\`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
2026-04-30 11:32:23 -07:00
chris-tsiaousis-hpc
1c04029c0e [rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca)
[CK Tile] Unification work - mma transformations pipeline (#5508)

## Motivation

In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.

A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR.
The pipeline base class is basically at the RFC stage.

We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.

## Technical Details

The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed.

## Test Plan

New tests are added.

## Test Result

Tests should pass.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-04-14 09:25:01 +02:00
Po Yen Chen
bf736dfa74 [rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2)
[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)

## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for
  FP8 inference workloads using the assembly-optimized V3 code path.

  ## Technical Details

  **Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

  **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

  **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
  - Add FP8 data path with dtype-aware type selection
  - Add asm volatile P matrix conversion from f32 to fp8
  - Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy
(`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

  **Codegen (`fmha_fwd.py`):**
  - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale:
no/pertensor)
  - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

  **Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum
(`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

  | Problem | Regular (TFlops) | Varlen (TFlops) |
  |---|---:|---:|
  | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
  | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
  | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
  | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
  | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
  | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
  | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
  | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
  | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
  | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and
causal/non-causal modes. Verified all cases dispatch to the V3 pipeline
by enabling `F_is_v3_enabled` and confirming kernel names contain
`qr_async_trload_v3`.

  ## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

  ## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 22:19:28 +08:00
assistant-librarian[bot]
544a3182c1 [rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)
[CK_TILE] add tf32 support (#4302)

## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [ ] Any dependent changes have been merged

## Discussion

---
🔁 Imported from
[ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538)
🧑‍💻 Originally authored by @yingluAMD

---------

Co-authored-by: yingluAMD <Yingmao.Lu@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-19 10:17:20 +01:00
chris-tsiaousis-hpc
f43b7421c7 [rocm-libraries] ROCm/rocm-libraries#5241 (commit 43daeac)
Changed the include order of the new WMMA/MFMA unification framework (#5241)

Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.

## Motivation

Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.

Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.

## Test Plan

The CI (that runs `remod.py`) should compile.

## Test Result

Existing CI should compile and be green.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-03-12 09:26:58 +01:00
Anton Gorenko
5f94c1aa0f [rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)

## Motivation

Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline

## Technical Details

The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 09:59:50 +00:00
Sami Remes
c1525b3f30 [rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb)
[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)

## Motivation

Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2026-03-10 20:12:05 +00:00
chris-tsiaousis-hpc
fe128b581f [rocm-libraries] ROCm/rocm-libraries#4837 (commit 6316035)
[CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837)

## Motivation

The existing unification work supports DENSE intrinsics. In this PR we
enable support for SPARSE as well as SCALE intrinsics and add an example
SPARSE implementation.

## Technical Details

Mostly trivial changes. One framework change is that the desired
`MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant
commit explains, we do not support a fallback family at the moment, but
it is something we can consider.

## Test Plan

Added a new test for the relevant sparse specializations.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-03-05 19:52:04 +00:00
Ville Pietilä
68b0f420ba [rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)
[CK_TILE] Add CK Tile bwd weight profiler (#4797)

## Motivation

To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.

## Technical Details

- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.

---------

Co-authored-by: Ville Pietilä <>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2026-03-04 21:49:42 +00:00
Thrupti Raj Lakshmana Gowda
720d7fa02b [rocm-libraries] ROCm/rocm-libraries#4592 (commit 45f76cb)
Tile Engine support for gfx950 (#4592)

## Motivation

This PR adds support for the gfx950 GPU architecture to the Tile Engine
in Composable Kernel library, focusing on GEMM operations with FP8 and
BF8 data types.

## Technical Details

Added gfx950-specific MFMA warp GEMM implementations with conditional
compilation.
Updated default GEMM configuration parameters for tile sizes and warp
configurations.
Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950
hardware.

## Test Plan

Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.

## Test Result

Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Thrupti Raj Lakshmana Gowda<ThruptiRaj.LakshmanaGowda@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-02-26 10:14:40 -06:00
Bartłomiej Kocot
d244aaa1c0 [rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check (#4791)

## Motivation

Improve OOB checks. Remove permutes which have been generated by thread
buffer zero clear. at now in assembly there is only condmask instead of
permute + condmask.

Change number of KPack for generated instances

## Technical Details

Remove permute instructions from assembly

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: jakpiase <jakpia21@gmail.com>
2026-02-24 22:40:48 +01:00
assistant-librarian[bot]
023ba6848e [rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant (#4267)

## Proposed changes

Supported types combinations using BQuant=e8m0:
 - A=bf16
 - B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
 - add support for scaling instructions in `DequantPack8`
 - mx pipeline:
   - extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
 - block gemm:
   - mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
 - warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

---
🔁 Imported from
[ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689)
🧑‍💻 Originally authored by @EnricoDeg

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-24 09:55:50 -08:00
Anton Gorenko
841e6b89d1 [rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11 (#4584)

## Motivation

Add support of gfx11 architectures (RDNA3) to FMHA.

## Technical Details

Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).

Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.

Additional changes:
* FMHA tests are now build and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage resulting in incorrect
dispatching logic, sometimes tests report that there are no instance
available for current parameters.
* Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when
they are not requested (i.e. every time), likely has no effect on
performance but makes disassembly a bit clearer.

## Test Plan

```
ninja test_ck_tile_fmha

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```

## Test Result

All tests must pass (some tests may be skipped).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-20 17:15:10 -08:00
Cong Ma
d06f35027a [rocm-libraries] ROCm/rocm-libraries#4354 (commit d41f08a)
[CK TILE] fix numerical errors of preshuffle_b

This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.

# The numerical issue

## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```

# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.

**Key changes include:**

### Data layout and shuffling logic

* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)

### Test infrastructure enhancements

* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-11 07:05:46 +00:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distrubution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* reslove comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
Po Yen Chen
8c1788757a [CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672)
* Change call to the intrinsics

* fix clang format

* Undo changes under include/ck/utility

* Use named variable as vector size

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-01-30 12:02:49 -08:00
Bartłomiej Kocot
0727e85e52 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes
2026-01-19 22:29:01 -07:00
joyeamd
b78563b3d3 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm
2026-01-05 23:39:00 -08:00
yadaish
c0ee71d735 Dev/a8w4 and a8w8splitk (#3447)
* Ck moe bs splitk pr (#3440)

* splitk kick-off. Compilation fail

* splitk hack pass

* fix scale offset calc.

* clang-format for a8w8_moe_blk_gemm1 splitk change

* fix testcase error

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>

* Zan/moe a8w4 (#3441)

* update

* update

* update ck moe a8w4

* update

* update

* update

* compile pass

* update

* update

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* support new a8w4 kernel

* update

* update ck_tile

* re format

* update

* update

* fix conflict

* fix build

* update ck_tile moe

* fix clang format

* fix the problem

* fix accruacy issue

* fix

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-12-19 09:26:52 +08:00
Yi DING
57e1e4a848 [CK_TILE] Add FP8xF4 Flatmm (#3401)
* Refactor policy

* fix a bank conflict

* Enable mixed mx flatmm

* Update
2025-12-17 10:01:48 +08:00
Gino Lu
ba6af9fe7c [CK_TILE] Add unit test for fp4 warp gemm (#2817)
This update includes a unit test for warp GEMM
2025-12-01 13:56:48 +08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Yi DING
8fa90025d0 [CK_TILE] Refine warp_gemm_attribute_mfma (#3272) 2025-11-26 10:57:15 +08:00
Yi DING
8d50001b93 [CK_TILE] Improve F8F6F4 Scaled WarpGemm (#3197)
* [CK_TILE] Improve F8F6F4 Scaled WarpGemm

* Thanks, Copilot
2025-11-13 20:22:05 +08:00
linqunAMD
1b1c46e508 [CK_TILE] Fix gemm_quant (#3186) 2025-11-11 08:23:57 -08:00
John Afaganis
3f996ee738 Add copyright notices to missing files (#3133) 2025-10-31 07:35:11 -07:00
Yi DING
e135dd518d [CK_TILE] Add mxfp4 flatmm (#3080)
* Squashed commit of the following:

commit 3e1a851dad834776efbe4fe365ac82c4ed312010
Author: Ding, Yi <yi.ding@amd.com>
Date:   Thu Oct 23 06:10:54 2025 +0000

    Fix & clean after rebase

commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6
Author: Ding, Yi <yi.ding@amd.com>
Date:   Wed Oct 22 10:46:13 2025 +0000

    Squashed commit of the following:

    commit 0b6b9dbd1b
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 02:04:27 2025 -0500

        fix bandwidth calculation

    commit 9aebf53bb7
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 00:58:59 2025 -0500

        updates

    commit 62607de56c
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Sep 19 00:39:46 2025 -0500

        fix a bug, set the A DS_read preload size to 4 for MXFP4

    commit 92ad6fcc0a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Thu Sep 18 01:19:03 2025 -0500

        fix a_wrap preload issue for large MPerBlock.

    commit f2db44710f
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 21:34:03 2025 -0500

        optimized the VGPR repack issue for MXFP4

    commit 346a400027
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Wed Sep 17 04:19:44 2025 -0500

        fix time error

    commit 80c1743034
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 03:58:00 2025 -0500

        updated, function passed.

    commit ce26d9071e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 22:21:39 2025 -0500

        fix, function partially passed

    commit 0a89ed13a5
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 03:01:12 2025 -0500

        fix, reference function passed, next check kernel function

    commit ec9bcef591
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Sep 16 02:29:01 2025 -0500

        let pack/unpack return pk_fp4_t

    commit a333206929
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 20:50:26 2025 -0500

        fix

    commit 3893c06540
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 15 05:51:06 2025 -0500

        fix bug

    commit 8052bea019
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 04:02:05 2025 -0500

        fix core dump issue, function is not correct.

    commit 9ceb3fd508
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 03:03:02 2025 -0500

        updates, build pass

    commit cc94eb6045
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 00:05:18 2025 -0500

        updates

    commit 22586c3135
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Sun Sep 14 23:40:28 2025 -0500

        fix bug

    commit e92e67b8dd
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 03:28:50 2025 -0500

        fix interface

    commit 8b1dd60c08
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 02:53:50 2025 -0500

        add interface in warp_gemm_impl

    commit c6135f6abe
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 10 05:03:08 2025 -0500

        updates some fixes.

    commit b0d71b8d19
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 9 04:37:42 2025 -0500

        fix after merge ginolu/add_wgmfma_dispatcher

    commit f119c30317
    Merge: c5030e602 72c8ef856
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 22:09:15 2025 -0500

        Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev

    commit c5030e602e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 21:42:47 2025 -0500

        update mx flatmm tail pipeline

    commit 72c8ef8567
    Merge: 9661bb400 e4a772890
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:10:23 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 9661bb400b
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:09:55 2025 -0500

        fix type error

    commit 0509597f55
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 04:01:40 2025 -0500

        update hotloop pipeline

    commit 754ae0461b
    Merge: 15d44406e 83f607e2a
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:22:26 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 15d44406e5
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:21:26 2025 -0500

        fix clang format

    commit 146963d62a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 3 10:00:54 2025 -0500

        some updates

    commit 12526b626a
    Merge: 47cee0471 00fd72b2d
    Author: asleepzzz <hanwen.chang@amd.com>
    Date:   Wed Sep 3 13:22:03 2025 +0800

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 47cee04712
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 02:11:02 2025 -0500

        fix vec size error

    commit d2892925e5
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 01:23:39 2025 -0500

        fix format error

    commit 16993acd1d
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Sat Aug 30 03:19:07 2025 -0500

        update codes

    commit 9c37e55d13
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Aug 29 11:27:33 2025 -0500

        init ck_tile mxfp4 flatmm

    commit 5c484a5672
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 28 08:02:50 2025 +0000

        Add bias for f16xf4 moe_flatmm

    commit dd6539f366
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 27 13:39:47 2025 +0000

        update case construction

    commit 65b702454c
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 26 12:32:29 2025 +0000

        support swiglu activaion and use rcpf to accelerate silu

    commit b422e41e08
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Aug 26 02:33:55 2025 -0500

        first commit

    commit d05eed931d
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 04:01:59 2025 -0500

        add line to last

    commit d69cab7f0c
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 03:20:46 2025 -0500

        adjust A_LDS descriptor to avoid bankconflict

    commit 65989e940c
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Thu Aug 21 09:46:52 2025 -0500

        enable hotloop

    commit c378e9bdf8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 09:12:21 2025 +0000

        support atomic_pk_add_bf16 on gfx950

    commit 85976b0b87
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 06:58:55 2025 +0000

        use int64_t as expert stride to avoid overflow

    commit 9fbcc8f8a4
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 13:53:32 2025 +0000

        use v4i32 as the storage type for B to avoid repack operation

    commit 81899bd920
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 06:40:03 2025 +0000

        add pk_fp4_t and e8m0_t support for amd_buffer_load_impl

    commit c27eb0771a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 04:39:14 2025 +0000

        optimize cvt_pkf4_to_f16 implementation

    commit 3ca0bd500a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 19 14:56:46 2025 +0000

        optimize A_LDS descriptor to avoid bankconflict

    commit f7f0306eea
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 18:43:37 2025 +0000

        fix gate-up when GU_NRepeat > 1

    commit be55c0f9cb
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 17:28:11 2025 +0000

        add fp16xf4 moe

    commit 599e1f5b32
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Aug 17 17:51:18 2025 +0000

        rename example

    commit 7899fb4a8d
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 15 06:20:46 2025 +0000

        remove additional check when e8m0->float

    commit 714b341797
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 14 09:34:12 2025 +0000

        eliminate repeat dequant

    commit 53e8c0c533
    Merge: 5de620895 cc9c7b9e5
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:51:49 2025 +0000

        Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm

    commit 5de6208952
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:16:48 2025 +0000

        update f16xMXF4

    commit 732ebdee8b
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 10:48:53 2025 +0000

        update scale-preshuffle for MXF4

    commit edb58d0680
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 11:24:34 2025 +0000

        update

    commit cc9c7b9e58
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 08:38:23 2025 +0000

        optimize gemm2 atomic_add pattern

    commit 200a11afc8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:59:47 2025 +0000

        update scale for mxfp4

    commit 87aed564dc
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:56:14 2025 +0000

        update case construction

    commit 8b85fa6cf2
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 06:03:06 2025 +0000

        update granularity control

    commit 1b8c7097b8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 03:42:46 2025 +0000

        fix TileConfig

    commit 8ba1c708dc
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Thu Aug 7 21:37:28 2025 +0800

        Add e8m0 scaled convert into CK_TILE (#2617)

        * first commit

        * remove redundent code

        * modify according to comments.

        * fix type_convert error with scaled_type_convert

    commit f788d3d629
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 8 20:19:16 2025 +0000

        add mixed_prec fp16xfp4

    commit 3dea10a277
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 7 09:22:04 2025 +0000

        debug mixed_prec flatmm

    commit 0ba513b148
    Merge: 90e910f3a c0cb4d036
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Aug 6 16:49:47 2025 +0800

        Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk

        fix split k

    commit 6d3cbc7c0e
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 6 08:33:33 2025 +0000

        add moe_flatmm

    commit c0cb4d036d
    Author: coderfeli <coderfeli@163.com>
    Date:   Wed Aug 6 02:45:31 2025 +0000

        fix split k

    commit 90e910f3a7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 4 07:16:36 2025 +0000

        fix flatmm with scaling when WarpTileM == 32

    commit aa5e008fa5
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 11:01:23 2025 +0000

        optimize scaling epilogue

    commit ac5908c0bb
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 07:28:38 2025 +0000

        fix wrong config for fp8 scaling

    commit 3f43b841d4
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 06:20:30 2025 +0000

        prune debug message

    commit 2e5d4c74cd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 04:52:08 2025 +0000

        fix compile error

    commit c117a1986a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Jul 29 15:42:58 2025 +0000

        Add persistent option on flatmm for tuning

    commit a587701117
    Author: AMD-dteng <dteng@amd.com>
    Date:   Tue Jul 29 22:48:00 2025 +0800

        update pipeline v1: add atomic IGLP schedule

    commit f9e48148d2
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 09:09:27 2025 +0000

        fix error log throwing

    commit 1b6d7cf407
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Jul 28 08:24:51 2025 +0000

        crz idea

    commit 5473f06461
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Jul 27 11:57:38 2025 +0000

        Add permuteN optimzization when NRepeat % 2 == 0 on flatmm

    commit bfb9f4002f
    Author: sjfeng <j514681085@icloud.com>
    Date:   Sun Jul 27 17:24:08 2025 +0800

        try to remove c_shuffle_lds

    commit 1264f4d2ab
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Jul 25 07:41:48 2025 +0000

        fix loop-dim mismatch and improve c_shuffle alu parallelism

    commit 1239d8a546
    Merge: 406645448 b908f5e80
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 08:46:51 2025 +0000

        merge flatmm -scale

    commit 4066454483
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 16:19:58 2025 +0800

        revert delete of inc file

    commit 68390988c9
    Author: solin <bingzhou@amd.com>
    Date:   Thu Jul 24 04:38:16 2025 +0000

        reorg  flatmm code

    commit b908f5e803
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:12:31 2025 +0000

        fix flatmm syntax error on gfx950

    commit 5a1183ebbd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:04:22 2025 +0000

        support flatmm scaling

    commit 89fa639207
    Author: valarLip <340077269@qq.com>
    Date:   Wed Jul 23 08:44:12 2025 +0000

        merge flatmm pipe v0 from dteng_flatmm_opt

    commit 3f7d848dd3
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:38:12 2025 +0800

        build pass

    commit 6dacf833da
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 07:20:26 2025 +0000

         fix bug

    commit 7e1bd4b839
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:01:53 2025 +0800

        sync

    commit 46a538e39e
    Author: valarLip <340077269@qq.com>
    Date:   Tue Jul 22 08:09:35 2025 +0000

        adaptive scheduler instead of Macro definition

    commit 9aa3396a79
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 17 08:40:35 2025 +0000

        fix tail handler bug

    commit fb76450e63
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 16 10:12:19 2025 +0000

        merge from dteng_flatmm_opt

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

* Fix crash on small M

* Apply suggestion from @Copilot

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-10-31 11:29:05 +08:00
SamiAario-AMD
254bce9346 Lwpck 3550: Implement and test fixed precision fp8 x bf8 (#2963)
* HasHotLoop is a constexpr

* Remove an unused function

* Remove some unused include statements

* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM

* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs

* Remove two barrier calls that HotLoopScheduler already calls

* No need to suppress a variable that hasn't been declared

* Replace six arg_parser arguments with constexpr literals

* Simplify run_gemm_test_prec_type

* The strides don't need to be passed via arg_parser as we use their default values

* The layouts don't need to be passed as arguments twice

* Pass M N and K as regular arguments, not using the argument parser

* We can now remove the argument parser

* Add a common file for precision types to be used in testing

* Convert basic and universal GEMM tests to use gtest

* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes

* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on

* Added a changelog entry

* Add missing copyright statements

* ifndef-define-endif is not needed with pragma once

* Fix a comment

* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp

* Disable the unreliable test MoeSortingCase4

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-30 13:36:10 +01:00
Anton Gorenko
1e77695fe8 [CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests

* Add WMMA support to fwd FMHA pipelines

* Tune tile sizes a bit for less spilling

fp16 256 is still quite slow

* Fix Q grad tile distribution for warp size = 32 and hdim >= 256

With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.

* Use code based on BlockDropout in BlockDropoutBwd

* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal

* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
  because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
  it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
    * any number of warps (power-of-2), not only 4;
    * kN1 = 16, not only 32;
    * any number of splits;

* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now

* Replace hard-coded kN1 in dispatch code with the requested tile size

* Add gfx12-specific tile sizes for split KV

* Pass GPU architecture to kernel generation scripts

This is still a temporary solution.

* Build and run FMHA CI tests for gfx12

* Fix issue after merging

* Fix bwd tile sizes

The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).

* Use hardware f32->f8 on gfx12, remove v_perm

__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
 32-bit dword) is used to store results (two f8 values).

* Update changelog

* Add WMMA support to pagedkv

* Fix scripts after rebasing

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Fix names after cherry-picking

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Do not use filters related to qr_async_trload

They disable tiles/pipelines which are valid for gfx12.

* Use different dstr encoding when C is transposed

* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code

Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().

* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution

Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.

* Generalize usages of WarpGemmDispatcher for MFMA and WMMA

WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.

* Mark has_load_tr as maybe_unused

There are no transpose loading for RDNA.

* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code

* Detect BlockSize on host based on warp size of the current device

If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.

* Fix calculation of grid size for combine kernel with warp size = 32

* Add missing includes and header

* Support multiple archs in one binary for fwd

* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill

* Support multiple archs in one binary for bwd

* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
  they can be used as fallbacks (similarly to fwd);

* Extract common code from register_traits

* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"

To simplify merging , the proper fix is in develop already.

* Support new numerical d paddings in trait ordering checks

* Build fp32 tests only on gfx9

* Do not use hardcoded M0 = 64 for dot bwd kernel

* Use textwrap.indent from standard library

* Make fp8 pipelines on gfx12 consistent with gfx9

* Update tests for current pipelines

* Make ninja check more responsive in CI

ninja buffers output so this job looks hanging.

* Support fp8fp32 by limiting O vector size

The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).

* Remove unused cmake options

* Unify including  amd_buffer_addressing.hpp/_builtins.hpp

* Temporarily use amd_buffer_addressing.hpp on >=gfx10

amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
 * 1 scalar for exec masks instead of 2,
 * gfx12 uses different instruction names etc.

* Update asm in bf16 conversions to work with warp 32

* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd

* Add arch tags to kernels/host funcs, compile for each arch separately

* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename

* Add workaround for miscompilation of bwd with padded hdim

SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.

* Fix add_gtest_executable for absolute paths

Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.

* Run only hdim 128 smoke tests for fp8fp32

There are no instances for hdim 64 and 256.

* Format py with ruff to simplify merging develop

* Fix incorrect var name

* Codegen for gfx9,gfx950 when --targets is not specified

Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).

* Combine arch-related values into ArchTrait

This more centralized approach removes duplication of various formatting templates.

* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"

Some code is extracted into a function.
2025-10-29 13:31:08 -07:00
Gino Lu
bedade2572 [CK_TILE] Add fp4 warp gemm 16x16x128 (#2738)
* first commit

* fix format error

* fix vec size error

* fix clang format

* fix type error

* add interface in warp_gemm_impl

* fix interface

* fix bug

* fix bug

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-23 10:55:51 -07:00
Khushbu Agarwal
81458a6681 Weight Preshuffle Block Scale gemm support (#2877)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

* initial commit

* debugging

* working fp8 for init constant

* fp8 working with all inits

* updated block level code with comments

* changing the loop iter

* debugging

* debugging

* debugging

* code fix

* code clean up

* clang formatted

* Add comment

* code cleanup

* clang formatted

* merge conflicts fixes

* applying the latest int4 changes to the piepline

* fixing test code for updated traits

* Adding gtest

* review comments addressed

* addressing review comments

* remove c++20 code

* added flush cache changes

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
2025-09-29 12:46:37 -07:00
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00