8 Commits

Author SHA1 Message Date
Sami Remes
c1f7104852 [rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01)
[CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K,
 unsupported-shape guard (#6663)

Three independent MX GEMM correctness bugs reported against
example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one
host-side atomic-add accumulation bug in the example's repeat loop.

- Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop
required num_loop > PrefetchStages, which let num_loop == 3 enter a hot
loop that produced 5 gemm accumulations instead of 3 (K == 3*K_Tile,
e.g. K=768, deterministically wrong). Require num_loop >= 4 instead:
pre-pipeline + TailNumber::Three already totals exactly 3.

- Kernel (gemm_mx_kernel.hpp): split-K was silently broken because
GridSize did not thread k_batch into blockIdx.z and the scale tile
windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore
read the wrong packed scales. Fix:
* GridSize returns dim3(grid_x, 1, k_batch) (persistent and
non-persistent).
* MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a
packed-scale K offset (also apply pad_tensor_view so OOB scale loads
return zero, matching A/B padding).
* operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset
(matches Underlying::SplitKBatchOffset's K1-aligned formula), and
dispatches the epilogue with memory_operation_enum::atomic_add for
k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard
as UniversalGemmKernel.
* MakeCBlockWindows templated on DstInMemOp; unconditionally applies
pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are
handled correctly.

- Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp):
add IsSupportedArgument and a static_assert for configurations that
produce silent wrong results:
* static_assert(!kPadK) -- the MX comp-async pipeline uses
async_load_tile whose OOB check is per-vector-start, so a vector
straddling the K pad boundary reads garbage. Until the async path learns
per-element pad masking, reject kPadK at compile time.
* Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when
kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always
(CShuffleEpilogue cannot safely run with a single partial tile); K %
(KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of
WarpTile_K * k_batch so every split lands on a packed-scale boundary.
  * All error paths log under CK_TILE_LOGGING with actionable messages.

- Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp):
* Call Kernel::IsSupportedArgument up front and throw a clear
runtime_error for rejected shapes (was silently launching an unsupported
kernel).
* Switch to launch_kernel_time_mask with a clear_gemm_output preprocess
that zeroes C between iterations when k_batch > 1 (mirrors
universal_gemm_invoker). Without this the default -warmup=50 -repeat=100
accumulated 150 atomic_adds into C after the kernel-side split-K fix.

Tests (test/ck_tile/gemm_mx/):
- Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true).
- test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression),
SplitK (k_batch=2,4 across full_k/partial_k paths),
TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding}
covering trailing partial tiles along M, N, or both.
- Run(...) now takes k_batch.
- packScalesMNxK: guard against OOB (mn, k) reads from src and
initialise e8m0 bytes to the zero exponent (0x00) instead of the
default-constructed NaN (0xFF), so padded lanes don't poison the packed
int32_t shared with in-range lanes.
- test_mx_gemm_instance.hpp: call IsSupportedArgument before launch.

Verification on gfx950, ROCm 7.2.0:
- ctest -R test_ck_tile_mx_gemm -> 100% (2/2).
- Example sweep over the original bug-report shapes: all K-aligned
shapes now validate correct (including 4096^3 sk=2 and the K=768 cases);
all K=128 shapes cleanly rejected with the new error message instead of
producing silent wrong results.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 08:28:55 +00:00
Enrico Degregori
bdd7a8333d [rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97)
[CK Tile] PermuteN support MX GEMM

## Motivation

Add PermuteN support to preshuffle MX GEMM

## Technical Details

 - Modify `shuffle_b_permuteN` to support MX preshuffled layout
- Add `preShuffleScalePermuteN` with same functionality of
`preShuffleScale` but layout consistent with PermuteN
 - Include MX pre-processing functions in the library

## Test Plan

Add test configuration for permuteN with preshuffle (both FP4 and FP8)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Cong Ma <congma13@amd.com>
2026-06-05 03:04:43 +00:00
JP-Fernando
74bc86240b [rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a)
[CK Tile] Add gemm universal preshuffle to MX GEMM  (#5647)

## Motivation

Add gemm universal preshuffle support to existing MX GEMM pipeline.

The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.

## Technical Details

The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.

## Test Plan

Add `preshuffle` option to example: `tile_example_mx_gemm`.

Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.

## Test Result

Example and tests were successful on `gf950` architecture in the `Alola`
cluster.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-05-22 16:07:53 +02:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
Enrico Degregori
9565ca21ec [rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2)
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)

## Motivation

Integrate Eight Waves pipeline in MX GEMM

## Technical Details

 - EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
   - Extend support of EightWaves policy for FP4 (packed types)
 - Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
 - CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
 - Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
   - Add configuration for EightWaves pipeline
 - Tests:
   - Unify existing FP8 and FP4 tests
   - Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile

Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR

## Test Plan

Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-19 11:53:19 -07:00
Illia Silin
717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
2026-05-15 06:46:51 -07:00
Thomas Ning
1ab29bf22f [rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e)
CK Tile MX GEMM Packing Improvement (#5323)

## Motivation

Reduce the scale loading size and also has better utilization of MFMA
scale selection.

## Technical Details

Add up the packing of mx scales.

## Test Plan

Use the existing test cases.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
2026-03-17 11:57:32 -07:00
Sami Remes
c1525b3f30 [rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb)
[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)

## Motivation

Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2026-03-10 20:12:05 +00:00