Commit Graph

440 Commits

Author SHA1 Message Date
Max Podkorytov
9d060d3e3c [CK-Tile] functional support for transposed inputs in compute-bound double-lds-buffer pipeline with async loads from global memory to LDS (#2984)
* reuse local prefetch logic from compute v4 pipeline

add single-tile test

explicit lambda capture

reuse lds block descriptors from base policy for the transposed case

match the test case kernel configuration with compute v4

* add comments
2025-10-10 12:57:50 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
fb66b4f5e4 [CK_TILE] fix pk_fp4 compilation for non-gfx950 GPUs (#2983)
See build error log from
https://github.com/ROCm/composable_kernel/issues/2271#issuecomment-3150218542

This PR make vector element access constexpr-safe by avoiding operator[] on
ext_vector_type(2) and replace those sites in the pk_fp4 conversions so they
can be used in constant expressions, as The operator[] on ext_vector_type(2)
isn't allowed in constant expressions, which caused "constexpr function never
produces a constant expression" with a note at x[0]. Using `bit_cast` to a
trivial array representation keeps it constexpr-compatible.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-09 07:43:41 -07:00
Yashvardhan Agarwal
7b6451b68e [CK_TILE] Pooling FWD (Lwpck 3683) (#2956)
* Pooling 2D/3D with refernce

* Tests & cleanup

- added test for ppoling
- cleanup
- removed 2d example

* Comment resolution

- README added
- example target name rectified
- appropriate arg description and comments added

* clang-format

* appropriate blocksize calc

* modifications for future indexing addition

- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future

* some basic fixes

* comment resolutions

* comment resolutions

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-09 16:13:26 +02:00
Aviral Goel
e99356dabc Add Memory pipeline for AQuant Block Scale GEMM (#2987)
* WIP: add memory pipeline boiler plate code that compiles and works for one block

* WIP: tail handling works for memory pipeline

* WIP: numerical errors appears to have gone by adding block_sync_lds()

* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler

* refactror: remove debug print statements and lints

* fix: remove redundant sync barriars

* chore: remove lint

* fix: remove unused code from tile handler and remove redundant block_sync_lds()

* fix: correct parent struct name for memory pipeline

* fix: remove static assert check from parent struct and add it to child struct because not all child structs needs to static assert

* fix: defer block sync lds to just before prefill
2025-10-08 17:22:30 -07:00
Cong Ma
1d4db30af9 [CK TILE GEMM] Refactor the code of transposeC and quantpreshuffle of AQuant Gemm (#2965)
Refactor the code of transposeC and quantpreshuffle of AQuant Gemm to make it easier to maintain.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-10-08 00:05:38 -07:00
Thomas Ning
ae9f29b7d5 add the sync barrier for persistent kernel (#2977) 2025-10-07 11:54:04 -07:00
Thomas Ning
b4a4aa2b64 [CK Tile] CShuffle Tile Permute N all warp compatible (#2966)
* solve the hard_code issue of kM2

* clang format
2025-10-03 09:46:13 -07:00
Cong Ma
6fc28ab493 [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle (#2897)
* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle

When TransposeC and QuantPreshuffle are both true, Aquant generates
correct result.

* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle

- Add unit tests

* Fix bug in is_quantpreshuffle_enabled

* clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-02 11:13:51 -07:00
Max Podkorytov
a7da3c68b9 Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949)
* check in pipeline and policy

for async  load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked

solve merge conflicts

* fix cmakelists

* make it build

* fix? buffer async fence

* relax fences; it appears it only is needed between pairs of ping-pongs

* remove fences

* remove fences

* cleanup and reformat

* add steps annotations

* comment all pipeline steps / remove unexplainable syncs

* clang-format

* add comment

* cleanup kernel types for test

* fix comment

* fix hardcoded warp size

* faithfully copy block gemm from compute v4 policy to async policy

* make async test gfx950 only

* fix cmake logic

* set separate compile options for async

* refine comment in policy

* try update hotloop scheduler

* cleanup comments

* test more K block sizes

* unhardcode Ks, sort of

* add large odd test case

* fix build for quant

* add comment to hot loop scheduler and rename enum

* reformat

* reword the pipeline description

* reformat

* address review / add static asserts / typo fix

* update changelog
2025-10-01 15:38:07 -07:00
Sami Remes
ef43078788 Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe (#2893)
* Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe

* also do the same for amd_buffer_addressing_builtins.hpp

* merge with develop

* fix clang format

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-09-30 15:12:30 -07:00
joyeamd
b60af5bde9 [CK_TILE]enhance elementwise test (#2683)
* enhance elementwise

* fix ci issues
2025-09-30 08:29:37 -07:00
Aviral Goel
bebf0e9d15 Extend Grouped GEMM with MultiD (Single & Double Shared Memory) feature to use persistent kernel option (#2933)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults

* feat(grouped_gemm_multi_d): add functionality to run persistant kernel

* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite

* docs: improved README text based on review comments

* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint
2025-09-29 15:03:56 -07:00
Khushbu Agarwal
81458a6681 Weight Preshuffle Block Scale gemm support (#2877)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

* initial commit

* debugging

* working fp8 for init constant

* fp8 working with all inits

* updated block level code with comments

* changing the loop iter

* debugging

* debugging

* debugging

* code fix

* code clean up

* clang formatted

* Add comment

* code cleanup

* clang formatted

* merge conflicts fixes

* applying the latest int4 changes to the piepline

* fixing test code for updated traits

* Adding gtest

* review comments addressed

* addressing review comments

* remove c++20 code

* added flush cache changes

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
2025-09-29 12:46:37 -07:00
carlushuang
2e9428eb63 hot fix check eid range (#2924)
* hot fix check eid range

* fix clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-09-29 09:38:38 -07:00
SamiAario-AMD
0f10e6d921 [CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769)
* Change the return type of run_gemm_combinations in the basic tests

* Change the return type of run_gemm_combinations in the universal tests

* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4

* Add universal GEMM test for fp8 x pk_i4

* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.

* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>

* No need for utility in test_ck_tile_elementwise_1d

* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8

* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant

* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union

* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced

* Convert from float to bf16 during compilation rather than using magic values

* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8

* Comment out the basic test for fp16 x pk_i4 as it does not pass

* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8

* Add basic and universal GEMM tests for bf8 x pk_i4

* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now

* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now

* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp

* Use explicit macros for enabling and disabling the the constexpr lookup based converters

* Fix two failing tests

* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant

* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16

* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t
2025-09-29 13:34:47 +03:00
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00
Anton Gorenko
c6bfd97c2d [CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
2025-09-27 08:16:10 +05:00
Aviral Goel
a44bea45b2 Integrate Multi D GEMMs into Grouped GEMMs along with unit tests (#2923)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* feat: generalized grouped_gemm_kernel.hpp

* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* docs: add multi d info and trim down outdated content

* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite
2025-09-26 09:59:58 -07:00
rahjain-amd
e92e69318e Disable Rapid Json to be used by Default (#2936)
To enable the json dump we can now build with -DCK_ENABLE_JSON_DUMP=1
2025-09-26 09:05:35 -07:00
Yi DING
32773fe5cb [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918) 2025-09-26 16:42:59 +08:00
Jeff Huang
518d24e662 Add sequence padding and variable length support in fmha (#2932)
* * [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
   `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
   `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that
   specifically validate/benchmark the new padding and variable-length
   functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

* [CK_TILE] add mqa, gqa to sequence padding unit tests

* [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI

* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel
2025-09-26 12:36:27 +08:00
kyle-256
b0a2d99d10 use inline function in hpp (#2922) 2025-09-25 18:29:26 -07:00
Cong Ma
a5d1e25ec7 Congma/ck tile/remove cpp 20 code (#2873)
* Remove C++20 code

C++20 features should not be used in CK. Remove all C++20 code.

* fix c++17 build

* format

* fix merge issue

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-09-25 10:34:28 -07:00
Khushbu Agarwal
b56e5d1d79 Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR  (#2878)" (#2904)"

This reverts commit f161b5b738.

* Fix: sgpr minor issue

* cyclic dependency resolved

* clang formatted

* removing unused variable

* clang formatted

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 10:32:42 -07:00
ltqin
ab22f91a7c fix fmha fwd kernel name (#2880)
* fix fmha fwd kernel name

* if the input and output types are the same, keep the original code
2025-09-24 20:00:10 -07:00
Yi DING
fe0a47a011 [CK_TILE] FMHA BWD Add D96 Instances (#2916) 2025-09-24 17:04:23 +08:00
Sami Remes
dcd33a6ecc [CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903)
* Fix issue with constexpr checks in scaling/cshuffle

* Remove IsLoadableTile

* Move amd_wave_read_first_lane before first usage
2025-09-23 23:08:18 -07:00
Thomas Ning
b159841a06 Fix the gfx950 numerical errors (#2911)
* Update grouped_gemm example and pipeline

* find the root cause error in did not enable the transpose in gfx950 correctly

* Fix v3 pipeline, row and col major

* Disable f8 datatype tests, it fails on gfx950

* fix the abd test by clear the runtime argument unsupported

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
2025-09-23 22:54:52 -07:00
asleepzzz
f161b5b738 Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit 2cbbf5dcb3.
2025-09-23 14:33:51 -07:00
Haocong WANG
959df2a155 [FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900)
* disable cast_tile_pk_fp16_fp32 on gfx950

* fix wrong encoding when hdim is not exponentiation of 2

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-24 00:59:41 +08:00
Haocong WANG
7b16782d7c [CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter

* remove unnecessary change

* enable test cases

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-23 19:59:27 +08:00
Thomas Ning
2cbbf5dcb3 [CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* substitute with the new sgpr read api

* update the CHANGELOG

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* change to static for logic

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
2025-09-23 01:23:56 -07:00
Haocong WANG
b6e8994386 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases
2025-09-23 15:00:46 +08:00
jakpiase
624c46866e [CK_TILE] Add conv bwd weight two stage support (#2855)
* resolved conflicts

* add conv bwd weight twostage

* fix one file

* fixes after review

* fixes

* fixes

* Fix

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-09-22 15:31:25 +02:00
Sami Remes
4363a82bd6 [CK_TILE] Tensor-wise scaled quant gemm kernel (#2846)
* rename gemm_group_quant to gemm_quant

* Add TensorWise quant mode

* Cshuffle epilogue tests with tensor scaling

* Add tensor quant to example

* Don't use readfirstlane for reading scales - doesn't work for some reason

* Add to changelog

* revert include - from a merge problem?

* revert common.hpp include

* revert host.hpp include

* remove unused utility function

* rename quant pipeline problem

* refactor quant tests

* remove aquant utils

* use TEST_F

* fix all tests by changing gemm config

* Use typed tests

* fix copyright
2025-09-19 16:52:35 -07:00
Illia Silin
b765fe78f3 Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883)
This reverts commit 86dd59cd01.
2025-09-19 08:15:02 -07:00
Yi DING
6cf3fdd21c [CK_TILE] FMHA BWD Fix Decode Accuracy (#2881)
* [CK_TILE] FMHA BWD Fix Decode Accuracy

* use s_waitcnt utils
2025-09-19 21:45:02 +08:00
Jeff Huang
86dd59cd01 [CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
   and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well.
   These test cases and benchmarks that specifically validate/benchmark the
   new padding and variable-length functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-19 17:36:49 +08:00
Anton Gorenko
2aec38f9ec [CK_TILE] FMHA Fix synchronization issues in BWD pipelines (#2876)
* Run ctest with --output-on-failure

* Fix synchronization issues in bwd pipelines

The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.

Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
2025-09-19 11:34:45 +05:00
ltqin
dd249f1cd6 Add input fp8 and output bf16 attention (#2726)
* change host using fp16 to check

* fp8 to fp8 compare

* rewrite input parameters

* add not squant

* remove some output code

* for scale = 1

* format

* saturates only for fp8

* add fp8bf16 data type

* add fp8bf16 data type

* fix test fp8 code

* add run_fp8bf16_tests

* change fmha fwd example parameter(adding fp8bf16)

* Support fp8bf16 for Aiter

* Support aiter fp8bf16 in c++

* fix comment about fp8 in readme.md

* add fp8fp32

* add fp8fp32 test

* remove range_q etc.

* format

* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type

* add fp8bf16 to data_type function

* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8

* format

* reset atol for fp8

* fix bug for atol

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-19 14:26:43 +08:00
SamiAario-AMD
47cd0d5cff Add gemm weight preshuffle pk_int_t support (#2858)
* Factor out the three separate copies of load_interleaved_pk_type into a common utility class

* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example

* Remove a duplicate function

* Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included

* I4 support introduced more failing test cases that mirror the existing ones for F8

* Simplify the check for which tests to skip (they all have F8 as A tensor type)

* Add a changelog entry

* add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-09-18 21:26:10 -07:00
Mateusz Ozga
30ab1d6a71 [CK_TILE] Multiple-ABD GEMM example (#2788)
* Multi ABD - initial commit

* Clang-foramt fix

* block gemm, unify the name of CDataType

* Apply chnages to mem-pipeline

* Rollback prefix for DType and Layout

* Gemm Kernel Basic, rename

* WMMA config

* Grouped GEMM

* Clang-format

* Dropout, name

* Review v2

* Move element_wise fn to unnary, remov old ones fn

* clang-format

* Fix issue review

* WP operator adjust to universal gemm

* v2 prepare

* Remove unused comment

* Remove vectorsize

* Rollback

* Adjust pipeline for abd

* Shuffle argument

* CI-fail fix quant

* Fix ag_br pipeline

* Failing tests

* Typo

* Single argument support
2025-09-19 01:14:11 +02:00
Rostyslav Geyyer
14bbc545ea Fix UB caused by reinterpret_cast (#2849)
* Use bit_cast instead of reinterpret_cast to avoid UB

* Apply same fix in ck_tile
2025-09-18 07:12:37 -07:00
Gino Lu
c2997f2b7f [CK_TILE] Refine pk_fp4's fill, pack, and unpack (#2845)
* fix bug

* let pack/unpack return pk_fp4_t

* fix clang-format
2025-09-17 10:54:06 +08:00
Aviral Goel
db79fad16f fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863)
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num

* WIP: add temp config to stress test numerical error correction

* refactor: remove comments
2025-09-16 18:43:19 -07:00
Aviral Goel
2723dbd332 feat(tile_window): print content of tile window for easier debugging (#2827)
* feat(tile_window): add function to print content of tile windowof static length, given a 2D range

* chore: make documentation less verbose
2025-09-16 15:47:21 -07:00
Emily Martins
dee185d80c [CK_TILE] Stream-K GEMM Implementation (#2781)
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. But, the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. That said, for
Stream K, we do not need to use the SplitKBatchOffset class since we are
not using Split K. Thus, this commit changes the splitk_batch_offset
parameter to a int32_t called k_size. This will avoid the constraint of
requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset
class while still providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.

* StreamK Kernel RunGemm Implementation

Stream K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. The UniversalGemmKernel::RunGemm function computes num_loop based on
   a static function of the TilePartitioner. That said, for Stream K,
num_loop must be computed using a member function (namely
GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
   SplitKBatchOffset object which is not used for Stream K since we are
not using Split K.

Thus, this change adds a RunGemm function in the StreamKKernel class.

* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm

* Fix indexing and offset issues for StreamK

These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
  MPerblock or NPerBlock, respectively. This ensures tile window origins
are at the correct locations.
- Fix bug in the tile partitioner's GetTileIdxWithOffset. Now, we apply
  divmod to the given references to ensure correct values are available
to the caller.
- Added documentation in the Stream-K operator()

* Initial gtests for Stream-K

These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases which may differ on different architectures. Thus, we opted to run
cases that are only fully data-parallel (skipping others). A guard was
added to Stream-K's IsSupportedArgument method to ensure that callers
are aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.

* Use GemmPipeline operator() variant that takes hot loop and tail num

In Stream-K, the num_loop value varies per WG and per iteration of a
Stream-K loop. So instead, we use the version of the GemmPipeline's
operator() function that takes in has_hot_loop and tail_num. This is
similar to what is done in Grouped GEMM.

* changes from review: comments, move readfirstlane, remove ifndef

* Switch direction of C tensor traversal & add padding guard

Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 then tile 0. This means that
the iter_end decrements in each iteration of the stream-K while loop.

Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes is such that WGs travel forward
so that their iter_start is incremented and their iter_end remains
fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make is such that the algorithm is correct when
sk_blocks is greater than zero.

* Disable StreamK_M256_N256_K256_SKBlocks12 test case

This instance involves >=3 WGs contributing to each macro tile in C. Due
to the use of atomics, this is resulting in precision errors. These
errors will not persist once the reduction strategy is implemented. We
will re-enable this test then.

---------

Co-authored-by: Astha Rai <astha.rai713@gmail.com>
2025-09-16 16:21:47 -06:00
linqunAMD
b7a806f244 [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f… (#2837)
* [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f3d7b4 )

WarpPerBlock_M * WarpPerBlock_N are not equal with ThreadPerBlock_M * ThreadPerBlock_N /warpSize. we should calculate BlockSize from WarpPerBlock_M * WarpPerBlock_N

To compatible with wave32, function GetBlockSize is added to calculate correct size in host side.

* fix blocksize for all kernel related with generic2dblockshap

* remove constexpr for blocks
2025-09-16 08:47:55 -07:00
Cong Ma
78a9823cb4 [CK TILE GEMM] Add support to convert i4 to OCP FP8/BF8 (#2853) 2025-09-16 07:18:51 -07:00
JH-Leon-KIM-AMD
804065a36b [CK Tile] Grouped conv fwd splitn support (#2776)
## What's New
  Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.

  ## Bug Fix
  Fixed 32-bit integer overflow that caused crashes with 6+ splits:
  - Use `long_index_t` for batch offset calculations
  - Remove redundant GemmM initialization in constructors

  ## How It Works
  - Automatically splits batch dimension when tensor exceeds 2GB
  - Uses grid.z dimension for parallel processing of splits
  - Each split processes a subset of batches independently

  ## Testing
  Verified with tile_example_grouped_conv_fwd:
  - n=3000 (6 splits) ✓
  - n=3500 (7 splits) ✓
  - n=10480 (40 splits) ✓
2025-09-16 16:56:11 +03:00