745 Commits

Author SHA1 Message Date
Anton Gorenko
8118d84f77 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler produce more efficient code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
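
The truncation can be reproduced on the host. This is a minimal sketch, not the code from this commit: `broadcast32` is a hypothetical stand-in for a 32-bit-only broadcast such as `__builtin_amdgcn_readfirstlane`, and splitting the value into two halves is one possible way to carry a 64-bit seed or offset through it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 32-bit-only wave broadcast; on the host it
// simply returns its argument.
inline std::uint32_t broadcast32(std::uint32_t v) { return v; }

// Naive version: narrowing to 32 bits drops the upper half of the value.
inline std::uint64_t broadcast64_truncating(std::uint64_t v)
{
    return broadcast32(static_cast<std::uint32_t>(v));
}

// Broadcast the two halves separately and reassemble the 64-bit value.
inline std::uint64_t broadcast64(std::uint64_t v)
{
    const std::uint32_t lo = broadcast32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = broadcast32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

With a seed of `0x123456789ABCDEF0`, the naive version returns `0x9ABCDEF0`, while the split version returns the original value.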

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly cast to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
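
The difference between the implicit value conversion and a bit-preserving conversion can be shown with plain host code. This is an illustrative sketch: `as_value` and `as_bits` are hypothetical helper names, not functions from the library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Value conversion: what an implicit float -> unsigned int cast does.
// The fractional part is discarded and the bit pattern changes.
inline std::uint32_t as_value(float x) { return static_cast<std::uint32_t>(x); }

// Bit-preserving conversion: copy the four bytes of the float unchanged.
inline std::uint32_t as_bits(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32-bit");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}
```

For `1.5f`, `as_value` yields `1` while `as_bits` yields `0x3FC00000`; only the latter is what a raw 32-bit store of a float should write.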

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
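
A rough sketch of the selection idea, assuming the candidates are described only by their bm0 value. `select_bm0` is a hypothetical helper and the real dispatch code may use different criteria.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: pick the smallest bm0 that covers seqlen_q and fall
// back to the largest available bm0 instead of a hard-coded 128.
inline int select_bm0(std::vector<int> bm0_candidates, int seqlen_q)
{
    assert(!bm0_candidates.empty());
    std::sort(bm0_candidates.begin(), bm0_candidates.end());
    const auto it =
        std::lower_bound(bm0_candidates.begin(), bm0_candidates.end(), seqlen_q);
    return it != bm0_candidates.end() ? *it : bm0_candidates.back();
}
```

With candidates {16, 32, 64}, a seqlen_q of 20 selects 32, and a seqlen_q of 1000 falls back to 64 rather than to a tile that does not exist.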

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd

[ROCm/composable_kernel commit: 1edd250115]
2025-09-27 18:03:48 +05:00
Anton Gorenko
d0142f8223 [CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds.
Because warps can progress at different speeds, one warp can rewrite
values that are still being read by another warp.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"

[ROCm/composable_kernel commit: c6bfd97c2d]
2025-09-27 08:16:10 +05:00
lalala-sh
857566c8aa fix wp gemm bug when permuteN is false (#2935)
* fix wp gemm bug when permuteN is false

* code clean

---------

Co-authored-by: valarLip <340077269@qq.com>

[ROCm/composable_kernel commit: ee9769616a]
2025-09-26 13:28:54 -07:00
Aviral Goel
5ebdd30e58 Integrate Multi D GEMMs into Grouped GEMMs along with unit tests (#2923)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* feat: generalized grouped_gemm_kernel.hpp

* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* docs: add multi d info and trim down outdated content

* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite

[ROCm/composable_kernel commit: a44bea45b2]
2025-09-26 09:59:58 -07:00
Yi DING
5d7bc8b578 [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918)
[ROCm/composable_kernel commit: 32773fe5cb]
2025-09-26 16:42:59 +08:00
Jeff Huang
0957b78f76 Add sequence padding and variable length support in fmha (#2932)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
   `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
   `benchmark_fwd_v3.sh` as well. These test cases and benchmarks
   specifically validate/benchmark the new padding and variable-length
   functionalities in both group and batch modes.
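
For illustration, a cumulative-length array can be built with a prefix sum. The layout used here (batch + 1 entries with a leading zero) is an assumption based on common variable-length attention conventions, and `cumulative_lengths` is a hypothetical helper, not part of the example code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a cumulative-length array: out[i] is the offset where sequence i
// starts and out.back() is the total number of tokens.
inline std::vector<std::int32_t> cumulative_lengths(const std::vector<std::int32_t>& lens)
{
    std::vector<std::int32_t> out(lens.size() + 1, 0);
    for(std::size_t i = 0; i < lens.size(); ++i)
        out[i + 1] = out[i] + lens[i];
    return out;
}
```

Effective lengths {3, 5, 2} give {0, 3, 8, 10}. Applying the same helper to physically padded lengths, say {4, 8, 4}, gives the padded start offsets {0, 4, 12, 16} used for addressing.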

* [CK_TILE] Fix build error in fmha unit tests

* [CK_TILE] add mqa, gqa to sequence padding unit tests

* [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI

* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel

[ROCm/composable_kernel commit: 518d24e662]
2025-09-26 12:36:27 +08:00
emezh
3c207a18b0 Verify HostTensorDescriptor when it is created (#2829)
* add proper GEMM layout verification

* Handle "auto" strides.

CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides).
CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims.
ValidateStrides throws if any of the tensor's strides is <=0.
profile_gemm_multiply_add updated to support "auto" strides for tensors.

Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes)
auto-strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1
Note, -1 should be deprecated (use 0 instead)

explicit strides (same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128

explicit strides (not the same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138

mix of explicit and auto strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0

invalid stride
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64
	terminate called after throwing an instance of 'std::runtime_error'
	  what():  Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1
	Aborted (core dumped)
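
The behaviour exercised by the manual tests above can be sketched for a single leading-dimension stride. This is a simplified, hypothetical helper, not the HostTensorDescriptor implementation; the rule that a stride smaller than the packed extent is invalid is inferred from the error output above.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical 2D helper: resolve the leading-dimension stride of a
// rows x cols matrix. Values <= 0 mean "auto" (packed layout).
inline long resolve_leading_stride(long rows, long cols, long stride, bool row_major)
{
    // Smallest stride for which consecutive rows (or columns) do not overlap.
    const long packed = row_major ? cols : rows;
    if(stride <= 0)
        return packed; // "auto" stride
    if(stride < packed)
        throw std::runtime_error("invalid stride");
    return stride;
}
```

For a 128 x 128 row-major matrix, strides 0 and -1 resolve to 128, 130 is kept as is, and 64 throws, which matches the outcomes listed above.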

* - add more names to ck::tensor_layout for easier namespace hierarchy checking
- updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp

* added handling of partially initialized strides for GEMM. fixed more tests.

* clang-format and more fixes

* replace long dash by a simple hyphen - causes build failure in CK codegen.

* increase sizeof input, otherwise output size becomes zero or negative with large filter size

* select stride based on layout

* specify layout explicitly to avoid errors in HostTensorDescriptor creation

* add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor`

* Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors

* fix test (incl removing invalid config)

* fix moe examples:
- (in .cpp) add layout argument to non-2D tensors
- (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index)

* fix moe_gemm2 example.

* fix profile and wmma examples

* clean-up early mods for ckprofile. verified with:
```
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138
#
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128
#
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0
# ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0    # not implemented
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128
#
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138
#
example_gemm_add_multiply_dl_fp16
example_gemm_add_multiply_xdl_fp16
#
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128
```

* temporarily skip first 8 test configs - they throw error

* temporarily skip first 8 test configs in wmma too - they throw error

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: db2524be2d]
2025-09-25 18:22:13 -07:00
Cong Ma
578566f809 Congma/ck tile/remove cpp 20 code (#2873)
* Remove C++20 code

C++20 features should not be used in CK. Remove all C++20 code.

* fix c++17 build

* format

* fix merge issue

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: a5d1e25ec7]
2025-09-25 10:34:28 -07:00
Illia Silin
80f0af1e91 fix clang format (#2926)
[ROCm/composable_kernel commit: 9f6fc9fe09]
2025-09-25 09:35:35 -07:00
linqunAMD
0c45597a4e [CK] Fix misc issues in CK examples (#2890)
* [CK] Fix misc CK issues

* revert fp8 change, it causes CI fail.

* resubmit fp8 change

[ROCm/composable_kernel commit: f076f207ce]
2025-09-24 11:28:20 -07:00
Yi DING
02db6094b9 [CK_TILE] FMHA BWD Add D96 Instances (#2916)
[ROCm/composable_kernel commit: fe0a47a011]
2025-09-24 17:04:23 +08:00
Johannes Graner
408b3945c3 [CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894)
* Invoker for grouped_conv_fwd

* Invoker for grouped_conv_bwd_data

* Fix incorrect out layout identifier

[ROCm/composable_kernel commit: 15fff74503]
2025-09-24 10:22:38 +02:00
Jingwei Liao
e868ffa390 add fmha dtype fp32 (#2914)
[ROCm/composable_kernel commit: 6805684788]
2025-09-24 15:28:39 +08:00
Thomas Ning
8a563fc79d Fix the gfx950 numerical errors (#2911)
* Update grouped_gemm example and pipeline

* find the root cause of the error: the transpose was not enabled correctly on gfx950

* Fix v3 pipeline, row and col major

* Disable f8 datatype tests, it fails on gfx950

* fix the abd test by clearing the unsupported runtime argument

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>

[ROCm/composable_kernel commit: b159841a06]
2025-09-23 22:54:52 -07:00
Haocong WANG
0eede5af24 [CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter

* remove unnecessary change

* enable test cases

---------

Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: 7b16782d7c]
2025-09-23 19:59:27 +08:00
Haocong WANG
d85ca87d97 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases

[ROCm/composable_kernel commit: b6e8994386]
2025-09-23 15:00:46 +08:00
Yi DING
bfa145c418 FMHA BWD Avoid SetZero (#2799)
[ROCm/composable_kernel commit: ad259eeae2]
2025-09-23 14:37:48 +08:00
Enrico Degregori
12225ce645 Wmma support for multiple ABD GEMM (#2803)
* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Fix batched gemm for multiABD gridwise implementation

* Fix gemm_universal_reduce with multiABDs gridwise implementation

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3d29bff2f0]
2025-09-22 18:49:06 -07:00
jakpiase
30403d077b [CK_TILE] Add conv bwd weight two stage support (#2855)
* resolved conflicts

* add conv bwd weight twostage

* fix one file

* fixes after review

* fixes

* fixes

* Fix

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 624c46866e]
2025-09-22 15:31:25 +02:00
Sami Remes
8d2a444c55 [CK_TILE] Tensor-wise scaled quant gemm kernel (#2846)
* rename gemm_group_quant to gemm_quant

* Add TensorWise quant mode

* Cshuffle epilogue tests with tensor scaling

* Add tensor quant to example

* Don't use readfirstlane for reading scales - doesn't work for some reason

* Add to changelog

* revert include - from a merge problem?

* revert common.hpp include

* revert host.hpp include

* remove unused utility function

* rename quant pipeline problem

* refactor quant tests

* remove aquant utils

* use TEST_F

* fix all tests by changing gemm config

* Use typed tests

* fix copyright

[ROCm/composable_kernel commit: 4363a82bd6]
2025-09-19 16:52:35 -07:00
Illia Silin
ee43f0f0be Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883)
This reverts commit 7ede589f4b.

[ROCm/composable_kernel commit: b765fe78f3]
2025-09-19 08:15:02 -07:00
Jeff Huang
7ede589f4b [CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
   and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well.
   These test cases and benchmarks specifically validate/benchmark the
   new padding and variable-length functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: 86dd59cd01]
2025-09-19 17:36:49 +08:00
ltqin
fd80c78f50 Add input fp8 and output bf16 attention (#2726)
* change host using fp16 to check

* fp8 to fp8 compare

* rewrite input parameters

* add not squant

* remove some output code

* for scale = 1

* format

* saturates only for fp8

* add fp8bf16 data type

* add fp8bf16 data type

* fix test fp8 code

* add run_fp8bf16_tests

* change fmha fwd example parameter(adding fp8bf16)

* Support fp8bf16 for Aiter

* Support aiter fp8bf16 in c++

* fix comment about fp8 in readme.md

* add fp8fp32

* add fp8fp32 test

* remove range_q etc.

* format

* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type

* add fp8bf16 to data_type function

* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8

* format

* reset atol for fp8

* fix bug for atol

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: dd249f1cd6]
2025-09-19 14:26:43 +08:00
Mateusz Ozga
64e1f86daf [CK_TILE] Multiple-ABD GEMM example (#2788)
* Multi ABD - initial commit

* Clang-format fix

* block gemm, unify the name of CDataType

* Apply changes to mem-pipeline

* Rollback prefix for DType and Layout

* Gemm Kernel Basic, rename

* WMMA config

* Grouped GEMM

* Clang-format

* Dropout, name

* Review v2

* Move element_wise fn to unary, remove old fns

* clang-format

* Fix issue review

* WP operator adjust to universal gemm

* v2 prepare

* Remove unused comment

* Remove vectorsize

* Rollback

* Adjust pipeline for abd

* Shuffle argument

* CI-fail fix quant

* Fix ag_br pipeline

* Failing tests

* Typo

* Single argument support

[ROCm/composable_kernel commit: 30ab1d6a71]
2025-09-19 01:14:11 +02:00
Yi DING
8bc9d6226d [CK_TILE] FMHA Test Ignore Known Errors (#2872)
[ROCm/composable_kernel commit: 7ee7915e94]
2025-09-18 16:51:21 +08:00
yinglu
19463895a8 TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852)
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"

This reverts commit 954db22b39.

* fix compile error on gfx12x

* only run tf32 example on gfx942

* only build tf32 instance on gfx942

* ckProfiler:only support tf32 in gfx942

* delete unuseful messages

[ROCm/composable_kernel commit: dd7af118d7]
2025-09-17 14:50:15 -07:00
Aviral Goel
a7a7fa13bb build(grouped_gemm): added appropriate compiler flag to resolve numerical error for fp8 on gfx950 (#2868)
[ROCm/composable_kernel commit: 7c934b72ab]
2025-09-17 11:04:21 -07:00
Michał Kulikowski
5334a45c0e [CK][Examples] - fixing grouped_conv_bwd_weight command parser. (#2840)
-added parameter to change group count for grouped_gemm examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

[ROCm/composable_kernel commit: 5c4f52a02a]
2025-09-17 10:39:48 -07:00
Aviral Goel
0fb1cfa4b7 fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863)
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num

* WIP: add temp config to stress test numerical error correction

* refactor: remove comments

[ROCm/composable_kernel commit: db79fad16f]
2025-09-16 18:43:19 -07:00
Wojciech Laskowski
302398f3fd Added wmma support for gemm quantization: (#2841)
- profiler for gemm quantization for DL/XDL
- tests for gemm quantization for DL/XDL
- implementation for gemm quantization for WMMA
- profiler/tests for gemm quantization for WMMA

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: f97b2a3f5d]
2025-09-16 16:23:29 -07:00
Aviral Goel
ce0feff1af test(grouped_gemm): add gtests for the example/grouped_gemm_preshuffle to ensure its integrity (#2811)
* test(grouped_gemm): add gtests for the example to maintain its integrity

* test(grouped_gemm_preshuffle): add prefill variant to testbed to cover wider range

* fix: removed residue code to make b_shuffle() work again

* test(grouped_gemm_preshuffle): limit the test suite to gfx942 arch as it fails on gfx90a

* build: add gfx950 as build target for gtests

* test(grouped_gemm_preshuffle): temporarily disable fp8 prec tests due to numerical errors

* fix(grouped_gemm_preshuffle): resolved fp8 tests failure on gfx950 by adding correct compiler flag

[ROCm/composable_kernel commit: 48e08c6429]
2025-09-16 15:43:30 -07:00
Haocong WANG
4db9e47cd5 [CK_TILE] fix bug when iperm =0 in fmha fwd (#2820)
* fix bug when iperm =0 in fmha fwd

* Disable f8 fmha smoke test until fix pr merged

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 59cb906482]
2025-09-16 15:07:10 +08:00
Po Yen Chen
9c6cca5bc5 [CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels

* Use more intuitive remap_opt value

* Fall back to original remapping if seqlen_q >= 64K

* Use GenericAttentionMask to reduce mask computation

* Avoid unnecessary boundary check for IsMasking=false case

* Fix wrong kernel entry specifier

* Add s_nop to prevent delay wave0-3

* Refine scheduling

* Remove unnecessary sched_group_barrier()

* Move sched_group_barrier() call to scheduler

* Replace inline asm s_setprio with intrinsics

* Rephrase comments

* Expand some o_acc rescaling insts to avoid SIMD idle

* Fix block idx special mapping logic

* Tune block index mapping for causal=False cases

* Tune block index mapping for causal=True cases

* Fix wrong vmcnt()

* Remove parameter name

* Use boolean option for turn on/off causal mask

* Update benchmark_fwd_v3.sh option usages

* Add option if compiler supports it

[ROCm/composable_kernel commit: 7fbc9d6c97]
2025-09-16 11:32:38 +08:00
linqunAMD
71dc8a9d4d Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12

* update cmake file

* fix all instance build (cmake)

* fix batched_gemm_gemm(cmake)

* rebase cmake files

* fix cmake build error

* remove CK_ENABLE_DYNAMIC_WARP_SIZE

* update cmake build error2

* fix gfx11 build

CK_USE_XDL is enabled on gfx11 and gfx12

* fix gfx10 build

* fix gfx11 error

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: f22740df82]
2025-09-15 10:59:25 -07:00
Illia Silin
954db22b39 Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)
This reverts commit d4dbf93119.

[ROCm/composable_kernel commit: 03b59f8c76]
2025-09-15 08:27:04 -07:00
lym
d4dbf93119 feature:tf32:add initial conv3d fwd kernel support (#2763)
[ROCm/composable_kernel commit: c51102144f]
2025-09-15 21:03:00 +08:00
Cong Ma
9b65e9ec43 [CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839)
- TiledMMAPermuteN_ should be set to true when the config is GemmConfigPreshufflePrefill

[ROCm/composable_kernel commit: e5d73da2da]
2025-09-13 20:54:08 -07:00
linqunAMD
930f95d4a6 [CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821)
* [CK_TILE] Enable ck_tile test on gfx11 & gfx12

* revert an unnecessary change

* enable pk_int4 on gfx11 & gfx12

* revert .pre-commit-config.yaml

[ROCm/composable_kernel commit: b0ee317d83]
2025-09-12 12:45:14 -07:00
Anton Gorenko
e76f294f85 [CK_TILE] FMHA Reduce build time by disabling instances that are not tested (#2834)
* Use lse = false for PagedKV tests

There are no instances with lse = true so splitkv is actually launched
as a fallback.

* Reduce build time by disabling instances that are not tested

[ROCm/composable_kernel commit: 847834a408]
2025-09-12 12:44:25 -07:00
Wojciech Laskowski
f2edb06bb0 WMMA support for GEMM reduce (#2823)
Added gemm + reduce instance library for RDNA4. This includes:

- New device implementation running GEMM and reduction kernel
- instances for wmma (xdl parity)
- examples for wmma (xdl parity)
- tests for existing xdl and wmma

[ROCm/composable_kernel commit: b25d4d684a]
2025-09-12 21:36:43 +02:00
linqunAMD
07def6b13d Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724)
* Fix example

* fix build error

* update pk_i4 & moe test case

* fix all instance build (examples)

* fix batched_gemm_gemm (example)

* disable example_gemm_bias_softmax_gemm_permute on gfx11

* remove unnecessary disable gfx11

* update tests

* update tests2

[ROCm/composable_kernel commit: 321627aec5]
2025-09-12 08:17:07 -07:00
Michał Kulikowski
064eb037db [CK][EXAMPLES] (#2826)
-Added parameter to enable/disable verification and timing of kernel in various examples that missed it.
-Added parameter to change number of groups to execute in grouped_gemm_examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

[ROCm/composable_kernel commit: ffe9775e70]
2025-09-11 12:33:00 -07:00
Aviral Goel
6e774b512a fix(copyright header): add header to missing files (#2807)
[ROCm/composable_kernel commit: f3239395dc]
2025-09-11 12:27:08 -07:00
Cong Ma
741ddfe584 [CK TILE GEMM] Fixed the regression issue with transpose C in Quant Gemm (#2819)
The numerical error was introduced by merging row/col quant; it is now fixed.

[ROCm/composable_kernel commit: 2ed39f8d91]
2025-09-11 11:38:16 -07:00
linqunAMD
a303edcfb0 [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808)
* [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12

* fix gemm_splitk_two_stage

* revert .pre-commit-config.yaml

[ROCm/composable_kernel commit: 60d3e8f504]
2025-09-11 07:27:33 -07:00
linqunAMD
eaf1fa7edb [CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810)
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 0b9a638f26]
2025-09-11 12:41:20 +08:00
Khushbu Agarwal
2eb6cbb6a8 [CK-Tile] Fix quant example code (#2813)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

---------

Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 80a61afb9b]
2025-09-10 17:15:39 -07:00
Anton Gorenko
57d63b3e70 [CK_TILE] Add gtests for FMHA (#2744)
* Improve random number generation

* use different seed for each input (Q, K, V...);
* use deterministic generation of:
  * seqstart_q/k (for group mode);
  * block_table (for paged-kvcache);
  * cache_batch_idx (for kvcache);

* Extract arg_parser-related code from run functions to use them as tests

* Split examples into main programs and fmha runners, build instances separately

* Add dummy tests that use instances and runners

* Fix a missed corner case of f32->f8 conversion

When the value is < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.
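
The rounding rule can be sketched as follows. This is a simplified host-side illustration, not the conversion routine from the library: `nearest_subnormal_steps` is a hypothetical helper, and the `min_denormal` value is supplied by the caller because it depends on the f8 format in use.

```cpp
#include <cassert>
#include <cmath>

// Number of min-denormal steps nearest to |x|, using round-to-nearest
// (ties to even under the default floating-point rounding mode).
// A result of 1 corresponds to the encoding 0b1, a result of 0 to zero.
inline int nearest_subnormal_steps(float x, float min_denormal)
{
    assert(min_denormal > 0.0f);
    return static_cast<int>(std::nearbyint(std::fabs(x) / min_denormal));
}
```

With an illustrative `min_denormal` of 2^-9, an input of 0.75 * 2^-9 rounds to one step (the min denormal), while 0.25 * 2^-9 rounds to zero.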

* Fix incorrect fp8 scales for P and O in validation code

DataTypeConfig was incorrectly compared with fp8_t.

* Add host generation of dropout random values and use it for validation

Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.

* Implement tests from smoke_test_bwd.sh

* Return result as enum to distinguish failure and missing instance

* Add tests for bwd features: bias, alibi, dropout

* Implement tests from smoke_test_fwd.sh

* Pass seqlen_q/k as vectors to fwd and bwd runners

* Add tests for fwd features: bias, alibi, dropout

* Add tests for pagedkv and splitkv

* Fix conditions when to use splitkv and pagedkv kernels

splitkv was executed only when use_kvcache was set, where use_kvcache == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.

* Add tests for appendkv

* Use is_v_rowmajor = true because there are no instances with column layout anymore

* Split public and private compile options for instances

Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.

* Improve parsing validation in bias and mask

* Pass bias as string for consistency with mask

* Catch parsing and other exceptions

* Add bwd test for deterministic flag

* Initialize fp8 tensors (-init=ufq) similarly to uf

* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null

seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).

* Fix use_pagedkv value when kvcache = true but page_block_size = 0

In this case block_table_ptr is nullptr which is accessed in the kernel.

* Clean up bwd tests

* Unify fwd tests for f16/bf16 and fp8

* Use better explicit instantiation declaration for fmha_bwd<2>

* Use the same seed for all tests, allow overriding it with an env variable

* Undo clang-format of one irrelevant file

For some reason my local clang-format-18 and the one in CI work differently.

* Do not build instances and tests on unsupported archs

* Build instance libraries as OBJECT library

* CI: Enable sccache for HIP

There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache

* Add tests to REGRESSION_TESTS

* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0

The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.
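
A small sketch of why a wrong kN0 undersizes the buffer, under the assumption that the deterministic path keeps one dq_acc slice per kN0-wide block of seqlen_k. `dq_acc_splits` is a hypothetical helper, not the runner's code.

```cpp
#include <cassert>

// Integer division rounded up.
inline int ceil_div(int a, int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Hypothetical: number of dq_acc slices when one slice is kept per
// kN0-wide block along seqlen_k.
inline int dq_acc_splits(int seqlen_k, int kN0) { return ceil_div(seqlen_k, kN0); }
```

For seqlen_k = 512, assuming kN0 = 128 allocates 4 slices, but a tile with kN0 = 32 writes 16, so the buffer has to be sized from the kN0 of the tile that is actually launched.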

* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options

The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows changing FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.

* Fix formatting and names

[ROCm/composable_kernel commit: ec006bb8e0]
2025-09-10 08:06:14 +05:00
linqunAMD
5b17f135b7 [CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795)
BlockWarps and WarpTile in Generic2dBlockShape are wave-size dependent, which causes a mangled-name mismatch between host and device side.

Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape

[ROCm/composable_kernel commit: c254f3d7b4]
2025-09-10 08:29:20 +08:00
linqunAMD
7022f756e6 [CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790)
1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout
2. Add FlatmmConfig16_Wmma for gfx11 and gfx12

[ROCm/composable_kernel commit: df4ee556d6]
2025-09-10 08:28:00 +08:00