Commit Graph

3885 Commits

Author SHA1 Message Date
assistant-librarian[bot]
c29a8ab871 Merge commit '58983a323287d41dff8b37c5318942d7159559dc' into develop 2025-10-03 20:12:47 +00:00
Geo Min
8580b33f32 [TheRock CI] Bumping hash for TheRock (#2972)
* Adding new hash for TheRock

* Removing package

[ROCm/composable_kernel commit: 58983a3232]
2025-10-03 12:50:16 -07:00
assistant-librarian[bot]
a3698dab8d Merge commit 'b4a4aa2b64a7a94ab04126545a3dc4f6d3eba847' into develop 2025-10-03 17:11:09 +00:00
Thomas Ning
be09203966 [CK Tile] CShuffle Tile Permute N all warp compatible (#2966)
* solve the hard_code issue of kM2

* clang format

[ROCm/composable_kernel commit: b4a4aa2b64]
2025-10-03 09:46:13 -07:00
assistant-librarian[bot]
44f405c6c1 Merge commit '4c98535456c468cbd36d39de4a92406fa3a012b6' into develop 2025-10-03 14:11:48 +00:00
Illia Silin
f1efeaa564 fix compilation errors on RHEL8 and SLES15 (#2967)
[ROCm/composable_kernel commit: 4c98535456]
2025-10-03 07:08:49 -07:00
assistant-librarian[bot]
f9c9dfa364 Merge commit '0a30c3063068dcefea2291309fbe269812d06956' into develop 2025-10-02 19:11:49 +00:00
Max Podkorytov
fbf8619599 fix build on legacy systems without cpp20 compiler (#2958)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 0a30c30630]
2025-10-02 11:54:45 -07:00
Thomas Ning
0959c6582a add the check of granularity for atomic add (#2959)
[ROCm/composable_kernel commit: cadafde722]
2025-10-02 11:15:24 -07:00
assistant-librarian[bot]
bdbab2394b Merge commit '6fc28ab4934d3668bf4ec96db1e082cf26b11384' into develop 2025-10-02 18:14:39 +00:00
Cong Ma
1aa5b318cb [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle (#2897)
* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle

When TransposeC and QuantPreshuffle are both true, Aquant generates
correct result.

* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle

- Add unit tests

* Fix bug in is_quantpreshuffle_enabled

* clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 6fc28ab493]
2025-10-02 11:13:51 -07:00
assistant-librarian[bot]
a67ea9db58 Merge commit 'a4ab33f539ac9d7209c6274958dc0285eacf3e78' into develop 2025-10-02 14:11:58 +00:00
Anton Gorenko
d1b8e66374 Fix building test_fmha_bwd_fp32 on SLES15 (#2962)
[ROCm/composable_kernel commit: a4ab33f539]
2025-10-02 07:09:49 -07:00
assistant-librarian[bot]
ba74f76cbc Merge commit 'a7da3c68b979bd46c315da09208271d26f5e2900' into develop 2025-10-01 23:11:22 +00:00
Max Podkorytov
de41af5d2e Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949)
* check in pipeline and policy

for async  load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked

solve merge conflicts

* fix cmakelists

* make it build

* fix? buffer async fence

* relax fences; it appears it only is needed between pairs of ping-pongs

* remove fences

* remove fences

* cleanup and reformat

* add steps annotations

* comment all pipeline steps / remove unexplainable syncs

* clang-format

* add comment

* cleanup kernel types for test

* fix comment

* fix hardcoded warp size

* faithfully copy block gemm from compute v4 policy to async policy

* make async test gfx950 only

* fix cmake logic

* set separate compile options for async

* refine comment in policy

* try update hotloop scheduler

* cleanup comments

* test more K block sizes

* unhardcode Ks, sort of

* add large odd test case

* fix build for quant

* add comment to hot loop scheduler and rename enum

* reformat

* reword the pipeline description

* reformat

* address review / add static asserts / typo fix

* update changelog

[ROCm/composable_kernel commit: a7da3c68b9]
2025-10-01 15:38:07 -07:00
Aviral Goel
db83ff21e8 tests: add unit tests for grouped_gemm_multi_d persistent kernels (#2941)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults

* feat(grouped_gemm_multi_d): add functionality to run persistant kernel

* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite

* tests: add unit tests for grouped_gemm_multi_d persistent kernels

* parent 5b0af640369b93849335b126d6826b204ccc43a3
author AviralGoelAMD <aviral.goel@amd.com> 1758919991 +0000
committer AviralGoelAMD <aviral.goel@amd.com> 1759338256 +0000

docs: updated changelog with new feature info

fix wp gemm bug when permuteN is false (#2935)

* fix wp gemm bug when permuteN is false

* code clean

---------

Co-authored-by: valarLip <340077269@qq.com>

fix copy-paste bug in get_matrix_b; re-enable all tests in multi_abd (#2939)

[CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)

* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"

[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd

Use git ls-files to select candidate files for clang format

This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing.  This protects against an known issue where the repo being tested contained "stray files" from a previous test.

[CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769)

* Change the return type of run_gemm_combinations in the basic tests

* Change the return type of run_gemm_combinations in the universal tests

* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4

* Add universal GEMM test for fp8 x pk_i4

* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.

* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>

* No need for utility in test_ck_tile_elementwise_1d

* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8

* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant

* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union

* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced

* Convert from float to bf16 during compilation rather than using magic values

* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8

* Comment out the basic test for fp16 x pk_i4 as it does not pass

* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8

* Add basic and universal GEMM tests for bf8 x pk_i4

* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now

* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now

* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp

* Use explicit macros for enabling and disabling the the constexpr lookup based converters

* Fix two failing tests

* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant

* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16

* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t

Grouped Conv Bwd Data out index calculation optimizations (#2917)

* Grouped Conv Bwd Data index calculation optimizations

* fixes

* refactor instances

* gfx12 fixes

* temporary disable splitK for gfx12

[CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943)

root cause:  AK1 and BK1 may different in class template. so we need calculate k0 per block separately when ksplit is not 1.

[CK][Examples] Extending support for rdna3/4 in following examples: (#2884)

* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

hot fix check eid range (#2924)

* hot fix check eid range

* fix clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

Weight Preshuffle Block Scale gemm support (#2877)

* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

* initial commit

* debugging

* working fp8 for init constant

* fp8 working with all inits

* updated block level code with comments

* changing the loop iter

* debugging

* debugging

* debugging

* code fix

* code clean up

* clang formatted

* Add comment

* code cleanup

* clang formatted

* merge conflicts fixes

* applying the latest int4 changes to the piepline

* fixing test code for updated traits

* Adding gtest

* review comments addressed

* addressing review comments

* remove c++20 code

* added flush cache changes

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>

increase time limit for AITER tests (#2948)

Code style clean-up and documentation

The following changes were made:
- Clean-up of variable namings
- Addition of README
- Removal of num_cu and occupancy args; such options are meant for
  testing purposes and should not be exposed to the user
- Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class
  since we only support one pipeline at the moment.

Fix timing issue in CK_TILE GEMM example (#2940)

* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults

* feat(grouped_gemm_multi_d): add functionality to run persistant kernel

* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint

Fix timing issue in CK_TILE GEMM example (#2940)

* style: clang format

* refactor: removed unused file

[CK] Add command option instance_index and param_mask to run partial ck test (#2889)

* [CK] Add command option instance_index and param_mask to run partial ck test

Many CK test are instance test. it will loop all instance in the instance library. It causes test often out-of-time if we run test on simulator/emulator.
This PR add option instance_index and param_mask to reduce the workload of instance test

instance_index: only run test 1 available instance with specified index.
param_mask: filter the embedded parameter with specified mask

* fix CI error

* fix clang format

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[CK_TILE]enhance elementwise test  (#2683)

* enhance elementwise

* fix ci issues

[ROCm/composable_kernel commit: f2d367262f]
2025-10-01 15:22:46 -07:00
assistant-librarian[bot]
24ac4febf4 Merge commit 'a76c7b10281cf46486e6563ffeb3ee9cb4a20348' into develop 2025-10-01 22:11:21 +00:00
Max Podkorytov
1eafaa321f tweak version (#2954)
[ROCm/composable_kernel commit: a76c7b1028]
2025-10-01 15:00:41 -07:00
Mingtao Gu
8948ac317c updated mxfp4 moe gemm2 config (#2330)
Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 190ad2ccee]
2025-10-01 12:32:55 -07:00
assistant-librarian[bot]
58353e999c Merge commit '7cb1f30cfb6045bccbbd484c5e8e4715e2ebc2f3' into develop 2025-10-01 14:12:07 +00:00
Rostyslav Geyyer
69d1edb5c9 Remove default constructor to fix c++17 build issue (#2953)
* Remove default constructor to fix build issue

* Restore default CTOR, remove constexpr, add init

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 7cb1f30cfb]
2025-10-01 09:02:21 -05:00
assistant-librarian[bot]
f10bae28c9 Merge commit 'ef43078788a91b21284e697ce7707cc7d1797000' into develop 2025-09-30 22:12:47 +00:00
Sami Remes
93ba707be4 Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe (#2893)
* Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe

* also do the same for amd_buffer_addressing_builtins.hpp

* merge with develop

* fix clang format

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: ef43078788]
2025-09-30 15:12:30 -07:00
assistant-librarian[bot]
ee9718a427 Merge commit 'b60af5bde965a2bb007bb582f7836b43ca647b81' into develop 2025-09-30 16:14:10 +00:00
joyeamd
7dcf623fcb [CK_TILE]enhance elementwise test (#2683)
* enhance elementwise

* fix ci issues

[ROCm/composable_kernel commit: b60af5bde9]
2025-09-30 08:29:37 -07:00
linqunAMD
6c4ff0b062 [CK] Add command option instance_index and param_mask to run partial ck test (#2889)
* [CK] Add command option instance_index and param_mask to run partial ck test

Many CK test are instance test. it will loop all instance in the instance library. It causes test often out-of-time if we run test on simulator/emulator.
This PR add option instance_index and param_mask to reduce the workload of instance test

instance_index: only run test 1 available instance with specified index.
param_mask: filter the embedded parameter with specified mask

* fix CI error

* fix clang format

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: e78a897ec0]
2025-09-30 08:24:40 -07:00
assistant-librarian[bot]
631a25de61 Merge commit '28ad8ae5d8558e147f29aba29db569fe25210947' into develop 2025-09-29 23:11:42 +00:00
Hosang
780456f1ce Fix timing issue in CK_TILE GEMM example (#2940)
[ROCm/composable_kernel commit: 28ad8ae5d8]
2025-09-29 15:34:04 -07:00
assistant-librarian[bot]
78f2779870 Merge commit 'bebf0e9d158c13d34c9f263a9551f60fa463bc66' into develop 2025-09-29 22:11:28 +00:00
Aviral Goel
7775768c88 Extend Grouped GEMM with MultiD (Single & Double Shared Memory) feature to use persistent kernel option (#2933)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults

* feat(grouped_gemm_multi_d): add functionality to run persistant kernel

* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite

* docs: improved README text based on review comments

* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint

[ROCm/composable_kernel commit: bebf0e9d15]
2025-09-29 15:03:56 -07:00
Emily Martins
7a9bff148f Code style clean-up and documentation
The following changes were made:
- Clean-up of variable namings
- Addition of README
- Removal of num_cu and occupancy args; such options are meant for
  testing purposes and should not be exposed to the user
- Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class
  since we only support one pipeline at the moment.


[ROCm/composable_kernel commit: 243118c275]
2025-09-29 15:46:24 -06:00
Emily Martins
e81e1b3221 Add CK Tile Stream-K bf16 and fp16 examples
Addition of initial CK Tile Stream-K example for bf16 and fp16. These
examples are minimal. As more functionality and gtests are added for
Stream-K (coming in future PRs), these examples will be expanded.


[ROCm/composable_kernel commit: a3499e38b2]
2025-09-29 15:46:24 -06:00
assistant-librarian[bot]
1ff47b0020 Merge commit '35e116f5c088dc7673856e8a78539243e61044dc' into develop 2025-09-29 20:24:04 +00:00
Illia Silin
c2be5afb7a increase time limit for AITER tests (#2948)
[ROCm/composable_kernel commit: 35e116f5c0]
2025-09-29 13:11:42 -07:00
Khushbu Agarwal
7c20b1f690 Weight Preshuffle Block Scale gemm support (#2877)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

* initial commit

* debugging

* working fp8 for init constant

* fp8 working with all inits

* updated block level code with comments

* changing the loop iter

* debugging

* debugging

* debugging

* code fix

* code clean up

* clang formatted

* Add comment

* code cleanup

* clang formatted

* merge conflicts fixes

* applying the latest int4 changes to the piepline

* fixing test code for updated traits

* Adding gtest

* review comments addressed

* addressing review comments

* remove c++20 code

* added flush cache changes

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>

[ROCm/composable_kernel commit: 81458a6681]
2025-09-29 12:46:37 -07:00
assistant-librarian[bot]
3cf7343e08 Merge commit '2e9428eb63be091b109537e082aa7f0fc05a634d' into develop 2025-09-29 17:12:15 +00:00
carlushuang
47b8632296 hot fix check eid range (#2924)
* hot fix check eid range

* fix clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 2e9428eb63]
2025-09-29 09:38:38 -07:00
assistant-librarian[bot]
2e7d600076 Merge commit '2b684f0a7d2317b1b1f001716acb62f566cc71ee' into develop 2025-09-29 16:12:12 +00:00
Michał Kulikowski
ac4ecdacc5 [CK][Examples] Extending support for rdna3/4 in following examples: (#2884)
* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

[ROCm/composable_kernel commit: 2b684f0a7d]
2025-09-29 09:05:04 -07:00
assistant-librarian[bot]
091c5200ce Merge commit '0f04f020d979875de01274901b8f3cc15e600a8f' into develop 2025-09-29 15:12:26 +00:00
yinglu
f9daaa9724 fix:tf32:fix build fail for all supported targets (#2942)
* fix:tf32:fix build fail for all supported targets

* new fix code

[ROCm/composable_kernel commit: 0f04f020d9]
2025-09-29 08:04:11 -07:00
linqunAMD
b6cb76a555 [CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943)
root cause:  AK1 and BK1 may different in class template. so we need calculate k0 per block separately when ksplit is not 1.

[ROCm/composable_kernel commit: 769c58f133]
2025-09-29 07:56:33 -07:00
Bartłomiej Kocot
ef933ee241 Grouped Conv Bwd Data out index calculation optimizations (#2917)
* Grouped Conv Bwd Data index calculation optimizations

* fixes

* refactor instances

* gfx12 fixes

* temporary disable splitK for gfx12

[ROCm/composable_kernel commit: 5477811670]
2025-09-29 15:59:11 +02:00
assistant-librarian[bot]
f9767142cf Merge commit '0f10e6d9218ce9d00a34a66572c0686dce1e45ea' into develop 2025-09-29 11:12:04 +00:00
SamiAario-AMD
4bc708f401 [CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769)
* Change the return type of run_gemm_combinations in the basic tests

* Change the return type of run_gemm_combinations in the universal tests

* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4

* Add universal GEMM test for fp8 x pk_i4

* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.

* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>

* No need for utility in test_ck_tile_elementwise_1d

* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8

* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant

* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union

* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced

* Convert from float to bf16 during compilation rather than using magic values

* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8

* Comment out the basic test for fp16 x pk_i4 as it does not pass

* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8

* Add basic and universal GEMM tests for bf8 x pk_i4

* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now

* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now

* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp

* Use explicit macros for enabling and disabling the the constexpr lookup based converters

* Fix two failing tests

* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant

* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16

* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t

[ROCm/composable_kernel commit: 0f10e6d921]
2025-09-29 13:34:47 +03:00
assistant-librarian[bot]
2593ecf5b5 Merge commit 'e8842e3c1fe75f4967105914032aced63e233225' into develop 2025-09-27 22:11:27 +00:00
John Afaganis
3e6bc62993 Use git ls-files to select candidate files for clang format
This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing.  This protects against an known issue where the repo being tested contained "stray files" from a previous test.

[ROCm/composable_kernel commit: e8842e3c1f]
2025-09-27 15:47:31 -06:00
assistant-librarian[bot]
0d053396c5 Merge commit '1edd250115bc3edd987b7d038f61290a0460d0a3' into develop 2025-09-27 13:13:37 +00:00
Anton Gorenko
8118d84f77 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd

[ROCm/composable_kernel commit: 1edd250115]
2025-09-27 18:03:48 +05:00
assistant-librarian[bot]
477a605961 Merge commit 'c6bfd97c2d186fd03866c3f5d460bb680ce667a1' into develop 2025-09-27 03:19:57 +00:00