Commit Graph

2413 Commits

Author SHA1 Message Date
Enrico Degregori
8ec5908e0e Fix copyright 2025-09-29 08:27:00 +00:00
Enrico Degregori
ca078f8fc1 Uncomment scale instances 2025-09-29 08:26:49 +00:00
Enrico Degregori
23c9189103 Add examples 2025-09-29 08:26:36 +00:00
Enrico Degregori
c71f2f25eb Add multiple Ds instances 2025-09-29 08:26:23 +00:00
Enrico Degregori
e6b7d5ed65 Add two stage instances (xdl parity) 2025-09-29 08:26:09 +00:00
Enrico Degregori
0b7f0cbbeb Add instances for xdl parity (for pipeline v1) 2025-09-29 08:25:52 +00:00
Enrico Degregori
b1c6973ad1 Remove workaround for 1x1Stride1Pad0 conv specialization 2025-09-29 08:25:28 +00:00
Enrico Degregori
305cbbc3ac Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization 2025-09-29 08:24:56 +00:00
Enrico Degregori
82133901a4 Fix bugs in device implementation:
- rdna3 compilation error
 - gridwise layouts (need to be correct to ensure that CheckValidaity()
   works correctly)
2025-09-29 08:24:31 +00:00
Enrico Degregori
2ba2c5d1b4 check gridwise level validity in device impl for 1 stage D0 2025-09-29 08:24:09 +00:00
Enrico Degregori
6514b15fb5 Add generic instances for bf16 f32 bf16 2025-09-29 08:23:55 +00:00
Enrico Degregori
9dbbb07953 Fix bug and disable splitK=-1 tests for wmma 2025-09-29 08:23:39 +00:00
Enrico Degregori
37b6d28dc0 Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation

See merge request amd/ai/composable_kernel!38
2025-09-29 08:22:53 +00:00
Enrico Degregori
9d7a01f82f Convolution bwd weight device implementation 2025-09-29 08:20:47 +00:00
John Afaganis
e8842e3c1f Use git ls-files to select candidate files for clang format
This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing.  This protects against an known issue where the repo being tested contained "stray files" from a previous test.
2025-09-27 15:47:31 -06:00
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00
Anton Gorenko
c6bfd97c2d [CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
2025-09-27 08:16:10 +05:00
emezh
2aa06fbd45 fix copy-paste bug in get_matrix_b; re-enable all tests in multi_abd (#2939) 2025-09-26 22:55:18 -04:00
lalala-sh
ee9769616a fix wp gemm bug when permuteN is false (#2935)
* fix wp gemm bug when permuteN is false

* code clean

---------

Co-authored-by: valarLip <340077269@qq.com>
2025-09-26 13:28:54 -07:00
Aviral Goel
a44bea45b2 Integrate Multi D GEMMs into Grouped GEMMs along with unit tests (#2923)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature

* feat: generalized grouped_gemm_kernel.hpp

* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0

* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel

* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments

* fix: segfault fix by passing correct parameters for d tensors

* docs: add multi d info and trim down outdated content

* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility

* style: clang format

* fix: incorrect validation method and Dtensor layout in test suite
2025-09-26 09:59:58 -07:00
Geo Min
e40c0acef2 [TheRock CI] Adding MIOpen at HEAD (#2929)
* Adding MIOpen at HEAD

* Adding container and also adding CI run for .github paths

* Adding correct flags

* Adding patches

* Adding exception for ck

* rocm-libraries at new path

* adding global safe dir

* reorder

* Fixing paths

* Adding sharding
2025-09-26 09:08:15 -07:00
rahjain-amd
e92e69318e Disable Rapid Json to be used by Default (#2936)
To enable the json dump we can now build with -DCK_ENABLE_JSON_DUMP=1
2025-09-26 09:05:35 -07:00
Christopher Millette
f92b3c7a1e Update CODEOWNERS 2025-09-26 09:41:33 -06:00
Yi DING
32773fe5cb [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918) 2025-09-26 16:42:59 +08:00
Jeff Huang
518d24e662 Add sequence padding and variable length support in fmha (#2932)
* * [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
   `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
   `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that
   specifically validate/benchmark the new padding and variable-length
   functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

* [CK_TILE] add mqa, gqa to sequence padding unit tests

* [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI

* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel
2025-09-26 12:36:27 +08:00
kyle-256
b0a2d99d10 use inline function in hpp (#2922) 2025-09-25 18:29:26 -07:00
emezh
db2524be2d Verify HostTensorDescriptor when it is created (#2829)
* add proper GEMM layout verification

* Handle "auto" strides.

CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides).
CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims.
ValidateStrides throws if any of the tensor's strides is <=0.
profile_gemm_multiply_add updated to support "auto" strides for tensors.

Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes)
auto-strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1
Note, -1 should be deprecated (use 0 instead)

explicit strides (same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128

explicit strides (not the same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138

mix of explicit and auto strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0

invalid stride
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64
	terminate called after throwing an instance of 'std::runtime_error'
	  what():  Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1
	Aborted (core dumped)

* - add more names to ck::tensor_layout for easier namespace hierarchy checking
- updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp

* added handling of partially initialized strides for GEMM. fixed more tests.

* clang-format and more fixes

* replace long dash by a simple hyphen - causes build failure in CK codegen.

* increase sizeof input, otherwise output size becomes zero or negative with large filter size

* select stride based on layout

* specify layout explicitly to avoid errors in HostTensorDescriptor creation

* add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor`

* Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors

* fix test (incl removing invalid config)

* fix moe examples:
- (in .cpp) add layout argument to non-2D tensors
- (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index)

* fix moe_gemm2 example.

* fix profile and wmma examples

* clean-up early mods for ckprofile. verified with:
```
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138
#
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128
#
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0
# ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0    # not implemented
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128
#
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138
#
example_gemm_add_multiply_dl_fp16
example_gemm_add_multiply_xdl_fp16
#
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128
```

* temporary skip first 8 test configs - they throw error

* temporary skip first 8 test configs in wmma too - they throw error

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 18:22:13 -07:00
Illia Silin
ec4d16b991 Enable CI on gfx1100 (#2930)
* run CI on different versions of gfx11

* do not use gfx1151 systems
2025-09-25 16:10:54 -07:00
Illia Silin
8c1a959913 use default docker for build/test on gfx950 (#2928) 2025-09-25 10:40:45 -07:00
Cong Ma
a5d1e25ec7 Congma/ck tile/remove cpp 20 code (#2873)
* Remove C++20 code

C++20 features should not be used in CK. Remove all C++20 code.

* fix c++17 build

* format

* fix merge issue

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-09-25 10:34:28 -07:00
Khushbu Agarwal
b56e5d1d79 Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR  (#2878)" (#2904)"

This reverts commit f161b5b738.

* Fix: sgpr minor issue

* cyclic dependency resolved

* clang formatted

* removing unused variable

* clang formatted

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 10:32:42 -07:00
Illia Silin
64e61b8647 Add AITER test_mha_varlen (#2927)
* add aiter test_mha_varlen

* don't fail until all aiter test run

* use the original way to run tests, just add new test
2025-09-25 10:00:20 -07:00
Illia Silin
9f6fc9fe09 fix clang format (#2926) 2025-09-25 09:35:35 -07:00
Jobbins
929291741d [Jenkins] Remove 'Jenkins - ' prefix (#2920)
The prefix is causing the status updates from
gitStatusWrapper to be unique to the status updates that
are created by the Jenkins server, which creates duplicates
2025-09-25 09:08:29 -06:00
ltqin
ab22f91a7c fix fmha fwd kernel name (#2880)
* fix fmha fwd kernel name

* if the input and output types are the same, keep the original code
2025-09-24 20:00:10 -07:00
yinglu
df97a286d5 Conv:TF32: add more instances - 1 (#2867)
* conv:tf32:add more instances
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* remove gnhwc/ngchw/ngcdhw instances
2025-09-25 09:27:18 +08:00
linqunAMD
f076f207ce [CK] Fix misc issues in CK examples (#2890)
* [CK] Fix misc CK issues

* revert fp8 change, it causes CI fail.

* resubmit fp8 change
2025-09-24 11:28:20 -07:00
Illia Silin
8fe3838c65 Upgrade to ROCm7.0.1 compiler. (#2909)
* upgrade default docker to rocm7.0.1

* turn on build and test on gfx950 by default

* use rocm-dev instead of rocm

* link libhiprtc for codegen targets

* resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards

---------

Co-authored-by: Astha Rai <astha.rai713@gmail.com>
therock-7.9.0
2025-09-24 10:00:53 -07:00
Yi DING
fe0a47a011 [CK_TILE] FMHA BWD Add D96 Instances (#2916) 2025-09-24 17:04:23 +08:00
Johannes Graner
15fff74503 [CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894)
* Invoker for grouped_conv_fwd

* Invoker for grouped_conv_bwd_data

* Fix incorrect out layout identifier
2025-09-24 10:22:38 +02:00
Jingwei Liao
6805684788 add fmha dtype fp32 (#2914) 2025-09-24 15:28:39 +08:00
Sami Remes
dcd33a6ecc [CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903)
* Fix issue with constexpr checks in scaling/cshuffle

* Remove IsLoadableTile

* Move amd_wave_read_first_lane before first usage
2025-09-23 23:08:18 -07:00
Thomas Ning
b159841a06 Fix the gfx950 numerical errors (#2911)
* Update grouped_gemm example and pipeline

* find the root cause error in did not enable the transpose in gfx950 correctly

* Fix v3 pipeline, row and col major

* Disable f8 datatype tests, it fails on gfx950

* fix the abd test by clear the runtime argument unsupported

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
2025-09-23 22:54:52 -07:00
asleepzzz
f161b5b738 Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit 2cbbf5dcb3.
2025-09-23 14:33:51 -07:00
Haocong WANG
959df2a155 [FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900)
* disable cast_tile_pk_fp16_fp32 on gfx950

* fix wrong encoding when hdim is not exponentiation of 2

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-24 00:59:41 +08:00
Haocong WANG
7b16782d7c [CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter

* remove unnecessary change

* enable test cases

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-23 19:59:27 +08:00
Thomas Ning
2cbbf5dcb3 [CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* substitute with the new sgpr read api

* update the CHANGELOG

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* change to static for logic

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
2025-09-23 01:23:56 -07:00
Haocong WANG
b6e8994386 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases
2025-09-23 15:00:46 +08:00
Yi DING
ad259eeae2 FMHA BWD Avoid SetZero (#2799) 2025-09-23 14:37:48 +08:00
Enrico Degregori
3d29bff2f0 Wmma support for multiple ABD GEMM (#2803)
* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Fix batched gemm for multiABD gridwise implementation

* Fix gemm_universal_reduce with multiABDs gridwise implementation

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-22 18:49:06 -07:00