Commit Graph

395 Commits

Author SHA1 Message Date
aska-0096
7bdf6a7eef merge develop and solve conflicts 2025-08-22 03:15:51 +00:00
aska-0096
f21e916a8c Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue 2025-08-22 03:15:21 +00:00
Po Yen Chen
0db21053e6 [CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669)
* Allow return VGPR get_warp_id() value

* Avoid using SALU in async_load_raw()
2025-08-22 10:17:05 +08:00
Po Yen Chen
4a7ecce096 [CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549)
* Support async load dwordx4

* Enlarge load size on gfx950
2025-08-22 10:13:47 +08:00
Yi DING
4cfa2c7158 [CK_TILE] FMHA BWD Fix Compilation with Bias (#2682)
* [CK_TILE] FMHA BWD Fix Compilation with Bias

* Fix appendkv kApplyRoPE
2025-08-22 10:01:10 +08:00
Bartłomiej Kocot
4212bbc170 [CK Tile] Grouped convolution backward data (#2652)
* base working version for single groupped conv bwd data

* Fix 2d descriptor

* fix groups

* Add 3d support

* fixes

* fixes

* fixes

---------

Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
2025-08-20 05:29:57 -07:00
Emily Martins
071165919f [CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes (#2681)
* CK Tile Stream K Device Ops

Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.

* Stream K Device Ops Fixes

- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
  ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation

* Add WarpTile dimensions to GetName function in StreamKKernel class

* Fix typos in StreamKHostArgs class description.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Apply clang format on updated comment for StreamKHostArgs

* Explicitly specify type for StreamKReductionStrategy enum

* Remove unecessary scopes

* Unify the commenting style to inline comments

* Add explicit casts for occupancy and num_cu in MakeKernelArgs function

Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.

* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.

* Switch type of num_sk_blocks from uint32_t to int32_t

This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-08-19 15:08:52 -06:00
joyeamd
a1589a9667 fix grouped gemm example when wave32 enabled (#2707)
1, delete some unused variables
2, fix BlockSize when wave32 enabled
2025-08-19 16:20:43 +08:00
Ding, Yi
a7dd2a7d13 update bwd 2025-08-19 02:17:43 +00:00
Kevin Choi
380aa8f311 add restrict to applicable functions 2025-08-18 19:37:40 +00:00
linqunAMD
9fcc1ee9fd Support Wave32 in CK_TILE - Part 1 (#2594)
* Support wave32/wave64 in CK_TILE - Part 1

* remove blocksize in kernel launch

* fix build error

* fix clang format

* fix clang format 2

* fix clang format 3

* fix fmha build error

* fix fmha build 2

* fix fmha build 3

* fix build error 4

* address review comment

* update change log

* replace KernelBlockSize with kBlockSize

* fix CI fail

* fix clang format

* address review comment and rebase code.

* fix universal test fail

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-18 10:08:31 -07:00
Tianyuan Wu
7310830d14 Fix CI build error (#2695)
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-08-18 01:45:40 -07:00
Kevin Choi
75c4b9372f remove noinline attr as it causes a lot more s_waitcnt's 2025-08-16 08:15:23 +00:00
Tianyuan Wu
68134b60e4 [CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466)
* WMMA GEMM F16 Implementation

Signed-off-by: root <tianyuwu@amd.com>

* Self-review

Signed-off-by: root <tianyuwu@amd.com>

* ASIC check minor tweak

Signed-off-by: root <tianyuwu@amd.com>

* add missing include file

* Set GPU_TARGETS to gfx11/12 generic

Signed-off-by: root <tianyuwu@amd.com>

* INT8 GFX12

Signed-off-by: root <tianyuwu@amd.com>

* add int8x16 branch

* Fix CI script

Signed-off-by: root <tianyuwu@amd.com>

* Fix typo

Signed-off-by: root <tianyuwu@amd.com>

* Add CK_Tile WMMA example

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Fix CI

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* fix clang format

* Set M/N_Warp Back to Constant

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Use GemmConfigComputeV3 by default

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Remove CK_Tile wmma gemm examples from the CI list

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add atomic add fallback method for gfx11

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Omit copyright year

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Support non-square cases

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add get_device_ip()

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Add atomic add fallback method for gfx11"

This reverts commit 07a79e797d.

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"

This reverts commit ceee918007.

* Revise method name and typos

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Try fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Try fix CI"

This reverts commit 7a7241085e.

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo caused by merge

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Fix typo caused by merging

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

---------

Signed-off-by: root <tianyuwu@amd.com>
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-15 16:22:27 -07:00
Kevin Choi
598e3fec41 remove innerloop, move restrict parameters to mainloop and add noinline attribute. 2025-08-14 12:11:17 +00:00
Yashvardhan Agarwal
7f14772406 CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632)
* CK_TILE: Implement two-stage split-K GEMM with reduction

- Added split-K GEMM with reduction example

* comment resolutions
2025-08-14 10:18:52 +02:00
Gino Lu
e5623d3825 fix wrong nan producion. (#2640)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-08-14 15:12:31 +08:00
Kevin Choi
3340408537 Create inner lambda with restrict parameters, add restrict to some parameters 2025-08-14 07:06:51 +00:00
aska-0096
3bc45ecbc7 save for debug 2025-08-14 03:43:54 +00:00
Aviral Goel
8a698c7445 Minor Improvements in CK TILE memory copy EXAMPLE (#2678)
* Rename vector to ThreadTile

* more notes on tile encoding

* remove number<> from tuple of make_tile_window

* add script to stress test the copy example
2025-08-13 15:24:16 -07:00
joyeamd
bcc38deff7 [CK_TILE]fix elementwise example in gfx11/12 (#2676)
* fix elementwise examples

* improve the robust

* fix ck_tile's elementwise test

* update elementwise test
2025-08-13 15:21:46 -07:00
SamiAario-AMD
28a97865f5 Cleanups (#2631)
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp

* Simplify two templated operator calls by having the templated types deduced automatically

* Simplify two GemmPipeline calls

* Fix GemmPipelineAgBgCrCompV4::GetName

* Refactor use of ArgParser in CK tile GEMM examples

* Update args in README.md to match the implementation in create_args

* Remove some unnecessary include statements

* Rename two variables

* Factor out common code

* Factor out do_verify

* Add and use type aliases for memory operation integral constants

* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-08-13 10:12:08 +02:00
Haocong WANG
05a6e92705 Re-enable optimization for gfx950 fmha fwd (#2671)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

* refactor blockgemm change, isolate to v2;

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-13 14:57:43 +08:00
Cong Ma
452791a3ba Preshuffle AQ matrix in block scale gemm (#2624)
* Preshuffle AQ matrix in block scale gemm

* turns the output to fp16. Increase the repetition time.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-08-12 21:32:51 -07:00
aska-0096
108abf00e0 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-08-13 02:14:26 +00:00
Thrupti Raj Lakshmana Gowda
3f57ec3d2d GEMM Multi D for CK Tile Engine (#2660)
* Readme for GEMM Multi D

* GEMM Multi D partial Progress

* GEMM Multi D partial Progress!

* CK Tile Engine GEMM Multi D : All Python files generated

* Partial Progress

* Partial Progress

* Partial Progress

* Partial Progress : Incorrect Result

* Partial Progress : Debugging

* Partial Progress : Correct Results

* Partial Progress - Incorrect Results

* Partial Progress - Commenting Passthrough bypass logic

* Changing Passthrough to MultiplyMultiply

* Correct Results!

* Fix and debug the pass through feature

* Sample commit

* Correct Results : MultiplyMultiply

* Code Cleanup

* Removing Failed Instances

* Working code before Unary element support

* Custom Elementwise Function support and working implementation for Mul and Add

* Updating README

* Working for Passthrough

* Review Comments : Minor Fixes

* Review Comments : Minor Fixes

* Readme Updated

* Partial Changes after Rebase

* Working Code : Changes after Rebase

* Updating Jenkins file

* Removing default value changed while testing

* Configuration changes in config files

* Tile Handler changes in GEMM Multi D Tile Engine

* Tile Handler changes in GEMM Multi D Example

* Change log for Gemm Multi D in CK Tile Engine

* Configuration changes in config files

---------

Co-authored-by: ThomasNing <thomasning@amd.com>
2025-08-12 16:05:05 -07:00
joyeamd
0856b3f4a2 [CK_TILE]fix ck_tile's moe_sorting example in gfx11 (#2667)
* fix ck_tile's moe_sorting example in gfx11

* fix clang format

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-12 12:33:56 -07:00
aska-0096
0810799e25 refactor blockgemm change, isolate to v2; 2025-08-12 14:25:50 +00:00
asleepzzz
5b39de4bb6 Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit b7322a521a.
2025-08-12 20:27:10 +08:00
Haocong WANG
b7322a521a Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-08-12 19:43:14 +08:00
aska-0096
75f6f6bac4 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-08-12 09:04:41 +00:00
Yi DING
8e1eb0c1ee [CK_TILE] FMHA BWD Decode Pipeline (#2643)
* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2
2025-08-12 17:02:52 +08:00
aska-0096
bcc05eee62 Fix the bug 2025-08-12 08:46:06 +00:00
aska-0096
96d24497f5 fix conflict. disable all v-col instance for fmha fwd 2025-08-12 04:02:41 +00:00
aska-0096
1716171be4 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-08-12 03:52:34 +00:00
Yi DING
4fde1646e5 [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp
2025-08-12 11:11:55 +08:00
aska-0096
1c98007901 clang format 2025-08-12 01:53:31 +00:00
aska-0096
3868ddd708 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-08-11 15:59:40 +00:00
aska-0096
b86f7786e2 tempsave, update the blocksync functions 2025-08-11 14:21:09 +00:00
Yashvardhan Agarwal
191c62967b Fixes to "General 2D Reduction Kernel" (#2535) (#2656)
* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format
2025-08-11 15:01:33 +02:00
aska-0096
7b8052d7ca fix bug in pki4 2025-08-10 06:00:51 +00:00
aska-0096
76cbbb84a2 fix bugs in gemm 2025-08-09 03:25:12 +00:00
aska-0096
efb8549279 fix bug 2025-08-08 17:53:19 +00:00
aska-0096
729e8785fb fix bugs 2025-08-08 15:42:15 +00:00
aska-0096
250dc13c75 fix clangformat with 18.1.3 2025-08-08 09:31:01 +00:00
aska-0096
78edd7303b bug fix, clang format; 2025-08-08 09:04:02 +00:00
aska-0096
3b9fb6af38 Remove unnecessary changes 2025-08-08 08:08:03 +00:00
aska-0096
6bb57c2c57 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-08-08 07:50:12 +00:00
aska-0096
1ecee378d5 remove unnecessary files; rename some files 2025-08-08 06:19:31 +00:00
aska-0096
b4640a9de6 merge fa_decode pipeline into fmha_fwd api 2025-08-08 05:46:18 +00:00