Commit Graph

304 Commits

Author SHA1 Message Date
Khushbu Agarwal
2e38eb4f1c Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"

This reverts commit bbdaf79a52.

* fix regression
2025-06-02 10:25:01 -07:00
valarLip
0fdbf6bcd1 extend buffer load for fp16/bf16x16 (#2270)
* extend buffer load for fp16/bf16x16

* format
2025-06-02 10:29:54 +08:00
Illia Silin
4e561af18c Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260)
This reverts commit fd6a859b44.
2025-05-29 16:22:16 -07:00
joyeamd
fd6a859b44 add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)
* add cshuffle's mxdlperwavepershuffle support, not finished

* add epilogue functions

* add cshuffle's mxdlperwavepershuffle support, not finished

* add epilogue functions

* update cshuffle logic

* update cshuffle_logics

* add some change within review

* update some codes following the code review

* update epilogue logic

* remove from problem

* update codes following review.

* fix some issues
2025-05-29 14:31:14 +02:00
Illia Silin
bbdaf79a52 Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6 [CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm

* adding changes in tile_engine

* Updated code to merge kernel_launch

* removing comments

* Enable rotating buffer changes to flatmm

* Created diff launch_kernel function for rotating buffer

* Simplified calculation using macros

* merge code with new changes in tile_engine

* clang formatted

* Redefine macros
2025-05-27 23:00:58 -07:00
Casey-Shi
128f5a1eab [Tile Engine] Add benchmark for tile engine gemm. (#2193)
* initial commit -m benchmark

* only support profile

* fix

* fix doc

* add default config

* add ci

* fix cmake

* tmp save for gen blobs

* fix bug

* merge

* range config

* test success

* fix

* fix

* move struct

* remove config property

* fix config

* remove comment

* add cmake option & modify

* add changelog

* fix

* format

* add pydantic module to the docker image

* fix

* add benchmark for cold and warm up

* python format

* add asm cache control

* fix README

* remove pydantic module

* modify changelog

* fix config

* recover benchmark_gemm and fix

* format python

* refactor profiler

* fix csv bug

* fix codegen bug

* add kernel instance object

* add benchmark gemm executable

* fix jenkins & delete extra header

* disable warning output & enable default config

* Disable sparsity for invalid warp tile combinations

* fix gemm host template func

* refactor gemm profiler

* filter out some instances

* default config test & fix codegen bug

* add sparse flag to gen more instances

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-05-26 22:32:36 -07:00
Po Yen Chen
c42b957d65 [CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask (#2209)
* Assign block indices reversely if kHasMask=true

* Assign block indices reversely for splitkv kernel
2025-05-27 10:58:58 +08:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Aviral Goel
534d4594d0 Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy (#2214)
* window_origin variable now in base class

* abstracted more functions

* consolidated tile_window_static_distribution and tile_window_static_lengths

* clang format

* skeleton code for tile_window and tile_window_linear consolidation

* more abstraction

* moved variables from child to parent

* clang format

* removed comments

* removed debug code

* removed debug code

* abstracting traits WIP

* consolidated traits

* removed comments and clang formatted
2025-05-21 23:28:00 -07:00
Aviral Goel
fa39c4e798 Add Doxygen Documentation for HostTensor, HostTensorDescriptor, DeviceMem, FillUniformDistribution (#2160)
* added documentation for HostTensorDescriptor

* added documentation for DeviceMem and FillUniformDistribution

* fixed merging error

* fixed host_tensor_descriptor error

* clang format
2025-05-21 10:34:30 -07:00
joye
bc4c7d26af avoid compiling device functions when compile host 2025-05-21 16:52:58 +08:00
joye
8e3e59cb1f fix a commented format issue 2025-05-21 15:28:16 +08:00
joye
2e296ee963 improve codes 2025-05-21 15:22:33 +08:00
joye
7377bc7200 fix some sync issues 2025-05-21 15:05:52 +08:00
joye
619508f89d change lds layout 2025-05-21 14:35:05 +08:00
joye
ba86551534 fix a typo error 2025-05-21 14:29:49 +08:00
joye
37a4a9aae4 fix a CRTP issue 2025-05-21 14:13:09 +08:00
joye
88601c8a05 update async policy 2025-05-21 13:31:13 +08:00
joye
ad32373ff1 fix clang format 2025-05-21 11:56:35 +08:00
joyeamd
8470702ac0 change async pipeline's tile distribution pattern from thread to warp 2025-05-21 11:53:13 +08:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
joye
5079e3f3a2 update lds descriptor 2025-05-20 17:14:29 +08:00
joye
fee156a37d fix a descriptor issue 2025-05-20 16:58:18 +08:00
joye
33c643001e fix some compiling errors 2025-05-20 15:52:33 +08:00
joye
55f3632901 fix a compiling error 2025-05-20 15:16:53 +08:00
joye
c71b3840e8 fix lds descriptor 2025-05-20 15:00:32 +08:00
joye
3500259bdc update async load apis 2025-05-20 13:26:49 +08:00
joye
91113f6464 comment some not gfx950 codes 2025-05-19 10:13:49 +08:00
joye
4a544c4084 comment some not gfx950 codes 2025-05-19 10:11:49 +08:00
joye
5c936dc8f3 add ignore header file 2025-05-19 09:34:26 +08:00
joye
0f2dfa8d38 fix async inline assembly 2025-05-19 09:33:17 +08:00
joye
2a9f0fff5d fix async inline assembly 2025-05-19 09:30:53 +08:00
joye
f0e1dbca49 fix some issues in async load 2025-05-19 08:56:03 +08:00
joye
7a03addc12 fix some compiling errors 2025-05-19 08:28:29 +08:00
Po Yen Chen
791802b381 [CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198)
* Write soft-sign in inline asm

* Change tile idx computation

* Add macro to turn off soft-sign asm opt

* Use simple for loop to avoid register spill

* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
joye
9c429ad0cc add a new pipeline for async load 2025-05-16 09:25:16 +08:00
Khushbu Agarwal
3d8d6e75e4 Adding validation for tile sizes in Tile Engine (#2189)
* Adding validation for tile sizes

* Add architecture in config, and shuffle lines of code in warp_gemm.hpp

* Enable MFMA for gfx950, and invalid tile handling
2025-05-15 10:28:31 -07:00
joye
acdef41575 add a pipeline which copies from v4 2025-05-15 18:00:30 +08:00
joye
0c15891dea fix some compiling errors 2025-05-15 14:54:13 +08:00
joye
8c09b31f72 fix a compiling error 2025-05-15 14:39:21 +08:00
joye
2fdcb55cf1 fix some compiling errors 2025-05-15 14:36:02 +08:00
joye
726ae62113 add async load api 2025-05-15 13:36:46 +08:00
joye
4dd9e8c0c8 add for async load builtin 2025-05-15 13:29:39 +08:00
BingYuan.Zhou
41c17d0a95 fix moe sorting build fail (#2190)
* fix moe sorting build fail

* refile code

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-14 09:31:26 +08:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to silence the compilation error

* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
f05e45ba59 Disable SMFMA gfx90a (#2184)
* sparsity fix for gfx90a

* reverting tile_engine changes
2025-05-12 09:56:23 -07:00