aska-0096
c69d450b22
bm0=32 not correct, support hdim48, performance not good
2025-07-11 06:55:02 +00:00
aska-0096
18669925cc
temp save, change all instance to 1wave
2025-07-10 04:29:33 +00:00
aska-0096
18686cfe5b
tempsave, fmha_decode
2025-07-08 08:37:20 +00:00
aska-0096
47565f21a5
temp save, waiting for debug
2025-06-21 15:02:57 +00:00
aska-0096
e0a634ef97
save an example for __bf16 type
2025-06-19 05:11:52 +00:00
aska-0096
4bd5fd4a3c
fix bwd code
2025-06-18 07:27:24 +00:00
aska-0096
69809d9513
Fix for fwd/bwd kernel build filter
2025-06-18 06:38:03 +00:00
Thomas Ning
64a2fda713
Revert "Fix default epilogue ( #2358 )" ( #2364 )
...
This reverts commit cd606f72c1 .
2025-06-17 22:43:05 -07:00
carlushuang
a4e1248dba
[CK_TILE] moe_sorting support "local_tokens" feature for EP case ( #2335 )
...
* support local_token for hipgraph
* update README
* fix comment
* fix fmoe example
2025-06-18 10:49:43 +08:00
Max Podkorytov
cd606f72c1
Fix default epilogue ( #2358 )
...
* [ck-tile] fix default epilogue in gemm universal
* argument validation needs vector size D
* operator() needs to specify dram windows
* copy/paste from cshuffle epilogue
* clang-format
* mark unused argument
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-06-17 17:30:21 -07:00
linqunAMD
0eb8974502
[CK_TILE] Support multi-config in tile_example_gemm_universal ( #2240 )
...
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when you need to compare gemm perf with different tile/pipeline configs
- we can also use it to simplify the code for wmma support in the future.
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Address review comments
* rebase code and fix clang format.
* fix clang format
* support pipeline v5.
* fix merge conflict
* address review comment
* add missing file
* address review comment v2
* fix build error
2025-06-17 17:27:46 -07:00
Satyanvesh Dittakavi
4c57157d50
Do not use warpSize as compile time constant as it is removed ( #2320 )
...
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com >
2025-06-17 11:54:30 -07:00
Thomas Ning
3c4cdfac4f
Fix the CK Tile related operators ( #2356 )
...
* fix the flatmm
* Fix the pipeline
* address the comment
2025-06-16 17:38:52 -07:00
Illia Silin
5523df4b2d
Revert "fix the flatmm ( #2349 )" ( #2352 )
...
This reverts commit d996bc78be .
2025-06-16 07:54:55 -07:00
Thomas Ning
d996bc78be
fix the flatmm ( #2349 )
2025-06-16 02:17:53 -07:00
ruanjm
b34c234f51
Add support for specifying valid flag when fetching elements for tile_scatter_gather ( #2332 )
...
* Add support for specifying valid flag when fetching elements for tile_scatter_gather
Add constexpr for operator[] of TrueGenerator
* Use different path when valid is enabled
2025-06-16 17:17:03 +08:00
carlushuang
fb97f75099
hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem ( #2348 )
2025-06-15 22:49:04 -07:00
Mateusz Ozga
bd96ac9742
[CK_TILE] Multiple-D GEMM example ( #2219 )
...
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a reference function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails
2025-06-13 19:39:11 +02:00
kylasa
5f1ad09b61
Code drop for 2 warp ping pong scheduler along K dimension. ( #2276 )
...
* Code drop for 2 warp ping pong scheduler along K dimension.
* Addressing code review comments.
* Addressing Clang formatting issues.
* Addressing build issues.
* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.
* Fix for LDS memory size for GEMM pipelines.
* Addressing code review feedback comments.
* Change log update.
* Addressing code review comments and build issues.
* Added new policy for pipeline specific logic about LDS needs.
* Clang Fix during build.
2025-06-12 18:24:02 -07:00
Thomas Ning
f59b8c7d3d
OCP FP8 Macro restructure ( #2331 )
...
* solved the problem
2025-06-12 09:46:33 -07:00
carlushuang
8aff45a8af
[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel ( #2327 )
...
* refactor subtoken logic to let more kernel pickup mp kernel
* typo
2025-06-12 11:44:22 +08:00
Thomas Ning
06e0b8436c
Epilogue cshuffle Improvement ( #2312 )
...
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
* solve the previous PR error, refine the code
* Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
* Comment addressed
* handling tile_engine failing case
* handling tile_engine failing case
---------
Co-authored-by: joyeamd <John.Ye@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
Co-authored-by: khushbu agarwal <khuagarw@amd.com >
2025-06-10 22:44:50 -07:00
Thomas Ning
14d229d6c8
fix on the typo ( #2326 )
2025-06-10 16:34:33 -07:00
Khushbu Agarwal
bd270fe4bc
fix flatmm kernel for bigger size for fp16 datatype ( #2302 )
2025-06-10 11:13:40 -07:00
Eisuke Kawashima
4e586ca958
chore: unset executable permission ( #2303 )
...
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com >
2025-06-10 09:13:59 -07:00
John Afaganis
6635d1bb88
Remove usage of 'warpSize' variable as it has been deprecated ( #2295 )
...
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com >
2025-06-10 07:34:54 -07:00
carlushuang
2e0536269e
hot fix ( #2315 )
2025-06-10 20:35:28 +08:00
MHYangAMD
9fcf21a4ec
Fix fmha fwd precision issue on MI3XX series ( #2285 )
...
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero.
For example, when K and V are fixed to all 1 and Q is random, the outputs are
expected to be all 1. But we found that some outputs are incorrectly
0.9995, and their error is smaller than the atol 0.001 (1 - 0.9995 =
0.0005 < 0.001). Thus, ck does not report this error.
* Add option to switch rtn/rtz for fmha fwd
2025-06-10 15:03:23 +08:00
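The rounding effect described in the fmha fwd fix above (9fcf21a4ec) can be reproduced on the host. The sketch below is not CK code: `to_fp16_rtn` and `to_fp16_rtz` are hypothetical helper names, and the conversion relies on the half-precision format of Python's `struct` module (round-to-nearest-even) rather than on `__builtin_amdgcn_cvt_pkrtz`. It only illustrates why a probability just below 1 can land on 0.9995 under round-toward-zero and still pass an atol of 0.001.

```python
import struct


def _bits_to_fp16(bits):
    """Interpret a 16-bit pattern as an IEEE half and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_fp16_rtn(x):
    """Round x to the nearest fp16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp16_rtz(x):
    """Round x toward zero to fp16.

    Hypothetical emulation: assumes x is finite and inside the fp16 range.
    """
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    # If nearest-rounding moved away from zero, step back one ulp toward zero.
    if abs(_bits_to_fp16(bits)) > abs(x):
        bits -= 1
    return _bits_to_fp16(bits)


p = 0.9999                      # a softmax probability that should round to 1
rtn = to_fp16_rtn(p)
rtz = to_fp16_rtz(p)
atol = 0.001

print(rtn, rtz)                 # 1.0 0.99951171875
print(abs(1.0 - rtz) < atol)    # True: the rtz error stays below the tolerance
```

The gap between 1.0 and the next smaller fp16 value is 2^-11 (about 0.00049), which is why the truncated result sits at roughly 0.9995 and the absolute-tolerance check does not flag it.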
carlushuang
65835c0bbb
MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! ( #2305 )
2025-06-10 10:40:54 +08:00
Aviral Goel
5a0bd157db
Code Refactor for check_err.hpp ( #2284 )
...
* refactor & add documentation
* removed return datatype from doxygen comments
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
---------
Co-authored-by: John Afaganis <john.afaganis@amd.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
2025-06-08 13:41:27 -07:00
Sami Remes
1c6f83df6c
[CK_TILE] Tileloop persistent gemm - resubmit ( #2299 )
...
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 )
This reverts commit 233e274077 .
* Add missing header for kentry
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-06-06 14:18:49 -07:00
valarLip
8482977a37
extend buffer load to support load 32 bf16/fp16 at same time ( #2291 )
2025-06-06 17:21:19 +08:00
Andriy Roshchenko
00247e3c29
Optimized GEMMs for MX FP4/8 ( #2294 )
...
Adds V3 GEMM pipeline for MX FP4 and MX FP8
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275 )
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
2025-06-05 13:54:15 -06:00
Illia Silin
233e274077
Revert "[CK_TILE] Tile loop persistent gemm kernel ( #2191 )" ( #2293 )
...
This reverts commit ffb52783d0 .
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59
[CK_TILE] Move GEMM pipeline tail handling logic to pipelines ( #2222 )
...
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0
[CK_TILE] Tile loop persistent gemm kernel ( #2191 )
...
* Implement tile loop persistent gemm kernel
* Enable timing
* Add tests for persistent gemm
* Fix formatting
* Fix gemm_basic
* Rename True/False to Persistent/NonPersistent
* Use only one set of layouts for persistent tests
* Fix gemm example persistent template parameter
* Fix formatting
2025-06-04 11:46:28 +03:00
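As a rough illustration of what a "tile loop persistent" kernel (ffb52783d0 above) generally means, the sketch below models the scheduling on the host. It assumes a grid-stride loop over a row-major tile grid, which is a common way to build persistent kernels; the commit text does not confirm that CK uses exactly this mapping, and `tiles_for_workgroup` is a hypothetical name.

```python
def tiles_for_workgroup(block_id, grid_size, tiles_m, tiles_n):
    """Return the (tile_m, tile_n) coordinates one workgroup would process.

    Assumption: a fixed number of workgroups (grid_size) is launched and each
    one walks the flattened tile index space with a stride of grid_size.
    """
    num_tiles = tiles_m * tiles_n
    coords = []
    tile_id = block_id
    while tile_id < num_tiles:
        coords.append((tile_id // tiles_n, tile_id % tiles_n))
        tile_id += grid_size
    return coords


grid_size = 4
schedule = [
    tiles_for_workgroup(b, grid_size, tiles_m=3, tiles_n=5)
    for b in range(grid_size)
]
for block_id, coords in enumerate(schedule):
    print(block_id, coords)
```

With 15 tiles and 4 workgroups, every tile is visited exactly once and no workgroup handles more than one tile above the average, in contrast to a non-persistent launch that starts one workgroup per tile.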
Khushbu Agarwal
59a85cb4bc
[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes ( #2262 )
...
* debugging issue
* debugging issue
* debugging
* debugging
* reverting debugging code
* clang formatted
* updating default_config.json
* fix ci failure
* clang formatted
2025-06-03 20:16:10 -07:00
Khushbu Agarwal
2e38eb4f1c
Rotating buffer PR CI fix ( #2257 )
...
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200 )" (#2256 )"
This reverts commit bbdaf79a52 .
* fix regression
2025-06-02 10:25:01 -07:00
valarLip
0fdbf6bcd1
extend buffer load for fp16/bf16x16 ( #2270 )
...
* extend buffer load for fp16/bf16x16
* format
2025-06-02 10:29:54 +08:00
Illia Silin
4e561af18c
Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue ( #2185 )" ( #2260 )
...
This reverts commit fd6a859b44 .
2025-05-29 16:22:16 -07:00
joyeamd
fd6a859b44
add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue ( #2185 )
...
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
2025-05-29 14:31:14 +02:00
Illia Silin
bbdaf79a52
Revert "[CK_tile] Add rotating buffer feature for universal gemm ( #2200 )" ( #2256 )
...
This reverts commit 99857e10e6 .
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6
[CK_tile] Add rotating buffer feature for universal gemm ( #2200 )
...
* Add rotating buffer feature for universal gemm
* adding changes in tile_engine
* Updated code to merge kernel_launch
* removing comments
* Enable rotating buffer changes to flatmm
* Created diff launch_kernel function for rotating buffer
* Simplified calculation using macros
* merge code with new changes in tile_engine
* clang formatted
* Redefine macros
2025-05-27 23:00:58 -07:00
Casey-Shi
128f5a1eab
[Tile Engine] Add benchmark for tile engine gemm. ( #2193 )
...
* initial commit -m benchmark
* only support profile
* fix
* fix doc
* add default config
* add ci
* fix cmake
* tmp save for gen blobs
* fix bug
* merge
* range config
* test success
* fix
* fix
* move struct
* remove config property
* fix config
* remove comment
* add cmake option & modify
* add changelog
* fix
* format
* add pydantic module to the docker image
* fix
* add benchmark for cold and warm up
* python format
* add asm cache control
* fix README
* remove pydantic module
* modify changelog
* fix config
* recover benchmark_gemm and fix
* format python
* refactor profiler
* fix csv bug
* fix codegen bug
* add kernel instance object
* add benchmark gemm executable
* fix jenkins & delete extra header
* disable warning output & enable default config
* Disable sparsity for invalid warp tile combinations
* fix gemm host template func
* refactor gemm profiler
* filter out some instances
* default config test & fix codegen bug
* add sparse flag to gen more instances
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: khuagarw <khuagarw@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-05-26 22:32:36 -07:00
Po Yen Chen
c42b957d65
[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask ( #2209 )
...
* Assign block indices reversely if kHasMask=true
* Assign block indices reversely for splitkv kernel
2025-05-27 10:58:58 +08:00
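The reversed block assignment from c42b957d65 above can be pictured with a small host-side model. Everything here is an assumption for illustration: `assigned_block` and `causal_kv_blocks` are hypothetical names, and the motivation shown (with a causal mask, later query blocks attend to more key/value blocks, so launching them first may balance the load better) is a plausible reading that the commit message does not state.

```python
def assigned_block(launch_idx, num_blocks, has_mask):
    """Map the hardware launch index to the logical query-block index.

    With has_mask=True the order is reversed, mirroring the commit title.
    """
    return num_blocks - 1 - launch_idx if has_mask else launch_idx


def causal_kv_blocks(q_block):
    """KV blocks visited by a query block under a causal mask.

    Assumes equal query and key/value block sizes.
    """
    return q_block + 1


num_blocks = 4
order = [assigned_block(i, num_blocks, has_mask=True) for i in range(num_blocks)]
work = [causal_kv_blocks(q) for q in order]
print(order)  # [3, 2, 1, 0]
print(work)   # [4, 3, 2, 1]: the heaviest block is scheduled first
```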
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q ( #2221 )
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d .
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com >
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1
fix the buffer intrinsic names for clang >=20 ( #2228 )
2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6
Revert "Update the buffer load/store intrinsic names for clang>=20. ( #2192 )" ( #2227 )
...
This reverts commit 58f9e9ffbc .
2025-05-22 15:41:17 -07:00
Aviral Goel
534d4594d0
Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy ( #2214 )
...
* window_origin variable now in base class
* abstracted more functions
* consolidated tile_window_static_distribution and tile_window_static_lengths
* clang format
* skeleton code for tile_window and tile_window_linear consolidation
* more abstraction
* moved variables from child to parent
* clang format
* removed comments
* removed debug code
* removed debug code
* abstracting traits WIP
* consolidated traits
* removed comments and clang formatted
2025-05-21 23:28:00 -07:00
Aviral Goel
fa39c4e798
Add Doxygen Documentation for HostTensor, HostTensorDescriptor, DeviceMem, FillUniformDistribution ( #2160 )
...
* added documentation for HostTensorDescriptor
* added documentation for DeviceMem and FillUniformDistribution
* fixed merging error
* fixed host_tensor_descriptor error
* clang format
2025-05-21 10:34:30 -07:00