Jakub Piasecki
116a503274
resolve conflicts
2025-06-16 15:33:24 +00:00
Bartlomiej Kocot
1996c99bee
rebase fixes
2025-06-16 15:03:09 +00:00
Jakub Piasecki
60bd2a4fdf
resolved conflicts
2025-06-16 12:22:53 +00:00
Bartlomiej Kocot
6e2b32a58a
Merge branch 'develop' of github.com:ROCm/composable_kernel into barkocot/ck-tile-grouped-conv-fwd
2025-06-16 11:49:21 +00:00
Bartlomiej Kocot
a44fe5f477
refactor
2025-06-16 11:48:36 +00:00
Thomas Ning
d996bc78be
fix the flatmm (#2349)
2025-06-16 02:17:53 -07:00
ruanjm
b34c234f51
Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332)
* Add support for specifying valid flag when fetching elements for tile_scatter_gather
Add constexpr for operator[] of TrueGenerator
* Use different path when valid is enabled
2025-06-16 17:17:03 +08:00
carlushuang
fb97f75099
hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348)
2025-06-15 22:49:04 -07:00
Mateusz Ozga
bd96ac9742
[CK_TILE] Multiple-D GEMM example (#2219)
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a reference function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails
2025-06-13 19:39:11 +02:00
Bartlomiej Kocot
0a3b31a6f3
Merge branch 'develop' of github.com:ROCm/composable_kernel into barkocot/ck-tile-grouped-conv-fwd
2025-06-13 13:22:14 +00:00
kylasa
5f1ad09b61
Code drop for 2 warp ping pong scheduler along K dimension. (#2276)
* Code drop for 2 warp ping pong scheduler along K dimension.
* Addressing code review comments.
* Addressing Clang formatting issues.
* Addressing build issues.
* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.
* Fix for LDS memory size for GEMM pipelines.
* Addressing code review feedback comments.
* Change log update.
* Addressing code review comments and build issues.
* Added new policy for pipeline specific logic about LDS needs.
* Clang Fix during build.
2025-06-12 18:24:02 -07:00
Thomas Ning
f59b8c7d3d
OCP FP8 Macro restructure (#2331)
* solved the problem
2025-06-12 09:46:33 -07:00
carlushuang
8aff45a8af
[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327)
* refactor subtoken logic to let more kernel pickup mp kernel
* typo
2025-06-12 11:44:22 +08:00
Thomas Ning
06e0b8436c
Epilogue cshuffle Improvement (#2312)
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
* solve the previous PR error, refine the code
* Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Comment addressed
* handling tile_engine failing case
* handling tile_engine failing case
---------
Co-authored-by: joyeamd <John.Ye@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: khushbu agarwal <khuagarw@amd.com>
2025-06-10 22:44:50 -07:00
Thomas Ning
14d229d6c8
fix on the typo (#2326)
2025-06-10 16:34:33 -07:00
Khushbu Agarwal
bd270fe4bc
fix flatmm kernel for bigger size for fp16 datatype (#2302)
2025-06-10 11:13:40 -07:00
Eisuke Kawashima
4e586ca958
chore: unset executable permission (#2303)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2025-06-10 09:13:59 -07:00
John Afaganis
6635d1bb88
Remove usage of 'warpSize' variable as it has been deprecated (#2295)
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>
2025-06-10 07:34:54 -07:00
carlushuang
2e0536269e
hot fix (#2315)
2025-06-10 20:35:28 +08:00
MHYangAMD
9fcf21a4ec
Fix fmha fwd precision issue on MI3XX series (#2285)
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P leads to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero.
For example, with K and V fixed to all 1 and Q random, every output is
expected to be 1. However, some outputs come out as 0.9995, and the
error is smaller than the atol of 0.001 (1 - 0.9995 = 0.0005 < 0.001),
so CK does not report it.
* Add option to switch rtn/rtz for fmha fwd
2025-06-10 15:03:23 +08:00
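The rounding effect described in commit 9fcf21a4ec can be reproduced on the host. The following is a minimal sketch, not the CK implementation: it emulates fp32-to-fp16 conversion in plain Python with the standard `struct` module and does not call `__builtin_amdgcn_cvt_pkrtz`. The helper names `to_fp16_rtn` and `to_fp16_rtz`, and the sample value 0.9999, are hypothetical and chosen only for illustration.

```python
import struct


def to_fp16_rtn(x: float) -> float:
    """Round x to the nearest fp16 value (ties to even) and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp16_rtz(x: float) -> float:
    """Round x toward zero to fp16 (emulated; finite, non-overflowing inputs only)."""
    y = to_fp16_rtn(x)
    if abs(y) > abs(x):
        # Nearest rounding overshot in magnitude. For a finite, nonzero fp16
        # value, decrementing the raw bit pattern steps one ulp toward zero.
        bits = struct.unpack("<H", struct.pack("<e", y))[0]
        y = struct.unpack("<e", struct.pack("<H", bits - 1))[0]
    return y


ATOL = 1e-3  # tolerance quoted in the commit message

p = 0.9999  # hypothetical value that should round to 1.0 in fp16
rtn = to_fp16_rtn(p)  # 1.0
rtz = to_fp16_rtz(p)  # 0.99951171875, the largest fp16 value below 1.0

print(rtn, rtz)
print(abs(1.0 - rtz) < ATOL)  # True: the truncation error hides under the tolerance
```

Under these assumptions, round-toward-zero yields the fp16 value just below 1.0, which displays as 0.9995, and the resulting error of about 0.0005 passes a 0.001 absolute tolerance. This is consistent with the commit's observation that the check did not flag the issue.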
carlushuang
65835c0bbb
MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! (#2305)
2025-06-10 10:40:54 +08:00
Aviral Goel
5a0bd157db
Code Refactor for check_err.hpp (#2284)
* refactor & add documentation
* removed return datatype from doxygen comments
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com>
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
---------
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-06-08 13:41:27 -07:00
Sami Remes
1c6f83df6c
[CK_TILE] Tileloop persistent gemm - resubmit (#2299)
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit 233e274077.
* Add missing header for kentry
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-06 14:18:49 -07:00
valarLip
8482977a37
extend buffer load to support load 32 bf16/fp16 at same time (#2291)
2025-06-06 17:21:19 +08:00
Andriy Roshchenko
00247e3c29
Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-06-05 13:54:15 -06:00
Illia Silin
233e274077
Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit ffb52783d0.
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59
[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222)
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0
[CK_TILE] Tile loop persistent gemm kernel (#2191)
* Implement tile loop persistent gemm kernel
* Enable timing
* Add tests for persistent gemm
* Fix formatting
* Fix gemm_basic
* Rename True/False to Persistent/NonPersistent
* Use only one set of layouts for persistent tests
* Fix gemm example persistent template parameter
* Fix formatting
2025-06-04 11:46:28 +03:00
Khushbu Agarwal
59a85cb4bc
[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
* debugging issue
* debugging issue
* debugging
* debugging
* reverting debugging code
* clang formatted
* updating default_config.json
* fix ci failure
* clang formatted
2025-06-03 20:16:10 -07:00
Khushbu Agarwal
2e38eb4f1c
Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"
This reverts commit bbdaf79a52.
* fix regression
2025-06-02 10:25:01 -07:00
valarLip
0fdbf6bcd1
extend buffer load for fp16/bf16x16 (#2270)
* extend buffer load for fp16/bf16x16
* format
2025-06-02 10:29:54 +08:00
Illia Silin
4e561af18c
Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260)
This reverts commit fd6a859b44.
2025-05-29 16:22:16 -07:00
joyeamd
fd6a859b44
add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
2025-05-29 14:31:14 +02:00
Illia Silin
bbdaf79a52
Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6
[CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm
* adding changes in tile_engine
* Updated code to merge kernel_launch
* removing comments
* Enable rotating buffer changes to flatmm
* Created diff launch_kernel function for rotating buffer
* Simplified calculation using macros
* merge code with new changes in tile_engine
* clang formatted
* Redefine macros
2025-05-27 23:00:58 -07:00
Casey-Shi
128f5a1eab
[Tile Engine] Add benchmark for tile engine gemm. (#2193)
* initial commit -m benchmark
* only support profile
* fix
* fix doc
* add default config
* add ci
* fix cmake
* tmp save for gen blobs
* fix bug
* merge
* range config
* test success
* fix
* fix
* move struct
* remove config property
* fix config
* remove comment
* add cmake option & modify
* add changelog
* fix
* format
* add pydantic module to the docker image
* fix
* add benchmark for cold and warm up
* python format
* add asm cache control
* fix README
* remove pydantic module
* modify changelog
* fix config
* recover benchmark_gemm and fix
* format python
* refactor profiler
* fix csv bug
* fix codegen bug
* add kernel instance object
* add benchmark gemm executable
* fix jenkins & delete extra header
* disable warning output & enable default config
* Disable sparsity for invalid warp tile combinations
* fix gemm host template func
* refactor gemm profiler
* filter out some instances
* default config test & fix codegen bug
* add sparse flag to gen more instances
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-05-26 22:32:36 -07:00
Po Yen Chen
c42b957d65
[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask (#2209)
* Assign block indices reversely if kHasMask=true
* Assign block indices reversely for splitkv kernel
2025-05-27 10:58:58 +08:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d.
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1
fix the buffer intrinsic names for clang >=20 (#2228)
2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6
Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Aviral Goel
534d4594d0
Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy (#2214)
* window_origin variable now in base class
* abstracted more functions
* consolidated tile_window_static_distribution and tile_window_static_lengths
* clang format
* skeleton code for tile_window and tile_window_linear consolidation
* more abstraction
* moved variables from child to parent
* clang format
* removed comments
* removed debug code
* removed debug code
* abstracting traits WIP
* consolidated traits
* removed comments and clang formatted
2025-05-21 23:28:00 -07:00
Aviral Goel
fa39c4e798
Add Doxygen Documentation for HostTensor, HostTensorDescriptor, DeviceMem, FillUniformDistribution (#2160)
* added documentation for HostTensorDescriptor
* added documentation for DeviceMem and FillUniformDistribution
* fixed merging error
* fixed host_tensor_descriptor error
* clang format
2025-05-21 10:34:30 -07:00
Sami Remes
d1e6f0982d
[CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
Po Yen Chen
791802b381
[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198)
* Write soft-sign in inline asm
* Change tile idx computation
* Add macro to turn off soft-sign asm opt
* Use simple for loop to avoid register spill
* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
Khushbu Agarwal
3d8d6e75e4
Adding validation for tile sizes in Tile Engine (#2189)
* Adding validation for tile sizes
* Add architecture in config, and shuffle lines of code in warp_gemm.hpp
* Enable MFMA for gfx950, and invalid tile handling
2025-05-15 10:28:31 -07:00
BingYuan.Zhou
41c17d0a95
fix moe sorting build fail (#2190)
* fix moe sorting build fail
* refile code
---------
Co-authored-by: solin <bingzhou@amd.com>
2025-05-14 09:31:26 +08:00
Illia Silin
58f9e9ffbc
Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names
* fix clang format
2025-05-13 10:18:14 -07:00
Bartlomiej Kocot
976e32ccfa
custom vector size
2025-05-13 14:38:36 +00:00
Bartlomiej Kocot
aaca2e2b08
[CK TILE] Grouped Convolution Forward Kernel
2025-05-13 11:52:11 +00:00
Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to silence the compilation error
* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00