Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines ( #2163 )
...
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to silence the compilation error
* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133 )
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com >
Co-authored-by: coderfeli <coderfeli@163.com >
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
f05e45ba59
Disable SMFMA gfx90a ( #2184 )
...
* sparsity fix for gfx90a
* reverting tile_engine changes
2025-05-12 09:56:23 -07:00
Thomas Ning
9d1e44e56a
Vectorized Transpose for Batched Transpose CK Tile Operator ( #2131 )
...
* Shared Memory for single data point
* CKTile Transpose vectorize CP1
* CKTile Transpose vectorize CP2
* CKTile Transpose vectorize CP2.1
* fixed the compile error of the transpose tile 2d
* Have the correct result for the current test sample
* Changes to printing tensor
* fp8 support added
* Debugging for transpose
* solving the corner issue
* Changed padding flag
* Intermediate Debugging
* Intermediate Debugging
* Intermediate Debugging
* Finished debugging of the transpose op
* Code Cleanup
* Adding edge case smoke tests
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Addressing Review Comment
* Addressing Comments
* Addressing Comments
* Measuring Perf Tests
* Code Cleanup
* Changelog
* Added the running iterations
* clang format
* Fix the changelog
* Fix the compilation error
* change the printing factor
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com >
2025-05-12 00:41:45 -07:00
Khushbu Agarwal
d8faf1c6a1
Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 ( #2172 )
...
* Changes for updating tile distribution for shuffle and transpose
* Fixed swizzle and transpose, removed comments
* clang formatted
* Adding support for bf16 type
* Addressing review comments
2025-05-10 22:40:05 -07:00
Khushbu Agarwal
ef72a4b9bc
Disable SMFMA for gfx90a ( #2182 )
2025-05-09 00:18:07 -07:00
Thomas Ning
c757046d49
Revert "Disable the SMFMA instruction for gfx90a. ( #2174 )" ( #2175 )
...
This reverts commit a32d907771 .
2025-05-08 00:07:03 -07:00
Khushbu Agarwal
a32d907771
Disable the SMFMA instruction for gfx90a. ( #2174 )
...
* remove smfma for gfx90a
* clang formatted
2025-05-07 23:09:22 -07:00
BingYuan.Zhou
6a3960c1e1
Flatmm merge ( #2168 )
...
* sync with function interface of cshuffleepilogue, fix flatmm build fail
* move code from solin/flatmm, which adds mfma16*16*32fp8 and optimizes flatmm
---------
Co-authored-by: solin <bingzhou@amd.com >
2025-05-08 12:59:57 +08:00
jakpiase
cb07ad84d5
fix for default epilogue ( #2167 )
2025-05-07 10:46:53 -07:00
Aviral Goel
769336b640
[CK_TILE] Add type traits to detect tile window types at compile time ( #2158 )
...
* added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline
* added type traits instead of enum to tile_window() and tile_window_linear() with debug comments
* removed comments, added documentation and clang format
2025-05-07 00:00:39 -07:00
carlushuang
4e9b76f88c
[CK_TILE] optimize moe sorting kernel, boost large context case up to 20x ( #2153 )
...
* combine 2-3 as single stage
* support zeroing
* improve long tokens
* update specialization
* b16 ws
* 8bit topk optimize
* update 15 example
2025-05-06 17:32:07 +08:00
jakpiase
0bcb804ad0
[CK_TILE] Remove scratch usage from universal gemm ( #2001 )
...
* moves kbatch condition outside of kernel
* add reviewer comments
* fixes
* fix tests
* fixes after review
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
2025-05-05 18:46:44 +02:00
Khushbu Agarwal
d58f2b8bd0
mfma_32x32x64_fp8/bf8 ( #2148 )
...
* support for mfma_32x32x64_fp8
* clang-formatted
* Fixing sparsity in codegen
2025-05-01 13:36:24 -07:00
Aviral Goel
1d8ef40760
Add documentation for ck_tile::array<T,N> ( #2078 )
...
* added documentation for ck_tile::array<T,N>
* clang format fix
* spelling errros
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
* spelling errros
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
* Apply suggestions from code review
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Revert "spelling errros"
This reverts commit 4179e7d193 .
* Revert "spelling errros"
This reverts commit 3f90733dbe .
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
Co-authored-by: John Afaganis <john.afaganis@amd.com >
2025-04-30 16:43:36 -07:00
Illia Silin
9a9f59ae69
Revert "Add ck tile examples to package ( #1880 )" ( #2150 )
2025-04-30 10:20:16 -07:00
Aviral Goel
65f182d617
Add Matrix A and Matrix B Swizzle for LDS in Computev4 policy ( #2136 )
...
* fixed computev4 policy bug for lds swizzle
* added swizzle for input matrix B
* Improved ComputeV4 policy and pipeline by swizzling A and B
* consolidated LDS descriptor functions in parent struct
2025-04-28 18:20:47 -07:00
Khushbu Agarwal
d107f3c3a5
Support for MFMA_16x16x128 for fp8/bf8 ( #2125 )
...
* Adding 16x16x128 support for gfx950
* Support for fp8 and bf8
* fix input arguments for MFMA scale instruction
* clang-formatted
* Fixes for lwpck-3145 (#2138 )
* Fix lds tile & cmake dep & default epilogue
* Fallback BTypeToUse to ADataType in WOQ cases
* reverting instance json file
* reverting instance json file
---------
Co-authored-by: Yi DING <yi.ding@amd.com >
2025-04-28 18:19:50 -07:00
jakpiase
434d19f696
Add ck tile examples to package ( #1880 )
...
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-04-28 09:53:19 -07:00
Yi DING
8add2cf45d
Fix fp8 convert & add option for basic example ( #2129 )
2025-04-27 16:26:05 -07:00
Khushbu Agarwal
a2ed34a112
MFMA_32x32x16 for gfx950 ( #2121 )
...
* Enable MFMA_32x32x16 for fp16/BF16 for gfx950
* clang formatted
2025-04-24 10:20:22 -07:00
Illia Silin
01cb8379cd
make code compliant with std=c++20 ( #2123 )
2025-04-24 10:14:52 -07:00
carlushuang
5487289fc4
[CK_TILE] support gfx950 matrix core in 01_fmha fwd ( #2110 )
...
* gfx950 01_fmha fwd
* fix comment
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-04-23 12:40:18 -07:00
Gino Lu
504f563f78
[CK-Tile] warp-gemm support for using V_MFMA_F32_16x16x32_BF16 ( #2073 )
...
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* fix error while testing new command
* Finished the feature of new mfma 16*16*32
* Addressed the comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com >
2025-04-22 15:52:36 -07:00
Thomas Ning
a738e43445
MFMA 16x16x32fp8 ( #2103 )
...
* add mfma_16x16x32_fp8
* clang format code
* Finished the fix for gemm basic
* clang format
* rebuild CI
* recover gemm.hpp
* add MFMA 16*16*32bf8
---------
Co-authored-by: solin <bingzhou@amd.com >
2025-04-21 10:21:35 -07:00
solin
c318ec0778
fix CI build fail
2025-04-21 16:00:12 +08:00
BingYuan.Zhou
eaf1f0bf3b
[flatmm] implement basic fp16 flatmm ( #2089 )
...
* [flatmm] implement basic fp16 flatmm
* fix CI build fail
---------
Co-authored-by: root <root@hjbog-srdc-50.amd.com >
Co-authored-by: solin <bingzhou@amd.com >
2025-04-16 16:51:17 +08:00
Thomas Ning
269f4f6af5
Solve the Static Encoding Pattern compile error when the tile size is too small ( #2079 )
2025-04-13 20:09:30 -07:00
jakpiase
6c61f4d237
[CK_TILE] Add 2:4 structured sparsity support for fp16 gemm ( #1957 )
...
* add structured sparsity fp16 support for gemm
* added reviewer suggestions
* update changelog
* update changelog
* add reviewers suggestions
* Minor fix
* clang fix
* fix doxygen
2025-04-11 12:18:26 +02:00
slippedJim
5f885d2b7a
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
f14e648e7c
Replace inline assembly with builtins in FMHA ( #2067 )
...
* Replace inline assembly with builtins in FMHA
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
2025-04-10 09:48:37 +02:00
Illia Silin
3e6d21adeb
enable gfx115x support ( #2065 )
2025-04-09 10:06:42 -07:00
MHYang-gh
03ce8729fd
Make buffer coherence configurable in tensor view ( #2041 )
...
* Make buffer coherence configurable in tensor view
* Fix clang-format for tensor_view.hpp
2025-04-08 15:34:11 -07:00
Max Podkorytov
6ce0797dad
simplify generate_tuple ( #2043 )
2025-04-08 09:00:51 -07:00
aledudek
80aae6119b
[CK_TILE] Fix GEMM Memory Pipeline ( #2034 )
...
* [CK_TILE] Fix GEMM Memory Pipeline
* Fix transpose tile
* Add comments
2025-04-08 12:40:04 +02:00
Illia Silin
572cd820ce
Split env.hpp header from the ck.hpp header. ( #2049 )
...
* split env.hpp out of main headers
* fix namespace logic
2025-04-03 15:30:21 -07:00
Adam Osewski
e5ad48a784
Basic docs for universal gemm & ck-tile gemm. ( #2014 )
...
* Basic docs for universal gemm & ck-tile gemm.
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Reviewers suggestions.
* Align tparam names in doc with class tparams.
* More reviewers fine tuning ;)
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
2025-04-02 11:03:40 +02:00
Seunghoon Lee
df32020f93
Fix Windows build. ( #2012 )
...
* Remove duplicate using uint64_t.
* Cast before shift.
2025-04-01 12:22:10 -07:00
Max Podkorytov
c59a8bb206
add a fast compilation path for static for (0..N) ( #2005 )
...
* add a fast compilation path for static for (0..N)
* Update functional2.hpp
add comment and put range applier into detail namespace
* Update functional.hpp
ditto for ck-tile
* prettify
* prettify more
* add comment
* clang-format
2025-04-01 12:06:25 -07:00
rocking
8a20b62e91
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-03-28 21:58:06 +08:00
felix
a82f338fb9
hotfix fix sorting int64 ( #2025 )
...
* fix sorting int64
* clang format
* fix example issue
* update WA issue #
---------
Co-authored-by: coderfeli <coderfeli@163.com >
Co-authored-by: carlushuang <carlus.huang@amd.com >
2025-03-28 11:31:52 +08:00
ruanjm
d49abdaa87
[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance ( #1861 )
...
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug fix
2025-03-25 20:09:45 +08:00
MHYang-gh
c027637a8f
Fix A/B lds transform ( #2007 )
2025-03-22 23:13:50 -07:00
BingYuan.Zhou
5a0d693b86
fix ck_tile/basic_gemm build error ( #1988 )
2025-03-20 22:01:14 -07:00
carlushuang
e3c9886cdf
[CK_TILE] return value with macro in ck_tile::kernel_launch API ( #1982 )
...
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992 )
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61
[CK_TILE] Switch to universal gemm for batched and grouped gemms ( #1919 )
...
* switch to universal gemm for batched and grouped gemms
* added reviewer comments
* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00
rocking
b819c217e4
Sync the kname with instance name ( #1989 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-03-20 00:06:45 +08:00
carlushuang
3e81279d26
Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" … ( #1971 )
...
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 )
This reverts commit 8cbcd3e0d0 .
* fix codegen problem
* Update config.hpp
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-03-13 11:41:39 +08:00
Illia Silin
4c97cc511e
use old intrinsics with staging compiler ( #1970 )
2025-03-12 07:29:09 -07:00
Illia Silin
8cbcd3e0d0
Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" ( #1969 )
...
This reverts commit 7a93b16ff6 .
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6
[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )
...
* support hdim=192/128 pair
* remove useless print
* update
2025-03-11 21:07:40 +08:00