Qianfeng Zhang
179f0e857e
Rename WarpTile in fwd setting
2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966
Simplifying the codes in defining KDram and QDram tile distribution
2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93
Tiny update in GetMaxVectorSize()
2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80
Rename and add scripts for testing hdim96
2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1
Enable hdim96 instances
2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54
Fix with regard to define stride in MakeKLdsBlockDescriptor()
2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab
Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV
2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb
Further correction with regard to using n0_loops and k1_loops
2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8
Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning
2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857
Simplify the codes in block_gemm
2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519
Further clarification in using kSubQKHeaddim and kQKHeaddim
2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee
Clarify the using of kSubQKHeaddim and kQKHeaddim
2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a
Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines
2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be
Tiny fix in GetQKBlockGemm
2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d
Enable the using of WarpTile-32x32x16 and add scripts to verify
2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e
Add static_assert and comments in the with_softmax pipelines
2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update REAMD.md
2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f
Use explicit partition_index to ensure warp_id is allocated on vpgr when accessing LDS tile_window
2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd
2025-11-23 04:20:53 +00:00
Emily Martins
2e4b8a8fc4
[CK_TILE] Remove Old CK Tile Stream-K Artifacts ( #3202 )
...
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed
Revert "Add attn sink ( #2892 )" ( #3250 )
...
This reverts commit 9fa4e8d5ab .
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab
Add attn sink ( #2892 )
...
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix some error
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd_kernel'mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix ci error that some usage of lamada not support in c++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix c++17 unsupprot error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e
[CK_TILE] Add Flatmm MX FP8 ( #3208 )
...
* Use async for flatmm mxfp4
* Fix preshuffle
* Add flatmm mxfp8
* Thanks, Copilot
* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa
[ck_tile] Pooling example - Improved tile sizes ( #3233 )
...
* improved tile sizes
- modified tile sizes for improved example performance
* Update example/ck_tile/36_pooling/pool3d.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
2025-11-19 15:30:18 +01:00
John Shumway
ad57f6ef0b
[CK_BUILDER] Put global CK functions in an the CK namespace ( #3232 )
...
* Wrap ck host utitlies in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.
* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
Aviral Goel
ac70206b2c
feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… ( #3225 )
...
* feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test
* docs: Update CHANGELOG.MD
2025-11-18 09:32:27 -05:00
Yi DING
b6720531de
[CK_TILE] MX Flatmm Split kernel instances ( #3207 )
...
* [CK_TILE] MX Flatmm Split kernel instances
* Fix flatmm example compile
2025-11-18 13:46:30 +08:00
Qianfeng Zhang
b75077475b
Remove useless codes in the two trload pipelines
2025-11-15 13:48:50 +00:00
Qianfeng Zhang
238b5c4f08
Separate Traits from Problem while being used for defining the pipeline
2025-11-14 16:42:06 +00:00
BingYuan.Zhou
4d629cd2b0
fix build error ( #3195 )
...
Co-authored-by: root <root@hjbog-srdc-39.amd.com >
2025-11-14 09:46:13 +08:00
yinglu
2a73eb3bc0
Simulate TF32 with BF16x3 ( #3142 )
...
* tf32:bf16x3:use bf16x3 emulate tf32 gemm
* change blockwiseGemm to demo bf16x3
* temp push
* self review
* self review
* fix multi-device compile error
* bug fix
* code refactor
* limit to gfx950
* enhance gemm gfx942 threshold
* lower change from blockwise to warpwise
* refact codes
* refact codes
* error fix
* change threshold
* bug fix
* fix threshold error
* change host reference implement to same as device
* bug fix
* bug fix
* code refact
* fix clang-format fail
* code refine
2025-11-13 16:21:09 -08:00
Qianfeng Zhang
95c1bb25e3
Remove the k_element_func and v_element_func from the pipeline since they are not used
2025-11-13 14:53:47 +00:00
Khushbu Agarwal
fb41a7b73b
fixing ambiguous shuffle definitions ( #3175 )
...
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-11-12 23:44:12 -08:00
Cong Ma
6fd8ddabe7
[CK TILE GEMM] Refactor block_scale_gemm examples ( #3181 )
...
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update README
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.
2025-11-12 23:43:40 -08:00
Qianfeng Zhang
881ddc5741
Update to the two trload pipeline to load whole Q-tile once through LDS on mi350
2025-11-12 15:59:38 +00:00
Yashvardhan Agarwal
299c9bca1b
[CK_Tile] Pooling example readme update ( #3174 )
...
* pooling example readme update
- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram
* Update example/ck_tile/36_pooling/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* resolve comments
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
2025-11-12 07:30:20 -08:00
Aviral Goel
b145a5fe80
Add CK Tile Tutorials Folder with GEMM and COPY Kernel ( #3038 )
...
* feat: add tutorial folder with gemm tutorial
* chore: move copy kernel from examples folder to tutorial
* Update tutorial/ck_tile/01_naive_gemm/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tutorial/ck_tile/01_naive_gemm/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* chore: remove handdrawn images
* docs: add write ups to explain the gemm kernel
* docs: add about block level pipeline and static distributed tensors
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-11 14:15:49 -06:00
linqunAMD
1b1c46e508
[CK_TILE] Fix gemm_quant ( #3186 )
2025-11-11 08:23:57 -08:00
Enrico Degregori
1c544abf57
Extend support for ak1 / bk1 WMMA ( #3073 )
...
* Extend AK1 / BK1 support:
- Add support for AK1 != BK1
- Add support for AK1, BK1 > 8
- Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction
* fix clang format
2025-11-11 07:38:15 -08:00
Thomas Ning
9f33b7cfd3
fix input range ( #3188 )
2025-11-10 11:08:41 -08:00
linqunAMD
e593a14ae1
[ck] correct memory size in grouped_gemm_multi_abd_xdl_fixed_nk_bias_bf16_i8 ( #3168 )
...
b1 and b0 use same layout, so, the size of b1_tensors_device should be same with b0_tensors_device's
2025-11-10 10:58:08 -08:00
Qianfeng Zhang
8f876f094e
Simplify the codes in block_gemm_areg_bsmem_creg_v2_hack_1
2025-11-10 16:11:26 +00:00
Qianfeng Zhang
303818a851
Simplify the codes in block_gemm_areg_bsmem_trload_creg
2025-11-10 15:27:34 +00:00
Xudong Yuan
d04eba4ae3
Ck moe mxfp4 blockm32 ( #3098 )
...
* block_m = 32
* ck block_m = 32
* aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format
* mxfp4_moe v1 pipe
* update format
---------
Co-authored-by: zhimding <zhimding@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: felix <felix.li@amd.com >
2025-11-07 08:45:41 +08:00
Bartłomiej Kocot
2234ff830b
[CK TILE] Convolution remove magic values ( #3160 )
...
* [CK TILE] Refactor Conv configs and Conv Elementwise
* fix
* [CK TILE] Convolution remove magix values
* fix partitioner
2025-11-06 11:26:30 +01:00
Qianfeng Zhang
bd0444f365
[Performance] Change the tile settings for mi350/trload no_softmax pipeline to enable to use mfma-16x16x32 for Gemm-1
2025-11-06 08:20:11 +00:00
Qianfeng Zhang
54cd431f16
Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles
2025-11-05 16:23:05 +00:00
Qianfeng Zhang
d190af2ef5
Tiny fix in trload with_softmax/no_softmax pipeline
2025-11-05 14:44:13 +00:00