Qianfeng Zhang
641dae10e8
Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning
2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857
Simplify the codes in block_gemm
2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519
Further clarification in using kSubQKHeaddim and kQKHeaddim
2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee
Clarify the using of kSubQKHeaddim and kQKHeaddim
2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a
Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines
2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be
Tiny fix in GetQKBlockGemm
2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d
Enable the using of WarpTile-32x32x16 and add scripts to verify
2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e
Add static_assert and comments in the with_softmax pipelines
2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update REAMD.md
2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f
Use explicit partition_index to ensure warp_id is allocated on vpgr when accessing LDS tile_window
2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd
2025-11-23 04:20:53 +00:00
Emily Martins
2e4b8a8fc4
[CK_TILE] Remove Old CK Tile Stream-K Artifacts ( #3202 )
...
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed
Revert "Add attn sink ( #2892 )" ( #3250 )
...
This reverts commit 9fa4e8d5ab .
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab
Add attn sink ( #2892 )
...
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix some error
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd_kernel'mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix ci error that some usage of lamada not support in c++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix c++17 unsupprot error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e
[CK_TILE] Add Flatmm MX FP8 ( #3208 )
...
* Use async for flatmm mxfp4
* Fix preshuffle
* Add flatmm mxfp8
* Thanks, Copilot
* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa
[ck_tile] Pooling example - Improved tile sizes ( #3233 )
...
* improved tile sizes
- modified tile sizes for improved example performance
* Update example/ck_tile/36_pooling/pool3d.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
2025-11-19 15:30:18 +01:00
Aviral Goel
ac70206b2c
feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… ( #3225 )
...
* feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test
* docs: Update CHANGELOG.MD
2025-11-18 09:32:27 -05:00
Yi DING
b6720531de
[CK_TILE] MX Flatmm Split kernel instances ( #3207 )
...
* [CK_TILE] MX Flatmm Split kernel instances
* Fix flatmm example compile
2025-11-18 13:46:30 +08:00
Qianfeng Zhang
b75077475b
Remove useless codes in the two trload pipelines
2025-11-15 13:48:50 +00:00
Qianfeng Zhang
238b5c4f08
Separate Traits from Problem while being used for defining the pipeline
2025-11-14 16:42:06 +00:00
BingYuan.Zhou
4d629cd2b0
fix build error ( #3195 )
...
Co-authored-by: root <root@hjbog-srdc-39.amd.com >
2025-11-14 09:46:13 +08:00
Qianfeng Zhang
95c1bb25e3
Remove the k_element_func and v_element_func from the pipeline since they are not used
2025-11-13 14:53:47 +00:00
Khushbu Agarwal
fb41a7b73b
fixing ambiguous shuffle definitions ( #3175 )
...
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-11-12 23:44:12 -08:00
Cong Ma
6fd8ddabe7
[CK TILE GEMM] Refactor block_scale_gemm examples ( #3181 )
...
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update README
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.
2025-11-12 23:43:40 -08:00
Qianfeng Zhang
881ddc5741
Update to the two trload pipeline to load whole Q-tile once through LDS on mi350
2025-11-12 15:59:38 +00:00
Yashvardhan Agarwal
299c9bca1b
[CK_Tile] Pooling example readme update ( #3174 )
...
* pooling example readme update
- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram
* Update example/ck_tile/36_pooling/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* resolve comments
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
2025-11-12 07:30:20 -08:00
Aviral Goel
b145a5fe80
Add CK Tile Tutorials Folder with GEMM and COPY Kernel ( #3038 )
...
* feat: add tutorial folder with gemm tutorial
* chore: move copy kernel from examples folder to tutorial
* Update tutorial/ck_tile/01_naive_gemm/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tutorial/ck_tile/01_naive_gemm/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* chore: remove handdrawn images
* docs: add write ups to explain the gemm kernel
* docs: add about block level pipeline and static distributed tensors
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-11 14:15:49 -06:00
linqunAMD
1b1c46e508
[CK_TILE] Fix gemm_quant ( #3186 )
2025-11-11 08:23:57 -08:00
Thomas Ning
9f33b7cfd3
fix input range ( #3188 )
2025-11-10 11:08:41 -08:00
Qianfeng Zhang
8f876f094e
Simplify the codes in block_gemm_areg_bsmem_creg_v2_hack_1
2025-11-10 16:11:26 +00:00
Qianfeng Zhang
303818a851
Simplify the codes in block_gemm_areg_bsmem_trload_creg
2025-11-10 15:27:34 +00:00
Bartłomiej Kocot
2234ff830b
[CK TILE] Convolution remove magic values ( #3160 )
...
* [CK TILE] Refactor Conv configs and Conv Elementwise
* fix
* [CK TILE] Convolution remove magix values
* fix partitioner
2025-11-06 11:26:30 +01:00
Qianfeng Zhang
bd0444f365
[Performance] Change the tile settings for mi350/trload no_softmax pipeline to enable to use mfma-16x16x32 for Gemm-1
2025-11-06 08:20:11 +00:00
Qianfeng Zhang
54cd431f16
Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles
2025-11-05 16:23:05 +00:00
Qianfeng Zhang
d190af2ef5
Tiny fix in trload with_softmax/no_softmax pipeline
2025-11-05 14:44:13 +00:00
Cong Ma
5abe4109e0
Introduces the new partitioner to implement the reduction StreamK kernel. ( #3107 )
...
* Introduces the new partitioner to implement the reduction StreamK kernel
* Add more doc text to functions
* Add persistent-dp option to streamk example
* Update example/ck_tile/40_streamk_gemm/README.md
2025-11-04 10:32:17 -07:00
Thomas Ning
13ba06f1e7
fix the blockscale 2d case ( #3148 )
...
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
2025-11-04 11:55:23 -05:00
Qianfeng Zhang
99993acca4
Improve both the with_softmax and no_softmax pipelines
2025-11-04 15:46:49 +00:00
Bartłomiej Kocot
8681ced962
[CK TILE] Refactor Conv configs and Conv Elementwise ( #3151 )
...
* [CK TILE] Refactor Conv configs and Conv Elementwise
* fix
2025-11-04 15:04:53 +01:00
Bartłomiej Kocot
99f38e4d9b
[CK TILE] Refactor grouped conv fwd large tensor ( #3144 )
2025-11-04 00:34:48 +01:00
Emily Martins
2ec57a8e70
Replace CK_TILE_PIPELINE macros with a common enum
...
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.
2025-11-03 09:35:05 -07:00
Qianfeng Zhang
bc22b83b19
Add kUseTrLoad = false in non-trload pipeline
2025-11-03 12:40:16 +00:00
Qianfeng Zhang
e40ab20b9e
Clarifying the using of CK_TILE_HOST and CK_TILE_HOST_DEVICE trying to save compiling time
2025-11-03 09:40:03 +00:00
Sami Remes
16e85cf179
[CK_TILE] B matrix 2D block scale gemm ( #3074 )
...
* Refactor quant group size to be configurable for M/N/K, not just K
* add some asserts for configurations not implemented
* start setting of group size for N dimension
* enable 2d for reference quant gemm
* WIP: trying to figure out tile dstr and/or indexing for scale matrix
* WIP
* Fix handling of n dim blocks in tile windows etc
* remove commented code and enable all tests again
* fix formatting
* Add more specialized tile distributions
* Enable NWarps replication for bquant tile dstr
* fix formatting
* fix format
* Fix some issues from the merge
* fix formatting
* one more fix to tile dstr, and revert debug initialization
* Remove commented code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* simplify conditions that are needed for tile distributions
* only enable the working group sizes in tests
* fix formatting
* Update tile distribution for 2D bquant
* add some documentation and 2d block scale example
* fix formatting
* Add in Changlog and restructure the quant 2d example
* fix CMake
* support the change for blockscale 2d
* fix the test file
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Cong Ma <congma13@amd.com >
Co-authored-by: ThomasNing <thomas.ning@amd.com >
2025-11-02 16:49:20 -08:00
Aviral Goel
73f637894d
refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase ( #3132 )
...
* test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf
* refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase
* Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf"
This reverts commit 573c08a085 .
2025-11-02 00:06:28 -04:00
Qianfeng Zhang
e31829384d
Change in updating max_uih_seqlen in the example
2025-11-02 03:59:46 +00:00
Qianfeng Zhang
39cb8c33d1
Use supplement_array_by_last_element(num_targets, ) in example
2025-11-02 03:34:54 +00:00
Qianfeng Zhang
80e08b6efe
Use supplement_array_by_last_element() in example to simplify the codes
2025-11-02 03:17:16 +00:00