Commit Graph

1073 Commits

Author SHA1 Message Date
Qianfeng Zhang
76da618c85 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio 2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8 Update to support grouped mode hstu attention 2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67 Using in-place version of block_tile_reduce() so that using of m_local is avoided 2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() 2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7 Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument 2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2 Update to hstu masking to separate the implementation for cross-attention and self-attention 2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention 2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717 Add softmax selection to two of the testing scripts 2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling 2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4 Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q 2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3 Fix in K-LdsBuffer and V-LdsBuffer over-lap checking 2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7 Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop 2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06 Tiny fix in using v_tiles[] index 2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving 2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving 2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5 Load Q directly from global memory to registers for BlockGemm 2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e Remove un-used including from default policy file 2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092 Move common codes to detail namespace from Problem class scope 2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1 Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f) 2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b Add support of loading QK tiles of hdim96 without padding to hdim128 2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1 Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile 2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e Rename WarpTile in fwd setting 2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966 Simplifying the codes in defining KDram and QDram tile distribution 2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93 Tiny update in GetMaxVectorSize() 2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80 Rename and add scripts for testing hdim96 2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1 Enable hdim96 instances 2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54 Fix with regard to define stride in MakeKLdsBlockDescriptor() 2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV 2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb Further correction with regard to using n0_loops and k1_loops 2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8 Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning 2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857 Simplify the codes in block_gemm 2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519 Further clarification in using kSubQKHeaddim and kQKHeaddim 2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee Clarify the using of kSubQKHeaddim and kQKHeaddim 2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines 2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be Tiny fix in GetQKBlockGemm 2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d Enable the using of WarpTile-32x32x16 and add scripts to verify 2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e Add static_assert and comments in the with_softmax pipelines 2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950 2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline 2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7 Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md 2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window 2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857 Merge branch 'develop' into hstu_attention_mi350_fwd_bwd 2025-11-23 04:20:53 +00:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
  derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel'mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix ci error that some usage of lambda is not supported in c++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix c++17 unsupported error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa [ck_tile] Pooling example - Improved tile sizes (#3233)
* improved tile sizes

- modified tile sizes for improved example performance

* Update example/ck_tile/36_pooling/pool3d.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-19 15:30:18 +01:00