1069 Commits

Author SHA1 Message Date
Qianfeng Zhang
f2a555dac7 Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument 2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2 Update to hstu masking to separate the implementation for cross-attention and self-attention 2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention 2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717 Add softmax selection to two of the testing scripts 2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling 2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4 Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q 2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3 Fix in K-LdsBuffer and V-LdsBuffer over-lap checking 2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7 Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop 2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06 Tiny fix in using v_tiles[] index 2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving 2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving 2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5 Load Q directly from global memory to registers for BlockGemm 2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e Remove unused include from default policy file 2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092 Move common codes to detail namespace from Problem class scope 2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1 Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f) 2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b Add support of loading QK tiles of hdim96 without padding to hdim128 2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1 Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile 2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e Rename WarpTile in fwd setting 2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966 Simplifying the codes in defining KDram and QDram tile distribution 2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93 Tiny update in GetMaxVectorSize() 2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80 Rename and add scripts for testing hdim96 2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1 Enable hdim96 instances 2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54 Fix with regard to define stride in MakeKLdsBlockDescriptor() 2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV 2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb Further correction with regard to using n0_loops and k1_loops 2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8 Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning 2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857 Simplify the codes in block_gemm 2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519 Further clarification in using kSubQKHeaddim and kQKHeaddim 2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee Clarify the use of kSubQKHeaddim and kQKHeaddim 2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines 2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be Tiny fix in GetQKBlockGemm 2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d Enable the using of WarpTile-32x32x16 and add scripts to verify 2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e Add static_assert and comments in the with_softmax pipelines 2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950 2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline 2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7 Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md 2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window 2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857 Merge branch 'develop' into hstu_attention_mi350_fwd_bwd 2025-11-23 04:20:53 +00:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Removes the v2 suffix from the new CK Tile Tile Partitioner
  derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel'mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix CI error: some lambda usages are not supported in C++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix C++17 unsupported error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa [ck_tile] Pooling example - Improved tile sizes (#3233)
* improved tile sizes

- modified tile sizes for improved example performance

* Update example/ck_tile/36_pooling/pool3d.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-19 15:30:18 +01:00
John Shumway
ad57f6ef0b [CK_BUILDER] Put global CK functions in the CK namespace (#3232)
* Wrap ck host utilities in CK namespace.

The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.

Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.

There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CKTile) so that they can both coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library.

* Add using declarations to test code.

After moving some of the utils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.

* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
Aviral Goel
ac70206b2c feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… (#3225)
* feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test

* docs: Update CHANGELOG.MD
2025-11-18 09:32:27 -05:00
Yi DING
b6720531de [CK_TILE] MX Flatmm Split kernel instances (#3207)
* [CK_TILE] MX Flatmm Split kernel instances

* Fix flatmm example compile
2025-11-18 13:46:30 +08:00
Qianfeng Zhang
b75077475b Remove useless codes in the two trload pipelines 2025-11-15 13:48:50 +00:00