Qianfeng Zhang
8408ec0a02
Add scripts for testing the use of separate sequence lengths for k/v
2025-11-02 03:16:22 +00:00
Qianfeng Zhang
17e404be3b
Support separate sequence lengths for q and kv
2025-11-02 03:14:53 +00:00
Qianfeng Zhang
eaf9650fed
Use separate pipelines for the softmax and no-softmax cases
2025-10-30 10:01:52 +00:00
Qianfeng Zhang
207e6f10b8
Implementation of hstu attention pipeline using trload for v on mi350
2025-10-29 15:45:14 +00:00
Qianfeng Zhang
a464269bb6
Fix in the comments
2025-10-27 10:47:40 +00:00
Qianfeng Zhang
4eeb5cc917
Update gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding
2025-10-27 10:47:23 +00:00
Qianfeng Zhang
98a241a2eb
Use separate tile settings for the no-softmax and with-softmax hstu attention cases
2025-10-24 01:47:55 +00:00
Qianfeng Zhang
7c4012266a
Update benchmark scripts to account for the use of softmax
2025-10-23 10:09:37 +00:00
Qianfeng Zhang
d1505786f8
Add support of softmax in hstu attention
2025-10-20 14:26:55 +00:00
Qianfeng Zhang
a874839dc2
Add template parameter to gemm_0 MakeCBlockTile() for the need of defining PcompBlockTileType
2025-10-20 14:26:29 +00:00
Qianfeng Zhang
1a8f2f21fb
Move scaling by attn_scale to inside the main-loop
2025-10-20 14:22:18 +00:00
Qianfeng Zhang
bbda3f6f1c
Let IsTokenPairInsideMask() return bool type
2025-10-20 14:21:26 +00:00
Qianfeng Zhang
fdb89d3e2f
Add instances to consider for adding softmax support
2025-10-20 14:20:54 +00:00
Qianfeng Zhang
2072e53d1e
Remove K0 from tile setting since it is not used
2025-10-14 07:15:26 +00:00
Qianfeng Zhang
22a7b31865
Change the pipeline so that it is easier to add softmax support
2025-10-12 06:09:55 +00:00
Qianfeng Zhang
d308b09fae
Remove use of the IGLP method for instruction scheduling on the kUseLocal true path
2025-10-12 06:09:25 +00:00
Qianfeng Zhang
6b40ce4074
Fix in GetQKBlockGemm()
2025-09-27 14:59:32 +00:00
Qianfeng Zhang
27b96b15c4
Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm
2025-09-25 15:38:55 +00:00
Qianfeng Zhang
bd32cc0de0
Remove useless constant statement in the kernel
2025-09-19 07:24:29 +00:00
Qianfeng Zhang
db62a9f47e
Remove unnecessary HSTU_CHECK() calls
2025-09-13 16:39:14 +00:00
Qianfeng Zhang
2427426640
Add HSTU_CHECK() and use it in example codes
2025-09-13 16:38:33 +00:00
Qianfeng Zhang
a5b7360862
Small update in reference hstu attention
2025-09-13 06:53:54 +00:00
Qianfeng Zhang
798fc3cc0b
Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size
2025-09-12 09:11:47 +00:00
Qianfeng Zhang
7d10353fda
Unify the license statements on all the source files
2025-09-11 10:27:00 +00:00
Qianfeng Zhang
1c030e8c3c
Remove use of MakeKargsImpl() to simplify the hstu kernel
2025-09-10 15:28:12 +00:00
Qianfeng Zhang
72eb4e95d8
Clarify the use of kSubQKHeaddim and kQKHeaddim so that less regular hdim values (e.g. 96, 160) can be efficiently supported
2025-09-09 12:55:01 +00:00
Qianfeng Zhang
f8dea2bc86
Use set_slice_tile() to replace direct thread_buffer assignment
2025-09-09 12:54:32 +00:00
Qianfeng Zhang
4bf65d9fe5
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd and update the use of ck_tile::make_kernel
2025-09-01 07:35:35 +00:00
Aviral Goel
fcff0043ae
chore(gemm): clang format to pass CI (#2758)
2025-08-29 00:38:46 -07:00
Vijay Krish
4208e28988
ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
...
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2025-08-28 23:43:02 -07:00
Cong Ma
428090f749
Support transposed C tile in Aquant (#2679)
...
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
2025-08-28 13:28:09 -07:00
asleepzzz
038ea82315
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
...
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
linqunAMD
4a49dac7c6
[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
...
* fix copy basic build error
* fix other ck tile test build error
2025-08-28 20:30:30 +08:00
Yi DING
ead4447b20
[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
...
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
2025-08-28 18:54:18 +08:00
Aviral Goel
f5f795c4d6
feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691)
...
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
2025-08-27 18:17:24 -07:00
Cong Ma
cd53e2e57e
[CK TILE GEMM] Fix a merge conflict (#2753)
...
* Fixed a merge conflict in 245467f3
* Format the code
2025-08-27 11:08:09 -07:00
Cong Ma
245467f359
[CK TILE] Fix bugs in AQuant preshuffle (#2700)
...
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant work with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant work with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-08-27 00:05:54 -07:00
Tianyuan Wu
e9605ed36d
[CK_TILE] Fix the Wrong Output Generated by Gemm Examples on GFX11/12 (#2713)
...
* Introduce macro CK_TILE_USE_WMMA
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
* Make CK_TILE_USE_WMMA global for all examples
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
* Remove CK_TILE_USE_WMMA from config.hpp
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
---------
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-08-25 12:55:35 -07:00
Yi DING
4cfa2c7158
[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682)
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
2025-08-22 10:01:10 +08:00
Qianfeng Zhang
d4f43f0653
Use xor transform to implement Q/K Lds descriptor for kKpack == 8 cases
2025-08-21 14:16:34 +00:00
Bartłomiej Kocot
4212bbc170
[CK Tile] Grouped convolution backward data (#2652)
...
* base working version for single grouped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
2025-08-20 05:29:57 -07:00
Qianfeng Zhang
19fc2a9051
Remove selectable VLayout to simplify the code, since hdim is always the fastest dimension
2025-08-20 09:06:01 +00:00
Haocong WANG
81b265cf91
[CK_TILE] Update the fmhafwd dispatch logic (#2698)
...
* update the fmhafwd dispatch logic
* Fix fmha test scripts
* Fix bash
---------
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-08-20 16:24:43 +08:00
joyeamd
a1589a9667
fix grouped gemm example when wave32 enabled (#2707)
...
1. delete some unused variables
2. fix BlockSize when wave32 enabled
2025-08-19 16:20:43 +08:00
mirchen-amd
60320e90c1
Mirchen/gemm blockscale wp segfault fix (#2638)
...
* Add stride validation to prevent segfault in blockscale GEMM
* run clang-format
* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
* added stride length checking to more gemm examples in ckprofiler
* ran clang format
* added validation header and implement in core gemm operations
* remove ck_tile transpose and gemm stages from CI (#2646)
* update CK build instruction step 4 (#2563)
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* Fixes to "General 2D Reduction Kernel" (#2535) (#2656)
* fix reduce2d
- revert the combine_partial_results() changes
- remove auto from function def
* clang-format
* enable aiter test_mha in daily CI (#2659)
* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
* Fix num_byte calculations to use nhead_k for K & V size (#2653)
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
* [CK_TILE] FMHA BWD Decode Pipeline (#2643)
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
* fix (#2668)
* Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit b7322a521a.
* added batch stride checking to batched gemm ops in profiler
* removed batch stride validation
* removed batched stride validation again
* Update include/ck/library/utility/profiler_validation_common.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
* refactor function names
* added gemm stride checking to more profiler gemm operations
* run clang format
* add stride checking to 01 gemm example
* rename from profiler to validation common, used for examples and profiler
* build of ckProfiler success
* update file headers
---------
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-19 01:19:17 -07:00
Qianfeng Zhang
15e6be5c79
Use separate settings for gfx942 and gfx950
2025-08-19 08:01:41 +00:00
Max Podkorytov
f38751fc2a
invoke script directly (#2687)
2025-08-19 00:23:07 -07:00
linqunAMD
9fcc1ee9fd
Support Wave32 in CK_TILE - Part 1 (#2594)
...
* Support wave32/wave64 in CK_TILE - Part 1
* remove blocksize in kernel launch
* fix build error
* fix clang format
* fix clang format 2
* fix clang format 3
* fix fmha build error
* fix fmha build 2
* fix fmha build 3
* fix build error 4
* address review comment
* update change log
* replace KernelBlockSize with kBlockSize
* fix CI fail
* fix clang format
* address review comment and rebase code.
* fix universal test fail
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-18 10:08:31 -07:00
Qianfeng Zhang
7b68b6eebb
Tiny change in pipeline BlockGemm definition to adapt to the latest merge with the develop branch
2025-08-18 14:51:26 +00:00
Qianfeng Zhang
89cd5ff5fe
Merge branch 'develop' into hstu_attention_n0loop_fused_unroll
2025-08-18 14:48:44 +00:00