Yi DING
8e1eb0c1ee
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
2025-08-12 17:02:52 +08:00
Cameron Shinn
352f87e684
Fix num_byte calculations to use nhead_k for K & V size ( #2653 )
...
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
2025-08-12 13:44:01 +08:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
Yi DING
15e8b6ccf7
[CK_TILE] Fix FMHA qr_async causing errors in FA ( #2627 )
2025-08-06 20:04:23 +08:00
rocking
01642ca8b1
set default optdim ( #2580 )
2025-07-29 13:44:10 +08:00
Yi DING
1926cd0cb8
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 ( #2130 )
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-07-29 09:31:14 +08:00
Andres Lugo
7fe50dc3da
Remove filter for only batch on receipt 4 ( #2574 )
...
Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA
2025-07-28 14:53:24 -07:00
rocking
b36e0b029f
[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control ( #2539 )
...
* uncomment all the headdim, use optdim to control
* change default back to -1
* uncomment splitkv instance
* Fix typo in receipt 4 for appendkv
* support optdim for bwd, splitkv and appendkv
* Fix 192 key error
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com >
2025-07-28 17:16:32 +08:00
Illia Silin
1b6f024836
refactor fmha_bwd.py ( #2546 )
2025-07-23 09:09:56 -07:00
Linjun-AMD
095393276a
h_dim256 fmha use async_qr pipeline ( #2510 )
2025-07-18 09:59:38 +08:00
slippedJim
05b65d0c7c
update ( #2519 )
2025-07-17 15:24:19 +08:00
Andres Lugo
aadeffde18
Update FMHA recipe for Pytorch SDPA integration ( #2480 )
...
* Add receipts in splitk and appendk
* remove grouped
* Remove logits
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
2025-07-10 09:00:23 -07:00
Po Yen Chen
ad9863fe05
[CK_TILE] Low CU utilization optimization for fMHA fwd kernels ( #2402 )
...
* Wrap tile size mapping as class method
* Warp pipeline generating as class method
* Add constraint as kernel dispatching criteria
* Support mutltiple tile size for a (hdim, hdim_v) combination
* Use smaller tile size if CU utilization is low
* Use integar as the key of the tile size map
* Fix type error
* Simply override parent class method return value
* Add attribute to eliminate warnging
* Allow using environment variables to turn on/off custom factory
* Unify param naming style
* Add missing HIP runtime include directive
* Fix os.environ.get() usage
2025-07-09 22:01:33 +08:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter ( #2369 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
2025-07-08 10:42:07 +08:00
rahjain-amd
ad593c286f
Fixing Debug build ( #2404 )
...
Failed to build `tile_example_fmha_bwd` due to below error
```
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here
989 | return run<FmhaBwdFp16>(arg_parser) ? 0 : -2;
| ^
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here
993 | return run<FmhaBwdBf16>(arg_parser) ? 0 : -2;
| ^
2 errors generated when compiling for gfx942.
```
Fixed with proper cast
2025-07-07 14:46:22 +05:30
ltqin
9f4c5d7372
ck tile pagedkv prefill ( #2405 )
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* comipling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernerl with pagedkv
* change names of file that are generated
* chang host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-07-07 16:16:54 +08:00
Po Yen Chen
50fad03524
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface ( #2403 )
...
* Rename batch_prerfill interface
* Add min_seqlen_q parameter in MakeKargs()
2025-06-25 15:19:21 +08:00
Anton Gorenko
77123600ee
Improve fmha_bwd tests performance ( #2376 )
...
* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
2025-06-24 07:45:24 -07:00
Yi DING
b8212864cf
[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 ( #2114 )
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-06-24 01:33:31 +08:00
Linjun-AMD
61eb622e85
update the way to compute fmha fwd tflop, include mask type ( #2386 )
...
* update the way to compute fwd tflop, include mask type
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* remove unneccessary comment
* add necessary comment
* remove some comment
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu >
2025-06-23 15:53:58 +08:00
Aviral Goel
aed0f5880c
Label CMakeLists message() as DEBUG or STATUS for clean build output ( #2301 )
...
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
2025-06-10 10:46:47 -07:00
slippedJim
57f497452a
remove restriction of group mode hd192 no lse ( #2252 )
...
Co-authored-by: Jim <jimguo12@amd.com >
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9
[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization ( #2251 )
...
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q ( #2221 )
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d .
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com >
2025-05-26 19:17:18 +08:00
Po Yen Chen
8cb0474b3d
Use only qr_async pipeline for batch_prefill ( #2195 )
2025-05-15 11:47:29 -07:00
Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines ( #2163 )
...
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to slience the compilation error
* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133 )
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com >
Co-authored-by: coderfeli <coderfeli@163.com >
2025-05-13 12:19:25 +08:00
Illia Silin
9a9f59ae69
Revert "Add ck tile examples to package ( #1880 )" ( #2150 )
2025-04-30 10:20:16 -07:00
jakpiase
434d19f696
Add ck tile examples to package ( #1880 )
...
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-04-28 09:53:19 -07:00
Po Yen Chen
3d4d70d2fc
Avoid using store_tile_raw() for fp32 tensors ( #2072 )
2025-04-26 23:07:41 -07:00
joyeamd
41541aff7a
SWDEV-52596 for hdim=256, when use splitkv pipeline, two new pipelines need to be added ( #2126 )
2025-04-25 16:31:09 +08:00
rocking
02ce6d39ea
Only generate specific hdim ( #2120 )
2025-04-24 18:52:58 +08:00
joyeamd
94d47b1680
fmha hdim256 vectorize improve ( #2086 )
...
For hdim 256, will not have vectorized buffer load when seqlen % 256 != 0 and hdim % 256 = 0; this commit tries to solve this condition.
2025-04-16 09:21:04 +08:00
slippedJim
5f885d2b7a
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
2025-04-10 23:21:13 +08:00
slippedJim
5a22b61de5
Add new receipt ( #2055 )
2025-04-07 14:18:01 +08:00
rocking
8a20b62e91
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-03-28 21:58:06 +08:00
carlushuang
6c08c5c46d
add mask support in hdim=192/128 ( #1999 )
2025-03-21 18:28:43 +08:00
carlushuang
e3c9886cdf
[CK_TILE] return value with macro in ck_tile::kernel_launch API ( #1982 )
...
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992 )
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
2025-03-20 11:00:29 -07:00
rocking
b819c217e4
Sync the kname with instance name ( #1989 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-03-20 00:06:45 +08:00
carlushuang
3e81279d26
Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" … ( #1971 )
...
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 )
This reverts commit 8cbcd3e0d0 .
* fix codegen problem
* Update config.hpp
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-03-13 11:41:39 +08:00
Illia Silin
8cbcd3e0d0
Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" ( #1969 )
...
This reverts commit 7a93b16ff6 .
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6
[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )
...
* support hdim=192/128 pair
* remove useless print
* update
2025-03-11 21:07:40 +08:00
Max Podkorytov
9e132eb77c
refactor ck-tile kernel launch ( #1925 )
2025-03-07 08:29:40 -08:00
Illia Silin
9b51c08bf7
remove support for gfx940 and gfx941 targets ( #1944 )
...
* remove support for gfx940 and gfx941 targets
* update changelog
2025-03-05 11:07:33 -08:00
rocking
faa2235dad
explicit show no feature in kernel name ( #1920 )
2025-02-28 14:23:30 +08:00
slippedJim
a9bcd3c98d
make fmha bwd api template for v2 & v3 ( #1918 )
...
* use template fmha_bwd function
* update
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-02-27 19:26:19 +08:00
rocking
e9ee568681
Apply filter to every kernel in the codgen of FMHA ( #1911 )
...
* add receipt for fwd
* Add receipt for bwd
* Use kernel name to avoid more receipt
* apply filter to every kernel
2025-02-26 20:20:29 +08:00
rocking
e4358c01d9
only output the deterministic bwd kernel for aiter ( #1903 )
...
* only output the deterministic kernel
* Add comment
2025-02-20 04:27:01 +08:00
rocking
f0d49d14fc
Add receipt 10~12 for codegen of aiter integration ( #1877 )
...
* Add receipt for aiter integration
* update receipt
* Add hdim 96 instances
* Revert "Add hdim 96 instances"
This reverts commit f339449f54 .
2025-02-19 09:01:08 +08:00
Andres Lugo
8086bbe3a7
Add receipt 4 option to codegen ( #1875 )
...
* Add receipt 4 option to codegen
* Remove repeated code
* Review comments
2025-02-11 10:11:46 -08:00