aska-0096
75f6f6bac4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-12 09:04:41 +00:00
Yi DING
8e1eb0c1ee
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
2025-08-12 17:02:52 +08:00
Cameron Shinn
352f87e684
Fix num_byte calculations to use nhead_k for K & V size ( #2653 )
...
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
2025-08-12 13:44:01 +08:00
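Note on the fix above: a minimal sketch of the byte accounting, assuming Q and O are sized by the query head count while K and V are sized by `nhead_k`. The function name and signature are hypothetical, not the example's actual code; only the sizes come from the command line shown in the entry.

```python
def fmha_fwd_num_bytes(batch, nhead_q, nhead_k, seqlen_q, seqlen_k,
                       hdim_q, hdim_v, elem_bytes):
    """Bytes moved by one forward pass: read Q, K, V and write O (hypothetical helper)."""
    q = batch * nhead_q * seqlen_q * hdim_q
    k = batch * nhead_k * seqlen_k * hdim_q  # K uses nhead_k, not nhead_q
    v = batch * nhead_k * seqlen_k * hdim_v  # V uses nhead_k, not nhead_q
    o = batch * nhead_q * seqlen_q * hdim_v
    return (q + k + v + o) * elem_bytes


# Sizes from the run above: -b=2 -h=32 -h_k=8 -s=1 -s_k=32768 -d=128, bf16 (2 bytes).
fixed = fmha_fwd_num_bytes(2, 32, 8, 1, 32768, 128, 128, 2)
# Reproduce the old over-count by sizing K and V with the query head count.
buggy = fmha_fwd_num_bytes(2, 32, 32, 1, 32768, 128, 128, 2)

# GB/s from the rounded kernel times printed in the log (0.163 ms and 0.173 ms).
fixed_gbps = fixed / 0.163e-3 / 1e9
buggy_gbps = buggy / 0.173e-3 / 1e9
```

With these inputs the two results land within a fraction of a percent of the 1644.53 GB/s and 6202.95 GB/s figures in the log; the small gap comes from the rounded millisecond values.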
aska-0096
96d24497f5
fix conflict. disable all v-col instance for fmha fwd
2025-08-12 04:02:41 +00:00
aska-0096
1716171be4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-12 03:52:34 +00:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
aska-0096
498d234ab8
change the warp setting for hdim32 fmha fwd
2025-08-11 15:37:37 +00:00
aska-0096
8c101ccb88
fix bug on non-gfx950
2025-08-08 18:35:53 +00:00
aska-0096
106edeecd9
remove unnecessary change
2025-08-08 09:07:40 +00:00
aska-0096
78edd7303b
bug fix, clang format;
2025-08-08 09:04:02 +00:00
aska-0096
3b9fb6af38
Remove unnecessary changes
2025-08-08 08:08:03 +00:00
aska-0096
6bb57c2c57
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-08 07:50:12 +00:00
aska-0096
1ecee378d5
remove unnecessary files; rename some files
2025-08-08 06:19:31 +00:00
aska-0096
b4640a9de6
merge fa_decode pipeline into fmha_fwd api
2025-08-08 05:46:18 +00:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
Yi DING
15e8b6ccf7
[CK_TILE] Fix FMHA qr_async causing errors in FA ( #2627 )
2025-08-06 20:04:23 +08:00
aska-0096
0d12fc944f
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
2025-08-04 10:27:42 +00:00
aska-0096
746f4ccb99
Load Q through lds, implement xor;
2025-08-04 06:49:01 +00:00
aska-0096
2d4e73d2b4
small refactor
2025-08-01 10:44:54 +00:00
aska-0096
a28b6e67fe
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
2025-07-31 10:25:37 +00:00
aska-0096
75cba48682
enable larger tile size; upgrade xor pattern
2025-07-31 05:13:27 +00:00
aska-0096
69890afc98
remove all LDS bank conflicts with XOR layouts
2025-07-30 12:25:33 +00:00
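Note on XOR layouts: a minimal sketch of the general XOR-swizzle idea, not the layout used in this branch. It assumes 32 LDS banks and a tile whose row stride equals the bank count; both are illustrative choices, and the function names are hypothetical.

```python
# Assumption for illustration: 32 LDS banks, one word per bank slot.
NUM_BANKS = 32


def bank_linear(row, col, row_stride=NUM_BANKS):
    """Bank hit by element (row, col) in a plain row-major layout."""
    return (row * row_stride + col) % NUM_BANKS


def bank_xor(row, col, row_stride=NUM_BANKS):
    """Bank hit when the column index is XOR-ed with the row index before storing."""
    swizzled_col = col ^ (row % NUM_BANKS)
    return (row * row_stride + swizzled_col) % NUM_BANKS


# One lane per row, every lane reading column 0 (a transposed access).
linear_banks = {bank_linear(r, 0) for r in range(NUM_BANKS)}
xor_banks = {bank_xor(r, 0) for r in range(NUM_BANKS)}
print(len(linear_banks), len(xor_banks))
```

In the plain layout all 32 lanes hit the same bank, while the XOR-ed layout spreads them over 32 distinct banks. Within a row the XOR is a permutation of the columns, so no two elements of a row collide.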
aska-0096
8dacc35c4c
enable prefill overload operator().
2025-07-30 03:51:06 +00:00
rocking
01642ca8b1
set default optdim ( #2580 )
2025-07-29 13:44:10 +08:00
Yi DING
1926cd0cb8
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 ( #2130 )
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Andres Lugo
7fe50dc3da
Remove filter for only batch on receipt 4 ( #2574 )
...
Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA
2025-07-28 14:53:24 -07:00
rocking
b36e0b029f
[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control ( #2539 )
...
* uncomment all the headdim, use optdim to control
* change default back to -1
* uncomment splitkv instance
* Fix typo in receipt 4 for appendkv
* support optdim for bwd, splitkv and appendkv
* Fix 192 key error
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
2025-07-28 17:16:32 +08:00
aska-0096
13bcc913de
fix the lds alignment caused performance regression
2025-07-25 07:10:01 +00:00
Illia Silin
1b6f024836
refactor fmha_bwd.py ( #2546 )
2025-07-23 09:09:56 -07:00
aska-0096
14e0ab70c6
tempsave. asynccopy+trload sanity checked
2025-07-22 08:04:05 +00:00
aska-0096
1b468bac0b
tempsave, trload+asyncload done
2025-07-21 05:55:55 +00:00
aska-0096
afd96d8180
compile pass
2025-07-18 10:04:34 +00:00
aska-0096
5616551115
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-07-18 05:17:27 +00:00
Linjun-AMD
095393276a
h_dim256 fmha use async_qr pipeline ( #2510 )
2025-07-18 09:59:38 +08:00
aska-0096
94b6430489
temp save
2025-07-17 10:06:09 +00:00
aska-0096
7e330553dc
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
2025-07-17 07:24:32 +00:00
slippedJim
05b65d0c7c
update ( #2519 )
2025-07-17 15:24:19 +08:00
Andres Lugo
aadeffde18
Update FMHA recipe for Pytorch SDPA integration ( #2480 )
...
* Add receipts in splitk and appendk
* remove grouped
* Remove logits
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-10 09:00:23 -07:00
aska-0096
18669925cc
temp save, change all instance to 1wave
2025-07-10 04:29:33 +00:00
Po Yen Chen
ad9863fe05
[CK_TILE] Low CU utilization optimization for fMHA fwd kernels ( #2402 )
...
* Wrap tile size mapping as class method
* Wrap pipeline generation as class method
* Add constraint as kernel dispatching criteria
* Support multiple tile sizes for a (hdim, hdim_v) combination
* Use smaller tile size if CU utilization is low
* Use integer as the key of the tile size map
* Fix type error
* Simply override parent class method return value
* Add attribute to eliminate warning
* Allow using environment variables to turn on/off custom factory
* Unify param naming style
* Add missing HIP runtime include directive
* Fix os.environ.get() usage
2025-07-09 22:01:33 +08:00
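Note on "Use smaller tile size if CU utilization is low": a minimal sketch of one way such a rule could look, assuming the grid size is `batch * nhead * ceil(seqlen_q / tile_m)` and that a launch with fewer workgroups than compute units counts as low utilization. The helper name, the candidate tile list, and the threshold are all hypothetical and not taken from the PR.

```python
import math


def pick_tile_m(batch, nhead, seqlen_q, tile_candidates, num_cus):
    """Return the largest tile (along seqlen_q) that still yields at least
    num_cus workgroups; fall back to the smallest tile if none does."""
    for tile_m in sorted(tile_candidates, reverse=True):
        workgroups = batch * nhead * math.ceil(seqlen_q / tile_m)
        if workgroups >= num_cus:
            return tile_m
    return min(tile_candidates)


# Small problem: even the smallest tile cannot fill 304 CUs, so take the smallest.
small = pick_tile_m(1, 8, 512, [128, 64, 32], num_cus=304)
# Large problem: the biggest tile already produces plenty of workgroups.
large = pick_tile_m(8, 32, 4096, [128, 64, 32], num_cus=304)
```

The trade-off is that smaller tiles raise occupancy on small problems at the cost of less data reuse per workgroup, which is why the sketch prefers the largest tile that still fills the device.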
aska-0096
18686cfe5b
tempsave, fmha_decode
2025-07-08 08:37:20 +00:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter ( #2369 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
rahjain-amd
ad593c286f
Fixing Debug build ( #2404 )
...
Failed to build `tile_example_fmha_bwd` due to the error below:
```
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here
989 | return run<FmhaBwdFp16>(arg_parser) ? 0 : -2;
| ^
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here
993 | return run<FmhaBwdBf16>(arg_parser) ? 0 : -2;
| ^
2 errors generated when compiling for gfx942.
```
Fixed with a proper cast.
2025-07-07 14:46:22 +05:30
ltqin
9f4c5d7372
ck tile pagedkv prefill ( #2405 )
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an initial version of pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* compiling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* initial version of kernel with pagedkv
* change names of file that are generated
* change host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
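Note on paged KV: a minimal sketch of the usual logical-to-physical lookup behind a paged KV cache, assuming a per-sequence block table that maps logical page indices to physical block ids. The names `block_table` and `paged_kv_location` are hypothetical; only `page_block_size` appears in the entries above.

```python
def paged_kv_location(block_table, token_idx, page_block_size):
    """Map a logical key/value position to (physical block id, offset in block)."""
    logical_block, offset = divmod(token_idx, page_block_size)
    return block_table[logical_block], offset


# Hypothetical table: logical pages 0, 1, 2 live in physical blocks 7, 2, 9.
block_table = [7, 2, 9]
first = paged_kv_location(block_table, 0, 128)
middle = paged_kv_location(block_table, 130, 128)
last = paged_kv_location(block_table, 383, 128)
```

Because the physical id of the next page depends only on the block table, it can be looked up while the current page is still being processed, which is the kind of prefetch the first bullet describes.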
Po Yen Chen
50fad03524
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface ( #2403 )
...
* Rename batch_prefill interface
* Add min_seqlen_q parameter in MakeKargs()
2025-06-25 15:19:21 +08:00
Anton Gorenko
77123600ee
Improve fmha_bwd tests performance ( #2376 )
...
* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequential ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
2025-06-24 07:45:24 -07:00
Yi DING
b8212864cf
[CK_TILE] FMHA Support hdim_v as a Multiple of 32 ( #2114 )
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-24 01:33:31 +08:00
Linjun-AMD
61eb622e85
update the way to compute fmha fwd tflop, include mask type ( #2386 )
...
* update the way to compute fwd tflop, include mask type
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* remove unnecessary comment
* add necessary comment
* remove some comment
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu>
2025-06-23 15:53:58 +08:00
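Note on mask-aware TFlops: a minimal sketch of a FLOP count that depends on the mask, assuming two GEMMs (Q·Kᵀ and P·V) at 2 FLOPs per multiply-add and counting only unmasked (query, key) pairs. The bottom-right alignment of the causal mask and the function name are assumptions for illustration, not the formula from the PR.

```python
def fmha_fwd_flops(batch, nhead, seqlen_q, seqlen_k, hdim_q, hdim_v, causal=False):
    """FLOPs of one forward pass, counting only visible (query, key) pairs."""
    if causal:
        # Assumed bottom-right aligned causal mask:
        # query i sees keys 0 .. i + (seqlen_k - seqlen_q).
        shift = seqlen_k - seqlen_q
        visible = sum(max(0, min(seqlen_k, i + shift + 1)) for i in range(seqlen_q))
    else:
        visible = seqlen_q * seqlen_k
    # Q.K^T costs 2 * hdim_q per pair, P.V costs 2 * hdim_v per pair.
    return 2 * batch * nhead * visible * (hdim_q + hdim_v)


# Unmasked decode case from the #2653 log: b=2, h=32, s=1, s_k=32768, d=128.
flops = fmha_fwd_flops(2, 32, 1, 32768, 128, 128)
tflops = flops / 0.163e-3 / 1e12  # using the rounded 0.163 ms from that log

# Square causal case: 4 queries see 1 + 2 + 3 + 4 = 10 pairs instead of 16.
causal_flops = fmha_fwd_flops(1, 1, 4, 4, 8, 8, causal=True)
```

For the unmasked case this lands close to the 6.58 TFlops printed in the #2653 log. With a causal mask the count drops toward half of the unmasked value as the sequence grows, so a TFlops figure that ignores the mask would overstate throughput.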
aska-0096
47565f21a5
temp save, waiting for debug
2025-06-21 15:02:57 +00:00
aska-0096
e0a634ef97
save an example for __bf16 type
2025-06-19 05:11:52 +00:00