composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-04 21:51:28 +00:00

Author	SHA1	Message	Date
aska-0096	75f6f6bac4	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 09:04:41 +00:00
Yi DING	8e1eb0c1ee	[CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2	2025-08-12 17:02:52 +08:00
Cameron Shinn	352f87e684	Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ```	2025-08-12 13:44:01 +08:00
aska-0096	96d24497f5	fix conflict. disable all v-col instance for fmha fwd	2025-08-12 04:02:41 +00:00
aska-0096	1716171be4	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 03:52:34 +00:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
aska-0096	498d234ab8	change the warp setting for hdim32 fmha fwd	2025-08-11 15:37:37 +00:00
aska-0096	8c101ccb88	fix bug on non-gfx950	2025-08-08 18:35:53 +00:00
aska-0096	106edeecd9	remove non-necessary change	2025-08-08 09:07:40 +00:00
aska-0096	78edd7303b	bug fix, clang format;	2025-08-08 09:04:02 +00:00
aska-0096	3b9fb6af38	Remove unnecessary changes	2025-08-08 08:08:03 +00:00
aska-0096	6bb57c2c57	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-08 07:50:12 +00:00
aska-0096	1ecee378d5	remove unnecessary files; rename some files	2025-08-08 06:19:31 +00:00
aska-0096	b4640a9de6	merge fa_decode pipeline into fmha_fwd api	2025-08-08 05:46:18 +00:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Yi DING	15e8b6ccf7	[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627)	2025-08-06 20:04:23 +08:00
aska-0096	0d12fc944f	Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA	2025-08-04 10:27:42 +00:00
aska-0096	746f4ccb99	Load Q through lds, implement xor;	2025-08-04 06:49:01 +00:00
aska-0096	2d4e73d2b4	small refactor	2025-08-01 10:44:54 +00:00
aska-0096	a28b6e67fe	upgrade prefill pipeline; simple iglp; consistent data produce and consume order	2025-07-31 10:25:37 +00:00
aska-0096	75cba48682	enable larger tile size; upgrade xor pattern	2025-07-31 05:13:27 +00:00
aska-0096	69890afc98	remove all lds bankconflict with xor layouts	2025-07-30 12:25:33 +00:00
aska-0096	8dacc35c4c	enable prefill overload operator().	2025-07-30 03:51:06 +00:00
rocking	01642ca8b1	set default optdim (#2580)	2025-07-29 13:44:10 +08:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Andres Lugo	7fe50dc3da	Remove filter for only batch on receipt 4 (#2574) Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA	2025-07-28 14:53:24 -07:00
rocking	b36e0b029f	[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539) * uncomment all the headdim, use optdim to control * change default back to -1 * uncomment splitkv instance * Fix typo in receipt 4 for appendkv * support optdim for bwd, splitkv and appendkv * Fix 192 key error --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>	2025-07-28 17:16:32 +08:00
aska-0096	13bcc913de	fix the lds alignment caused performance regression	2025-07-25 07:10:01 +00:00
Illia Silin	1b6f024836	refactor fmha_bwd.py (#2546)	2025-07-23 09:09:56 -07:00
aska-0096	14e0ab70c6	tempsave. asynccopy+trload sanity checked	2025-07-22 08:04:05 +00:00
aska-0096	1b468bac0b	tempsave, trload+asyncload done	2025-07-21 05:55:55 +00:00
aska-0096	afd96d8180	compile pass	2025-07-18 10:04:34 +00:00
aska-0096	5616551115	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-07-18 05:17:27 +00:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510)	2025-07-18 09:59:38 +08:00
aska-0096	94b6430489	temp save	2025-07-17 10:06:09 +00:00
aska-0096	7e330553dc	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline	2025-07-17 07:24:32 +00:00
slippedJim	05b65d0c7c	update (#2519)	2025-07-17 15:24:19 +08:00
Andres Lugo	aadeffde18	Update FMHA recipe for Pytorch SDPA integration (#2480) * Add receipts in splitk and appendk * remove grouped * Remove logits --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 09:00:23 -07:00
aska-0096	18669925cc	temp save, change all instance to 1wave	2025-07-10 04:29:33 +00:00
Po Yen Chen	ad9863fe05	[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402) * Wrap tile size mapping as class method * Warp pipeline generating as class method * Add constraint as kernel dispatching criteria * Support multiple tile size for a (hdim, hdim_v) combination * Use smaller tile size if CU utilization is low * Use integer as the key of the tile size map * Fix type error * Simply override parent class method return value * Add attribute to eliminate warning * Allow using environment variables to turn on/off custom factory * Unify param naming style * Add missing HIP runtime include directive * Fix os.environ.get() usage	2025-07-09 22:01:33 +08:00
aska-0096	18686cfe5b	tempsave, fmha_decode	2025-07-08 08:37:20 +00:00
Haocong WANG	5557eadce6	[CK TILE] Fix FA build filter (#2369) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-07-08 10:42:07 +08:00
rahjain-amd	ad593c286f	Fixing Debug build (#2404) Failed to build `tile_example_fmha_bwd` due to below error ``` /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 | assert(slopes.size() == nhead); | ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 | (static_cast <bool> (expr) \ | ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here 989 | return run<FmhaBwdFp16>(arg_parser) ? 0 : -2; | ^ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 | assert(slopes.size() == nhead); | ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 | (static_cast <bool> (expr) \ | ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here 993 | return run<FmhaBwdBf16>(arg_parser) ? 0 : -2; | ^ 2 errors generated when compiling for gfx942. ``` Fixed with proper cast	2025-07-07 14:46:22 +05:30
ltqin	9f4c5d7372	ck tile pagedkv prefill (#2405) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * compiling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernel with pagedkv * change names of file that are generated * change host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-07 16:16:54 +08:00
Po Yen Chen	50fad03524	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs()	2025-06-25 15:19:21 +08:00
Anton Gorenko	77123600ee	Improve fmha_bwd tests performance (#2376) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequential ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices.	2025-06-24 07:45:24 -07:00
Yi DING	b8212864cf	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-24 01:33:31 +08:00
Linjun-AMD	61eb622e85	update the way to compute fmha fwd tflop, include mask type (#2386) * update the way to compute fwd tflop, include mask type Signed-off-by: JL-underdog <Jun.Lin@amd.com> * remove unnecessary comment * add necessary comment * remove some comment --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu>	2025-06-23 15:53:58 +08:00
aska-0096	47565f21a5	temp save, waiting for debug	2025-06-21 15:02:57 +00:00
aska-0096	e0a634ef97	save an example for __bf16 type	2025-06-19 05:11:52 +00:00
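
The byte-count fix in commit 352f87e684 can be checked by hand from the logged parameters. The sketch below is a hypothetical Python helper, not the example's actual code. It assumes the reported bandwidth counts one read each of Q, K, and V plus one write of O, at 2 bytes per bf16 element, with hdim_v equal to hdim_q (d:128/128 in the log). All function and parameter names are illustrative.

```python
def fmha_fwd_bytes(batch, nhead_q, nhead_k, seqlen_q, seqlen_k,
                   hdim_q, hdim_v, elem_size):
    """Bytes moved, assuming Q, K, V are each read once and O is written once."""
    q = batch * nhead_q * seqlen_q * hdim_q
    k = batch * nhead_k * seqlen_k * hdim_q  # K is sized by the number of K/V heads
    v = batch * nhead_k * seqlen_k * hdim_v  # V is sized by the number of K/V heads
    o = batch * nhead_q * seqlen_q * hdim_v
    return (q + k + v + o) * elem_size


def gb_per_s(num_bytes, elapsed_ms):
    """Convert a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9


# Parameters from the logged run: -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128, bf16.
shape = dict(batch=2, nhead_q=32, seqlen_q=1, seqlen_k=32768,
             hdim_q=128, hdim_v=128, elem_size=2)

fixed_bytes = fmha_fwd_bytes(nhead_k=8, **shape)   # K/V sized by nhead_k
buggy_bytes = fmha_fwd_bytes(nhead_k=32, **shape)  # K/V sized by nhead (the old behavior, as assumed here)

print(f"fixed: {gb_per_s(fixed_bytes, 0.163):.1f} GB/s")
print(f"buggy: {gb_per_s(buggy_bytes, 0.173):.1f} GB/s")
```

Under these assumptions the corrected count gives about 1647 GB/s at the rounded 0.163 ms, in line with the 1644.53 GB/s in the "After" log. Sizing K and V by the 32 query heads gives about 6207 GB/s at 0.173 ms, close to the inflated "Before" figure of 6202.95 GB/s.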


115 Commits