composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 04:07:56 +00:00

Author	SHA1	Message	Date
zhiming.ding	178ef223d2	Merge remote-tracking branch 'origin/develop' into shared/big_day_merge_asm_fix	2025-06-03 01:44:30 +00:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
lalala-sh	7e10e4d1d9	Merge branch 'develop' into wjx/mxfp4_moe_2Stages	2025-05-28 05:01:13 -05:00
root	b0c58577db	Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wjx/mxfp4_moe_2Stages	2025-05-28 06:31:56 +00:00
lalala-sh	50764c6662	Merge branch 'develop' into wjx/mxfp4_moe_2Stages	2025-05-28 01:25:55 -05:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplfied calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00
Aviral Goel	c52649ad57	Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234) * added catch statements to examples * clang format	2025-05-27 22:32:42 -07:00
mtgu0705	2f0ee8ccb1	change the gemm1 tile from 64x128x128 to 128x64x128	2025-05-27 20:43:38 -05:00
mtgu0705	52b764d59f	update MX moe GEMM1 hotloopscheduling	2025-05-27 20:43:22 -05:00
Zzz9990	ece38b9d7a	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit `aa9847e42d`. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com>	2025-05-26 19:17:18 +08:00
Illia Silin	8146e471f1	fix the buffer intrinsic names for clang >=20 (#2228)	2025-05-23 14:58:25 -07:00
feifei14119	2e39bf06f7	fix typo	2025-05-23 11:23:01 +00:00
mtgu0705	2216ff0521	update mx moe gemm1 gemm2 TF and BW calculation	2025-05-23 05:29:39 -05:00
mtgu0705	d6bfdc9d7d	update mx moe gemm1_bns tile size to 64x128x256	2025-05-23 05:10:45 -05:00
feifei14119	ce4e7b39da	gemm1 func pass	2025-05-23 09:26:38 +00:00
Illia Silin	1b846143c6	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227) This reverts commit `58f9e9ffbc`.	2025-05-22 15:41:17 -07:00
SamiAario-AMD	380bca2b85	Fix 11_add_rmsnorm2d_rdquant (#2207)	2025-05-20 15:15:28 -07:00
Thomas Ning	1386924749	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212) * Add small instance, add the bug fix, & improve the example CMake * clang format	2025-05-20 15:05:08 -07:00
mtgu0705	eb588bc083	update the TFlops calculation in the example	2025-05-20 10:00:13 -05:00
Sami Remes	d1e6f0982d	[CK_TILE] Grouped GEMM tile loop (#2146) * Add trait to use a persistent kernel and split the entrypoints in grouped gemm * Some helper functions for persistent kernel case * Get max occupancy grid using device properties * Implement tile loop in main entry point to grouped gemm * Enable GridSize() on device * Handle offset tile index using real current block index * Add persistent kernel choice to grouped gemm example * Use a for-loop for iterating over the group * Reduce VGPR spills by early-exit * Enable persistent kernel choice in grouped_gemm example * Add persistent kernel option to grouped_gemm test * Fix formatting with remod.py * Remove GridUpdateBlocks as blocks are now iteratively computed * Add comment about VGPR spilling * Fix formatting * Use CK_TILE_HOST instead of __host__ * Enable all Row/Col combinations in grouped gemm unit test * Add some KBatch=2 cases to grouped gemm tests * Fix SplitK for grouped gemm * Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm * Add type traits * Split examples to regular and tileloop * Formatting * Use hipExtStreamGetCUMask to get current active CUs for the given stream * Align test and example kernel config, and disable validation for splitk repeats * Remove debug options from CMakeLists.txt * Separate the code paths for persistent/non-persistent in test * Fix formatting * Address review comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-20 17:18:57 +03:00
mtgu0705	582dc7f842	enable fp4 moe no weigth preshuffle, function pass	2025-05-20 08:18:22 -05:00
mtgu0705	589e1dfea9	init mx fp4 B no preshuffle version	2025-05-20 04:40:22 -05:00
mtgu0705	680de28f77	commit with debug info	2025-05-19 21:47:35 -05:00
Aviral Goel	c4929225f6	remove debug statements from CMakeLists (#2204)	2025-05-19 17:31:04 -07:00
jefyang1	f18170064d	Use new mfma instructions for FP8 on gfx950 (#2202) * Add logic to use new mfma instructions for fp8 bf8 * Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format * Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> * Fix intrin_mfma f8 calls due to merge mistake --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-19 17:29:51 -07:00
jefyang1	b8b12bb81e	Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203) * Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 * Run clang format	2025-05-19 14:25:50 -07:00
mtgu0705	2e6fafaf75	updated code, build passed.	2025-05-18 22:29:32 -05:00
mtgu0705	a4b5a374b9	Merge remote-tracking branch 'origin/wip-f4-pk' into mx_moe_f4_scale_shuffle	2025-05-17 09:49:24 -05:00
mtgu0705	d4aaf9d0d0	Merge remote-tracking branch 'origin/moe_mx_fp4_for_aiter' into mx_moe_f4_scale_shuffle	2025-05-17 09:30:48 -05:00
mtgu0705	eeeba8901f	update code	2025-05-17 09:28:26 -05:00
OscarXu	6fb2b54ff4	mx_fp4 default parameter change	2025-05-17 01:03:28 -05:00
mtgu0705	94fb9190be	init moe mx f4 scale shuffle	2025-05-16 14:46:09 -05:00
aska-0096	248e287866	generalize the pipeline scheduling.	2025-05-16 10:41:59 +00:00
aska-0096	a0379d81e7	modify the way we represent fp4	2025-05-16 09:44:04 +00:00
OscarXu	ec8d00d58d	mx_moe_fp4 ready for aiter with clang-format.	2025-05-16 04:09:26 -05:00
OscarXu	39ff3fbf05	v3 function pass	2025-05-16 03:42:48 -05:00
OscarXu	c5be9a501b	v1 function pass.	2025-05-16 03:16:38 -05:00
aska-0096	a1bec7670a	tempsave	2025-05-16 08:14:56 +00:00
OscarXu	efd3c24587	minor fix	2025-05-16 01:02:51 -05:00
Po Yen Chen	8cb0474b3d	Use only qr_async pipeline for batch_prefill (#2195)	2025-05-15 11:47:29 -07:00
OscarXu	f70f778e27	v1 compile pass. Function not ready	2025-05-15 08:01:56 -05:00
Ding, Yi	9009d75c7a	Pack e8m0 as int32_t	2025-05-15 09:12:17 +00:00
aska-0096	062e16d54a	Improve the pipeline	2025-05-15 09:08:36 +00:00
OscarXu	68dbe558df	compile error fix	2025-05-15 16:55:20 +08:00
OscarXu	c0babbca62	Merge remote-tracking branch 'origin/fp4_gu_moe' into fp4_gu_moe_gemm1	2025-05-15 16:12:19 +08:00
OscarXu	17922821ec	Add gemm1 v1	2025-05-15 16:11:43 +08:00
mtgu0705	4e2ec31e4d	rename moe block selector and pipeline	2025-05-15 02:12:50 -05:00

641 Commits