composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
OscarXu	52d68c9529	flag and barrier fix for compiler branch MainOpSelV3	2025-05-29 03:13:11 -05:00
OscarXu	653bc83f8a	Remove rocm6.3 workaround flags and macro	2025-05-28 21:05:21 -05:00
OscarXu	772debdf8f	Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format	2025-05-28 18:29:06 +08:00
OscarXu	fc9ef98e7b	Add gemm2 64x128x128 asm. Fix BF16 ref.	2025-05-21 16:57:57 +08:00
OscarXu	9fdfff82ea	Fix v2 topk_weight cal. Add silu asm.	2025-05-20 13:42:06 +08:00
OscarXu	f87973a4ac	Merge remote-tracking branch 'origin/moe_bs_stage1_dev' into moe_merge_v3_bs_for_aiter	2025-05-19 19:34:42 +08:00
OscarXu	a4ed178b4c	Add gemm2 v3 64x128x128	2025-05-19 19:33:04 +08:00
OscarXu	215ddb0be0	Merge remote-tracking branch 'origin/moe_bs_stage1_dev' into moe_merge_v3_bs_for_aiter	2025-05-17 04:35:43 -05:00
OscarXu	9c9bf1aece	Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmake flag til new compiler.	2025-05-17 04:33:52 -05:00
OscarXu	e4516b71c9	Merge branch 'moe_bs_stage1_dev' into moe_merge_v3_bs_for_aiter	2025-05-17 00:43:46 -05:00
OscarXu	98606fad94	Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into moe_bs_stage1_dev	2025-05-14 20:48:42 -05:00
OscarXu	c84ad83905	v1 fix for mi350.	2025-05-14 08:05:17 -05:00
valarLip	5f81710346	fix blockscale gemm bug	2025-05-14 08:23:49 +00:00
OscarXu	9ecbcce2e9	Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into moe_merge_v3_bs_for_aiter	2025-05-14 15:05:08 +08:00
OscarXu	b2c0c3cfeb	gemm1 v3 64x128x128 debug	2025-05-14 10:07:11 +08:00
OscarXu	50590b8f17	Add gemm1 v1 32x128x128	2025-05-13 16:46:51 +08:00
valarLip	eaebefb278	update moe v1 pipeline	2025-05-13 08:42:49 +00:00
OscarXu	9ead312164	Gemm1 GUFusion function pass. Perf WIP	2025-05-13 15:32:39 +08:00
OscarXu	19aa39b978	Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into moe_bs_stage1_dev	2025-05-13 10:46:40 +08:00
OscarXu	f392874e3d	revert unexpected file change	2025-05-13 09:55:33 +08:00
OscarXu	fe8bb251da	gemm1 up-only pass. GU WIP	2025-05-12 21:08:42 +08:00
OscarXu	0a845c0b7d	Merge remote-tracking branch 'origin/moe_blockscale_dev_gfx950' into moe_merge_v3_bs_for_aiter. Manual fix for thread_group_tensor_slice_transfer_v7r3_scatter	2025-05-12 19:25:39 +08:00
valarLip	3ea573acd6	update moe	2025-05-12 08:10:50 +00:00
valarLip	037a5ff76c	update moe	2025-05-12 06:36:23 +00:00
OscarXu	f1a534f6e7	Compile pass. Gemm1 function WIP	2025-05-12 13:20:16 +08:00
lalala-sh	d9ab2fe4f5	update moe	2025-05-12 05:08:58 +00:00
OscarXu	3e5005a47b	fix compile error. Gemm2 pass. Gemm1 WIP	2025-05-12 11:21:14 +08:00
OscarXu	7729fdb349	temp save for gemm1. Function not ready	2025-05-12 09:32:58 +08:00
Khushbu Agarwal	d8faf1c6a1	Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172) * Changes for updating tile distribution for shuffle and transpose * Fixed swizzle and transpose, removed comments * clang formatted * Adding support for bf16 type * Addressing review comments	2025-05-10 22:40:05 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
jefyang1	6b1a339b6f	Fix grouped conv bwd data tests on gfx950 (#2173)	2025-05-09 09:01:06 -07:00
OscarXu	03d5a50b45	Merge remote-tracking branch 'origin/wjx/align_v3_pipeline' into moe_bs_stage1_dev	2025-05-09 20:15:11 +08:00
Khushbu Agarwal	ef72a4b9bc	Disable SMFMA for gfx90a (#2182)	2025-05-09 00:18:07 -07:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
lalala-sh	6c459e8c38	Merge branch 'develop' into wjx/align_v3_pipeline	2025-05-08 18:08:31 +08:00
lalala-sh	def952a178	use mem_op::set when topk=1	2025-05-08 09:49:16 +00:00
lalala-sh	960b2bce1c	update	2025-05-08 09:48:23 +00:00
OscarXu	586c2a9a15	Merge remote-tracking branch 'origin/f8blk_scale_opt' into moe_blockscale_dev	2025-05-08 16:39:08 +08:00
aska-0096	68c7fe843d	temp save, a performant version.	2025-05-08 07:44:06 +00:00
Thomas Ning	c757046d49	Revert "Disable the SMFMA instruction for gfx90a. (#2174)" (#2175) This reverts commit `a32d907771`.	2025-05-08 00:07:03 -07:00
OscarXu	62877fa074	16x16 function merged to moe	2025-05-08 14:33:31 +08:00
Khushbu Agarwal	a32d907771	Disable the SMFMA instruction for gfx90a. (#2174) * remove smfma for gfx90a * clang formatted	2025-05-07 23:09:22 -07:00
BingYuan.Zhou	6a3960c1e1	Flatmm merge (#2168) * sync with function interface of cshuffleepiloge,fix flatmm build fail * move code from solin/flatmm which add mfma161632fp8 and optimize flatmm --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-08 12:59:57 +08:00
jakpiase	cb07ad84d5	fix for default epilogue (#2167)	2025-05-07 10:46:53 -07:00
OscarXu	8021128572	merge origin/f8blk_scale_opt	2025-05-07 17:10:03 +08:00
OscarXu	1fae385575	add warmup to asm launch	2025-05-07 16:55:42 +08:00
Aviral Goel	769336b640	[CK_TILE] Add type traits to detect tile window types at compile time (#2158) * added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline * added type traits instead of enum to tile_window() and tile_window_linear() with debug comments * removed comments, added documentation and clang format	2025-05-07 00:00:39 -07:00
OscarXu	c989bbe3aa	Update v1_128x128x128 to 2x2 instead of 4x1	2025-05-07 14:24:30 +08:00
Rostyslav Geyyer	8a0d659f92	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-06 09:24:00 -05:00
carlushuang	4e9b76f88c	[CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153) * combine 2-3 as single stage * support zeroing * improve long tokens * update specialization * b16 ws * 8bit topk optimize * update 15 example	2025-05-06 17:32:07 +08:00

867 Commits