composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Author	SHA1	Message	Date
AMD-dteng	31ec8bd33a	use builtin function to wait A load's data	2025-09-17 06:54:32 -05:00
Feng Shijie	186c69960e	Change the preshuffle pattern of MFMA_16x16x128_F8	2025-09-15 06:07:34 +00:00
Feng Shijie	fdbb22d2b6	Add async_load for tile_scatter_gather	2025-09-12 05:32:45 +00:00
Feng Shijie	a44d35172d	Add async_load for tile_scatter_gather	2025-09-11 07:39:41 +00:00
Feng Shijie	2e36b3c3ff	Add buffer_load_lds synchronization	2025-09-11 07:06:52 +00:00
Feng Shijie	1c05d3be89	Disable hot schduler when enable buffer_load_lds	2025-09-11 06:40:11 +00:00
Feng Shijie	f4fdaedf4c	Add macro option to enable BUFFER_LOAD_LDS	2025-09-11 05:42:27 +00:00
Feng Shijie	e7c1c77120	Enable async_load in the F16xF4_flatmm_pipelineo on gfx950	2025-09-10 11:41:54 +00:00
Feng Shijie	b50c8ee350	enable async_load on gfx950 for a16w4 flatmm pipeline	2025-09-09 08:44:20 +00:00
Feng Shijie	9efde19691	Refactor scale and bias function, encapsulate scale/bias pointer to a tensor view and load data by tilewise operation	2025-09-09 03:52:03 +00:00
coderfeli	946be930dd	update	2025-09-07 08:22:07 +00:00
Feng Shijie	65b702454c	support swiglu activaion and use rcpf to accelerate silu	2025-08-26 12:32:29 +00:00
root	d05eed931d	add line to last	2025-08-22 04:01:59 -05:00
root	d69cab7f0c	adjust A_LDS descriptor to avoid bankconflict	2025-08-22 03:20:46 -05:00
root	65989e940c	enable hotloop	2025-08-21 09:46:52 -05:00
Feng Shijie	9fbcc8f8a4	use v4i32 as the storage type for B to avoid repack operation	2025-08-20 13:53:32 +00:00
Feng Shijie	c27eb0771a	optimize cvt_pkf4_to_f16 implementation	2025-08-20 04:39:14 +00:00
Feng Shijie	3ca0bd500a	optimize A_LDS descriptor to avoid bankconflict	2025-08-19 14:56:46 +00:00
Feng Shijie	be55c0f9cb	add fp16xf4 moe	2025-08-18 17:28:11 +00:00
Feng Shijie	599e1f5b32	rename example	2025-08-17 17:51:18 +00:00
Feng Shijie	7899fb4a8d	remove additional check when e8m0->float	2025-08-15 06:20:46 +00:00
Feng Shijie	714b341797	eliminate repeat dequant	2025-08-14 09:34:12 +00:00
Feng Shijie	53e8c0c533	Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm	2025-08-13 16:51:49 +00:00
Feng Shijie	5de6208952	update f16xMXF4	2025-08-13 16:16:48 +00:00
Feng Shijie	732ebdee8b	update scale-preshuffle for MXF4	2025-08-13 10:48:53 +00:00
Feng Shijie	edb58d0680	update	2025-08-11 11:24:34 +00:00
Feng Shijie	200a11afc8	update scale for mxfp4	2025-08-11 07:59:47 +00:00
Feng Shijie	f788d3d629	add mixed_prec fp16xfp4	2025-08-08 20:19:16 +00:00
Feng Shijie	3dea10a277	debug mixed_prec flatmm	2025-08-07 09:22:04 +00:00
Feng Shijie	6d3cbc7c0e	add moe_flatmm	2025-08-06 08:33:33 +00:00
coderfeli	c0cb4d036d	fix split k	2025-08-06 02:45:31 +00:00
Feng Shijie	3f43b841d4	prune debug message	2025-07-30 06:37:26 +00:00
Feng Shijie	2e5d4c74cd	fix compile error	2025-07-30 04:52:08 +00:00
Feng Shijie	c117a1986a	Add persistent option on flatmm for tuning	2025-07-29 15:42:58 +00:00
AMD-dteng	a587701117	update pipeline v1: add atomic IGLP schedule	2025-07-29 14:59:32 +00:00
Feng Shijie	1b6d7cf407	crz idea	2025-07-28 08:24:51 +00:00
Feng Shijie	5473f06461	Add permuteN optimzization when NRepeat % 2 == 0 on flatmm	2025-07-27 11:57:38 +00:00
lalala-sh	1239d8a546	merge flatmm -scale	2025-07-24 08:46:51 +00:00
Feng Shijie	b908f5e803	fix flatmm syntax error on gfx950	2025-07-23 19:12:31 +00:00
Feng Shijie	5a1183ebbd	support flatmm scaling	2025-07-23 19:04:22 +00:00
valarLip	89fa639207	merge flatmm pipe v0 from dteng_flatmm_opt	2025-07-23 09:50:33 +00:00
lalala-sh	3f7d848dd3	build pass	2025-07-23 15:38:12 +08:00
lalala-sh	6dacf833da	fix bug	2025-07-23 07:20:26 +00:00
lalala-sh	7e1bd4b839	sync	2025-07-23 15:01:53 +08:00
valarLip	46a538e39e	adaptive scheduler instead of Macro definition	2025-07-23 13:40:43 +08:00
lalala-sh	9aa3396a79	fix tail handler bug	2025-07-17 08:40:35 +00:00
lalala-sh	fb76450e63	merge from dteng_flatmm_opt	2025-07-16 10:12:19 +00:00
Khushbu Agarwal	d239b91fd5	Merge flatmm Operator with universal gemm (#2434) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted	2025-07-11 08:27:55 -07:00
linqunAMD	37e1a27537	[CK_TILE] Refine fp8 support in flatmm (#2239) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-25 01:07:45 -07:00
Satyanvesh Dittakavi	4c57157d50	Do not use warpSize as compile time constant as it is removed (#2320) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-06-17 11:54:30 -07:00

66 Commits