composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-17 19:40:04 +00:00

Author	SHA1	Message	Date
Feng Shijie	c27eb0771a	optimize cvt_pkf4_to_f16 implementation	2025-08-20 04:39:14 +00:00
Feng Shijie	3ca0bd500a	optimize A_LDS descriptor to avoid bankconflict	2025-08-19 14:56:46 +00:00
Feng Shijie	f7f0306eea	fix gate-up when GU_NRepeat > 1	2025-08-18 18:43:37 +00:00
Feng Shijie	be55c0f9cb	add fp16xf4 moe	2025-08-18 17:28:11 +00:00
Feng Shijie	599e1f5b32	rename example	2025-08-17 17:51:18 +00:00
Feng Shijie	7899fb4a8d	remove additional check when e8m0->float	2025-08-15 06:20:46 +00:00
Feng Shijie	714b341797	eliminate repeat dequant	2025-08-14 09:34:12 +00:00
Feng Shijie	53e8c0c533	Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm	2025-08-13 16:51:49 +00:00
Feng Shijie	5de6208952	update f16xMXF4	2025-08-13 16:16:48 +00:00
Feng Shijie	732ebdee8b	update scale-preshuffle for MXF4	2025-08-13 10:48:53 +00:00
Feng Shijie	edb58d0680	update	2025-08-11 11:24:34 +00:00
Feng Shijie	cc9c7b9e58	optimize gemm2 atomic_add pattern	2025-08-11 08:38:23 +00:00
Feng Shijie	200a11afc8	update scale for mxfp4	2025-08-11 07:59:47 +00:00
Feng Shijie	87aed564dc	update case construction	2025-08-11 07:56:14 +00:00
Feng Shijie	8b85fa6cf2	update granularity control	2025-08-11 06:03:06 +00:00
Feng Shijie	1b8c7097b8	fix TileConfig	2025-08-11 03:42:46 +00:00
Gino Lu	8ba1c708dc	Add e8m0 scaled convert into CK_TILE (#2617) * first commit * remove redundant code * modify according to comments. * fix type_convert error with scaled_type_convert	2025-08-10 07:42:42 +00:00
Feng Shijie	f788d3d629	add mixed_prec fp16xfp4	2025-08-08 20:19:16 +00:00
Feng Shijie	3dea10a277	debug mixed_prec flatmm	2025-08-07 09:22:04 +00:00
lalala-sh	0ba513b148	Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk fix split k	2025-08-06 16:49:47 +08:00
Feng Shijie	6d3cbc7c0e	add moe_flatmm	2025-08-06 08:33:33 +00:00
coderfeli	c0cb4d036d	fix split k	2025-08-06 02:45:31 +00:00
Feng Shijie	90e910f3a7	fix flatmm with scaling when WarpTileM == 32	2025-08-04 07:16:36 +00:00
Feng Shijie	aa5e008fa5	optimize scaling epilogue	2025-08-01 11:01:23 +00:00
Feng Shijie	ac5908c0bb	fix wrong config for fp8 scaling	2025-08-01 07:28:38 +00:00
Feng Shijie	3f43b841d4	prune debug message	2025-07-30 06:37:26 +00:00
Feng Shijie	2e5d4c74cd	fix compile error	2025-07-30 04:52:08 +00:00
Feng Shijie	c117a1986a	Add persistent option on flatmm for tuning	2025-07-29 15:42:58 +00:00
AMD-dteng	a587701117	update pipeline v1: add atomic IGLP schedule	2025-07-29 14:59:32 +00:00
lalala-sh	f9e48148d2	fix error log throwing	2025-07-29 14:59:18 +00:00
Feng Shijie	1b6d7cf407	crz idea	2025-07-28 08:24:51 +00:00
Feng Shijie	5473f06461	Add permuteN optimization when NRepeat % 2 == 0 on flatmm	2025-07-27 11:57:38 +00:00
sjfeng	bfb9f4002f	try to remove c_shuffle_lds	2025-07-27 17:24:08 +08:00
Feng Shijie	1264f4d2ab	fix loop-dim mismatch and improve c_shuffle alu parallelism	2025-07-25 07:41:48 +00:00
lalala-sh	1239d8a546	merge flatmm -scale	2025-07-24 08:46:51 +00:00
lalala-sh	4066454483	revert delete of inc file	2025-07-24 16:19:58 +08:00
solin	68390988c9	reorg flatmm code	2025-07-24 04:38:16 +00:00
Feng Shijie	b908f5e803	fix flatmm syntax error on gfx950	2025-07-23 19:12:31 +00:00
Feng Shijie	5a1183ebbd	support flatmm scaling	2025-07-23 19:04:22 +00:00
valarLip	89fa639207	merge flatmm pipe v0 from dteng_flatmm_opt	2025-07-23 09:50:33 +00:00
lalala-sh	3f7d848dd3	build pass	2025-07-23 15:38:12 +08:00
lalala-sh	6dacf833da	fix bug	2025-07-23 07:20:26 +00:00
lalala-sh	7e1bd4b839	sync	2025-07-23 15:01:53 +08:00
valarLip	46a538e39e	adaptive scheduler instead of Macro definition	2025-07-23 13:40:43 +08:00
lalala-sh	9aa3396a79	fix tail handler bug	2025-07-17 08:40:35 +00:00
lalala-sh	fb76450e63	merge from dteng_flatmm_opt	2025-07-16 10:12:19 +00:00
MHYangAMD	3499fe67ff	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>	2025-07-16 14:05:26 +08:00
rahjain-amd	6b09f0823e	add missing condition for bf16 (#2502) Without this DataType = unknown - ``` sh Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s, ``` after this change ```sh Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s, ```	2025-07-15 21:25:56 +05:30
carlushuang	cfe211cc60	[CK_TILE] moe sorting optimize local_token (#2469) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example	2025-07-15 09:42:18 +08:00
Gino Lu	141bf2d54d	[CK_TILE] Add pk_fp4 data type (#2422) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-14 20:35:06 +08:00

2131 Commits