Commit Graph

251 Commits

Author SHA1 Message Date
Feng Shijie
85976b0b87 use int64_t as expert stride to avoid overflow 2025-08-21 06:58:55 +00:00
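The overflow this commit guards against is the classic 32-bit index wrap: a per-expert weight stride multiplied by an expert index can exceed `INT32_MAX`. A minimal host-side sketch with hypothetical sizes (64 experts, each owning a 7168×8192 weight block — not the kernel's actual layout):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical MoE layout: each expert owns a [7168 x 8192] weight block,
// so the per-expert element stride is 7168 * 8192 = 58'720'256 elements.
// Multiplying that by an expert index in 32-bit arithmetic wraps past
// INT32_MAX; promoting the stride to int64_t forces a 64-bit multiply.

int64_t expert_offset_i64(int expert_id, int64_t stride)
{
    return expert_id * stride; // promoted to 64-bit: no overflow
}

int32_t expert_offset_i32(int expert_id, int32_t stride)
{
    // Well-defined unsigned wrap, then reinterpreted as signed:
    // models the 32-bit-offset bug without invoking signed-overflow UB.
    return static_cast<int32_t>(static_cast<uint32_t>(expert_id) *
                                static_cast<uint32_t>(stride));
}
```

For expert 63 the 64-bit offset is 3 699 376 128 elements, which no longer fits in a signed 32-bit integer.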
Feng Shijie
9fbcc8f8a4 use v4i32 as the storage type for B to avoid repack operation 2025-08-20 13:53:32 +00:00
Feng Shijie
c27eb0771a optimize cvt_pkf4_to_f16 implementation 2025-08-20 04:39:14 +00:00
Feng Shijie
3ca0bd500a optimize A_LDS descriptor to avoid bank conflicts 2025-08-19 14:56:46 +00:00
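For context: LDS on AMD GPUs is organized as 32 banks of 4 bytes, and lanes whose addresses map to the same bank serialize. The usual descriptor-level fix is to pad the row pitch so strided accesses spread across banks. A hypothetical host-side illustration of the bank arithmetic (not the actual descriptor code):

```cpp
#include <cassert>
#include <set>

// AMD LDS: 32 banks, each 4 bytes wide; bank(addr) = (addr / 4) % 32.
// Count the distinct banks touched when 32 lanes each read one float
// from consecutive rows of a row-major tile with the given pitch
// (pitch expressed in floats).
int banks_touched(int pitch_floats)
{
    std::set<int> banks;
    for (int lane = 0; lane < 32; ++lane) {
        int addr = lane * pitch_floats * 4; // byte address of the row start
        banks.insert((addr / 4) % 32);
    }
    return static_cast<int>(banks.size());
}
```

With a pitch of 32 floats every lane hits bank 0 (fully serialized); padding the pitch to 33 spreads the 32 lanes across all 32 banks.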
Feng Shijie
f7f0306eea fix gate-up when GU_NRepeat > 1 2025-08-18 18:43:37 +00:00
Feng Shijie
be55c0f9cb add fp16xf4 moe 2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32 rename example 2025-08-17 17:51:18 +00:00
Feng Shijie
7899fb4a8d remove additional check when e8m0->float 2025-08-15 06:20:46 +00:00
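Background for this one: e8m0 is the OCP MX scale format, a bare 8-bit biased exponent encoding 2^(x−127), with 0xFF reserved for NaN. A sketch of the checked conversion next to the branch-free bit-shift variant that dropping the check enables (my reading of the commit, not the kernel's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// e8m0 stores only a biased exponent: value = 2^(x - 127), 0xFF = NaN.
float e8m0_to_float_checked(uint8_t x)
{
    if (x == 0xFF)
        return NAN;                      // reserved encoding
    return std::ldexp(1.0f, int(x) - 127);
}

// Branch-free variant: place the byte directly into the float exponent
// field. Valid for x in [1, 254]; skipping the NaN check is safe when
// the scales are known to be finite.
float e8m0_to_float_shift(uint8_t x)
{
    uint32_t bits = uint32_t(x) << 23;   // exponent field of binary32
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```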
Feng Shijie
714b341797 eliminate repeat dequant 2025-08-14 09:34:12 +00:00
Feng Shijie
53e8c0c533 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm 2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952 update f16xMXF4 2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b update scale-preshuffle for MXF4 2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680 update 2025-08-11 11:24:34 +00:00
Feng Shijie
cc9c7b9e58 optimize gemm2 atomic_add pattern 2025-08-11 08:38:23 +00:00
Feng Shijie
200a11afc8 update scale for mxfp4 2025-08-11 07:59:47 +00:00
Feng Shijie
f788d3d629 add mixed_prec fp16xfp4 2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277 debug mixed_prec flatmm 2025-08-07 09:22:04 +00:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
coderfeli
c0cb4d036d fix split k 2025-08-06 02:45:31 +00:00
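Split-k partitions the K dimension of a GEMM across workgroups, each reducing its own K-slice into a partial result that is then combined (on a GPU, via atomic adds or a second reduction kernel). A host-side sketch of the partitioning arithmetic on a dot product — illustrative only, with a hypothetical ceil-divide slicing:

```cpp
#include <cassert>
#include <vector>

// Split-k on a host-side dot product: each of `split_k` "workgroups"
// reduces its own K-slice, and the partials are summed afterwards.
float dot_split_k(const std::vector<float>& a, const std::vector<float>& b,
                  int split_k)
{
    const int K           = static_cast<int>(a.size());
    const int k_per_split = (K + split_k - 1) / split_k; // ceil-divide
    float total = 0.0f;
    for (int s = 0; s < split_k; ++s) {
        const int k_begin = s * k_per_split;
        const int k_end   = k_begin + k_per_split < K ? k_begin + k_per_split : K;
        float partial = 0.0f;
        for (int k = k_begin; k < k_end; ++k)
            partial += a[k] * b[k];
        total += partial; // the cross-workgroup reduction step
    }
    return total;
}
```

The tail handling (the clamp to `K`) is exactly the kind of boundary a split-k fix typically touches: the last slice may be shorter, or empty when `split_k` exceeds `K`.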
Feng Shijie
90e910f3a7 fix flatmm with scaling when WarpTileM == 32 2025-08-04 07:16:36 +00:00
Feng Shijie
aa5e008fa5 optimize scaling epilogue 2025-08-01 11:01:23 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd fix compile error 2025-07-30 04:52:08 +00:00
Feng Shijie
c117a1986a Add persistent option on flatmm for tuning 2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117 update pipeline v1: add atomic IGLP schedule 2025-07-29 14:59:32 +00:00
Feng Shijie
1b6d7cf407 crz idea 2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461 Add permuteN optimization when NRepeat % 2 == 0 on flatmm 2025-07-27 11:57:38 +00:00
sjfeng
bfb9f4002f try to remove c_shuffle_lds 2025-07-27 17:24:08 +08:00
Feng Shijie
1264f4d2ab fix loop-dim mismatch and improve c_shuffle alu parallelism 2025-07-25 07:41:48 +00:00
lalala-sh
1239d8a546 merge flatmm -scale 2025-07-24 08:46:51 +00:00
Feng Shijie
b908f5e803 fix flatmm syntax error on gfx950 2025-07-23 19:12:31 +00:00
Feng Shijie
5a1183ebbd support flatmm scaling 2025-07-23 19:04:22 +00:00
valarLip
89fa639207 merge flatmm pipe v0 from dteng_flatmm_opt 2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3 build pass 2025-07-23 15:38:12 +08:00
lalala-sh
6dacf833da fix bug 2025-07-23 07:20:26 +00:00
lalala-sh
7e1bd4b839 sync 2025-07-23 15:01:53 +08:00
valarLip
46a538e39e adaptive scheduler instead of Macro definition 2025-07-23 13:40:43 +08:00
lalala-sh
9aa3396a79 fix tail handler bug 2025-07-17 08:40:35 +00:00
lalala-sh
fb76450e63 merge from dteng_flatmm_opt 2025-07-16 10:12:19 +00:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
carlushuang
cfe211cc60 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example
2025-07-15 09:42:18 +08:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00
Qianfeng
45904b8fd7 Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* convert i_nhead_ type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
shay-li77
d814fefe18 support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
Haocong WANG
5557eadce6 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Po Yen Chen
b2dea90116 Eliminate warning caused by failing to meet the occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00
Thomas Ning
f240ae3248 Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment out non-gfx950 code

* comment out non-gfx950 code

* fix an error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compiling for host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
2025-07-07 10:08:49 -07:00
ltqin
9f4c5d7372 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* compiling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of kernel with pagedkv

* change names of files that are generated

* change host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
Thomas Ning
e03293ebce [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32x32x16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
2025-06-25 08:20:35 -07:00
linqunAMD
37e1a27537 [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp definition to a separate struct

* fix clang format

* fix clang format

* keep default behavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-25 01:07:45 -07:00