composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
aska-0096	5616551115	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-07-18 05:17:27 +00:00
aska-0096	ae39c84f55	tempsave	2025-07-18 05:16:39 +00:00
Mateusz Ozga	7fc000d7b3	Fix CI clang-format (#2521 )	2025-07-17 14:41:29 +02:00
aska-0096	94b6430489	temp save	2025-07-17 10:06:09 +00:00
aska-0096	7e330553dc	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline	2025-07-17 07:24:32 +00:00
Haocong WANG	28072adc3a	fix mfma32x32 dispatch (#2490 )	2025-07-17 15:24:12 +08:00
Yi DING	f1d8ad2818	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>	2025-07-16 23:56:22 -07:00
Po Yen Chen	722c22fb15	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit `b2dea90116`.	2025-07-17 10:09:01 +08:00
linqunAMD	fbd9f32abe	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com>	2025-07-17 08:19:57 +08:00
Max Podkorytov	21fd7e9538	Merge branch 'develop' into test_copy_fix	2025-07-16 11:23:57 -07:00
linqunAMD	6e76b82059	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-16 07:58:23 -07:00
Illia Silin	a4bf78ac0e	replace obsolete warpSize system variable with the new one (#2496 )	2025-07-16 07:39:15 -07:00
aska-0096	d6df7bf851	fix vmcnt shift	2025-07-16 08:55:50 +00:00
aska-0096	40e039e4e4	Improve s_waitcnt_imm calculation	2025-07-16 08:37:07 +00:00
huaiguxu	c1badfd30c	Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-16 15:44:34 +08:00
MHYangAMD	3499fe67ff	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>	2025-07-16 14:05:26 +08:00
aska-0096	c30f8b709b	fix the s_waitcnt_imm calculation	2025-07-16 05:39:50 +00:00
aska-0096	e5cc4af808	Add block_sync_lds_direct_load utility	2025-07-16 03:54:33 +00:00
carlushuang	cfe211cc60	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example	2025-07-15 09:42:18 +08:00
Gino Lu	141bf2d54d	[CK_TILE] Add pk_fp4 data type (#2422 ) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-14 20:35:06 +08:00
Andriy Roshchenko	518dc21ae8	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-11 13:07:05 -06:00
Khushbu Agarwal	d239b91fd5	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted	2025-07-11 08:27:55 -07:00
Qianfeng	45904b8fd7	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com>	2025-07-11 18:14:47 +08:00
Illia Silin	1b66f3f4a3	Add declarations for atomic add for fp16 and unsigned short. (#2483 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revrt back to atomic add using casting	2025-07-10 07:18:56 -07:00
aska-0096	18669925cc	temp save, change all instance to 1wave	2025-07-10 04:29:33 +00:00
shay-li77	d814fefe18	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask	2025-07-09 23:18:55 +08:00
Yi DING	032ca60015	[CK_TILE] Avoid compile kernel in host pass (#2475 )	2025-07-09 22:27:54 +08:00
Illia Silin	93420ecf89	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474 ) This reverts commit `112b47e885`.	2025-07-08 19:01:26 -07:00
Illia Silin	112b47e885	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short	2025-07-08 18:09:30 -04:00
aska-0096	18686cfe5b	tempsave, fmha_decode	2025-07-08 08:37:20 +00:00
Haocong WANG	5557eadce6	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-07-08 10:42:07 +08:00
Illia Silin	e033a1b4bf	fix compilation errors with clang20 (#2464 )	2025-07-07 19:40:30 -07:00
Po Yen Chen	b2dea90116	Eliminate warning caused by failed to meet occupancy requirement (#2389 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-08 09:17:25 +08:00
Thomas Ning	f240ae3248	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com>	2025-07-07 10:08:49 -07:00
Andriy Roshchenko	054f85ab7c	MX GEMM - FP6 Example (#2419 ) Adds support for MX FP6 data type in MX GEMM block pipeline version v1. Provides an example of MX FP6 GEMM algorithm. --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com>	2025-07-07 10:33:26 -06:00
ltqin	9f4c5d7372	ck tile pagedkv prefill (#2405 ) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * comipling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernerl with pagedkv * change names of file that are generated * chang host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-07 16:16:54 +08:00
carlushuang	0aecb5ab68	default skip y point to r (#2457 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-06 23:54:34 -07:00
carlushuang	a8742f7e31	[CK_TILE][CORE] enhance slice_tile api (#2430 ) * support slice cross p * fix some bug in y_len * more case * fix a bug when R exist * support -1 to hint end of current length * format * change commit	2025-07-06 20:13:12 -07:00
Mingtao Gu	7998ae8969	[CK] Mxfp4 moe blockscale buf2lds version support (#2455 ) * change cshuffle size * added mxfp4 moe async buffer loading without B preshuffle * added mx moe B shuffling + scale shuffling (async loads) * minor fix --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-07-06 15:42:00 +08:00
Adam Osewski	3d70c638d1	Always force output clearing for grouped conv bwd data (#2446 ) * Always force output clearing * dont run set zero for residual --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-04 07:49:52 -06:00
Max Podkorytov	158ddeb8ce	[CK-TILE] File-level documentation for static encoding pattern (#2433 ) * add file-level comment * Finished the write-up --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-04 02:26:18 -07:00
Vidyasagar Ananthan	2e971eff90	Removing reference to undefined parameter for ignore statement. (#2447 )	2025-07-03 20:10:29 -07:00
damien-lejeune	1183824573	Fix clang in ck develop branch (#2445 ) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2025-07-02 10:07:47 -06:00
chenjun	74a34e0f50	fix KPerBlock = 64 a8w8 bpreshulle gemm build fail in gfx950 (#2437 ) Co-authored-by: valarLip <340077269@qq.com>	2025-07-02 19:12:07 +08:00
Gino Lu	60eb70f543	Fix return value bug that drops minus sign in some cases. (#2415 ) * fix return value bug. * refine change according to comment.	2025-07-02 14:53:00 +08:00
huaiguxu	e1c5172fdb	Huaiguxu/moe fp8 pertoken scale fix (#2391 ) * fix pertoken_scale a_scale dimension * clang-format * Fix moe_gemm2_fp8 perTokenScale reference and example.	2025-06-27 10:24:34 +08:00
linqunAMD	1749c0409e	[CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375 ) 1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it. 2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance. 3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.	2025-06-26 08:32:39 +08:00
Thomas Ning	e03293ebce	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>	2025-06-25 08:20:35 -07:00
Rostyslav Geyyer	daf71fb8e4	Enable fp4 tests (#2329 )	2025-06-25 07:38:54 -05:00
linqunAMD	37e1a27537	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-25 01:07:45 -07:00

922 Commits