composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 12:17:00 +00:00

Author	SHA1	Message	Date
Ville Pietilä	558054eadb	WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases.	2025-09-26 13:38:24 +00:00
Ville Pietilä	8babf7195a	Fix strides in 1D conv to gemm transformation.	2025-09-26 09:38:11 +00:00
Ville Pietilä	354dd5039c	Add compile check for assumed row-major layout.	2025-09-26 08:39:39 +00:00
Ville Pietilä	1764c77fb2	Enable running multiple GEMM batches of merged conv groups.	2025-09-26 07:51:29 +00:00
Ville Pietilä	b864c077ed	Code clean-up for bwd tensor transformations.	2025-09-25 15:09:08 +00:00
Ville Pietilä	0ea3268d5d	Remove debug and other dead code.	2025-09-25 09:41:33 +00:00
Ville Pietilä	cc7433efc6	Add more comments, disable debug code.	2025-09-25 09:37:15 +00:00
Ville Pietilä	97f842f2c6	Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding.	2025-09-25 09:30:50 +00:00
Ville Pietilä	625a78b17b	WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding.	2025-09-24 15:08:01 +00:00
Ville Pietilä	7280df1bc3	Add one more unit test for tensor view.	2025-09-24 12:10:26 +00:00
Ville Pietilä	8048d6ff73	Fix build.	2025-09-23 11:17:08 +00:00
Ville Pietilä	e6f6c4a6a3	Working baseline for depthwise convolution with merged conv groups.	2025-09-23 11:14:10 +00:00
Ville Pietilä	29e3112b9b	Epilogue fixes.	2025-09-22 15:38:02 +00:00
Ville Pietilä	d7da3d5089	Offset fixes.	2025-09-22 15:37:46 +00:00
Ville Pietilä	7dfbac5d0b	WIP: Separate epilogue for merged conv groups.	2025-09-19 13:52:33 +00:00
Ville Pietilä	af6838e5dc	Integration test for CShuffle epilogue.	2025-09-19 12:09:08 +00:00
Ville Pietilä	7f52f84167	Fix tile window size for c block.	2025-09-19 08:08:19 +00:00
Ville Pietilä	6bcdb0947e	LDS to global memory copy.	2025-09-18 14:59:32 +00:00
Ville Pietilä	0e09504057	WIP: merged conv groups GEMM epilogue changes.	2025-09-17 14:25:02 +00:00
Ville Pietilä	27a2ceb4f7	Increase the max number of reported errors.	2025-09-17 12:29:12 +00:00
Ville Pietilä	4ec81cb95c	Add more logging.	2025-09-17 12:27:51 +00:00
Ville Pietilä	6d318ab481	Enable running multiple conv groups per batch.	2025-09-12 14:03:04 +00:00
Ville Pietilä	0d5c1b9638	WIP: Merged conv groups epilogue.	2025-09-11 15:24:36 +00:00
Ville Pietilä	970b40aa6c	WIP: Merged conv groups offset calculation.	2025-09-09 11:33:31 +00:00
Ville Pietilä	d9f0a9cdd0	Fully working conv group merging for TransformConvBwdWeightToGemm.	2025-09-09 09:58:43 +00:00
Ville Pietilä	8845b23254	WIP: Tensor transformations.	2025-09-08 15:41:54 +00:00
Ville Pietilä	61b3c96273	Add number of groups to merge to ck tile grouped gemm example.	2025-09-04 14:24:23 +00:00
Ville Pietilä	2b1908a375	Fix compilation of the grouped conv examples.	2025-09-04 12:01:49 +00:00
linqunAMD	e2d28a92af	Extend XDL kernel to Support RDNA3/4 - Part 2 (#2722) Update Blockwise and Gridwise files to support both wave32 & wave64. 1. Calculate WaveSize from a template parameter instead of hard-coding it to 64; some literal "64" values are also replaced with WaveSize. 2. Move BN0Shuffled and BK0Shuffled to the device side, because the correct mfma inst info is not available on the host side. 3. Update b_thread_offset_n and b_thread_offset_k in gridwise_gemm_xdl_cshuffle_v3_b_scale.hpp for gfx11. On gfx11, input data is duplicated for each 16 threads, which differs from all other targets. 4. Modify a1_threadwise_copy in gridwise_batched_gemmgemm for gfx11. On gfx11, the input must be duplicated and A swizzled if transposeC isn't enabled.	2025-09-04 08:33:40 +08:00
arai713	0282d98412	[CK TILE] Stream-K tile partitioner (#2708) * initial commit for skeleton code * replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code * fixed up code to match CK Tile convention: data type changes, naming changes, etc. * change for num_sk_blocks data type * formatting fix * minor fixes * moved reduction argument to template * resolved comments from PR review: standardizing naming, pruning unneeded code * resolve errors from merge of device op PR: moved enum to common file * switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems * unsigned type fix * add const qualifier * added documentation for template parameters * documentation edit	2025-09-03 13:38:17 -07:00
msaffari-amd	47d020a993	refactor: use snake_case naming in ck_tile/core components (#2766)	2025-09-03 09:34:11 +02:00
rahjain-amd	4d041837ad	Add json dump support to output details from CK/CKTile Examples. (#2551) * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding json in 03_gemm * Add json dump to 16_batched_gemm * Add json dump to gemm_multi_d_fp16 * Add json dump to grouped_gemm * fix fmha_bwd/fwd * Fix clang-format errors exclude include/rapidjson in jenkins as it's a third-party library * Separating function and definition. * Update Documentation of 03_gemm * Refactoring as per code review * Disable fp8 instances on unsupported targets (#2592) * Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt * Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt * Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt * Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt --------- Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> * fix clang format * remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt * Fixing Readme and unifying jsondumps * adding moe_smoothquant * adding fused_moe * Fixing Readme for batched_gemm * Fixing Readme for grouped_gemm * adding flatmm * adding gemm_multi_d_fp16 * adding elementwise * adding File name when json is dumped * Fixing Reduce after merge * adding batched_transpose * Adding Warptile in Gemm * Fixing Clang Format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-09-02 23:31:29 -07:00
Cong Ma	e1ab460d2d	[CK TILE GEMM] Fix building issues (#2772) - Add `WarpGemmMfma_f32_16x16x128_[fp8|bf8]_[fp8|bf8]_CTransposed` - Replace `__gfx950__` with `CK_GFX950_SUPPORT`	2025-09-02 22:40:18 -07:00
linqunAMD	00fd72b2d4	Fix a typo in intrin_wmma_bf16_16x16x16_bf16_w32 (#2727) __builtin_amdgcn_wmma_bf16_16x16x16_bf16_w32 is only available in gfx11.	2025-09-03 08:07:09 +08:00
Po Yen Chen	9f35cde374	[CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765) * Fix Default2DEpilogue usage * Fix Default2DEpilogue usage for batch_prefill	2025-09-02 09:51:56 -07:00
Sami Remes	4419fc34a2	Fix formatting problem (#2768)	2025-09-02 14:14:10 +03:00
Michael Mcminn	022f369deb	Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGem… (#2751) * Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16 * Adding support for offload target gfx9-4-generic * This duplication here isn't ideal	2025-09-02 10:35:07 +02:00
Haocong WANG	33418b201f	Fix naming issue (#2762)	2025-09-02 11:18:53 +08:00
Po Yen Chen	d876e87fe4	[CK_TILE] Add FAv3 fwd pipeline (#2731) * Add FAv3 fwd pipeline * Unpack v_pk_mul to hide v_mov * Avoid compiler moving l compute across phase * Sync sched_group_barrier() setting for masking cases	2025-09-01 09:16:45 +08:00
Aviral Goel	fcff0043ae	chore(gemm): clang format to pass CI (#2758)	2025-08-29 00:38:46 -07:00
Vijay Krish	4208e28988	ck_tile kernel for gemm with groupwise quantized B tensor. (#2663) * This change introduces new pipelines with Intrawave scheduler and block gemm primitives that load the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats: fp8, fp8 -> f32; bf8, bf8 -> f32; fp8, i4 -> f32; bf8, i4 -> f32. Group size can go down to as low as K length of underlying WarpGemm primitive. * Solve merge conflict * [CK TILE] Update CHANGELOG.md --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Cong Ma <congma13@amd.com>	2025-08-28 23:43:02 -07:00
Cong Ma	428090f749	Support transposed C tile in Aquant (#2679) The performance of Aquant has increased after enabling transposed C. AQ elements no longer need to be exchanged among lanes after enabling transposed C, as one thread only holds data from one row.	2025-08-28 13:28:09 -07:00
Mateusz Ozga	0758883fa4	[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752) * Init commit * Quick fix, CI fails * Remove CDElementWise * Add CDEELementWise --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-28 12:45:50 -07:00
asleepzzz	038ea82315	Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757) This reverts commit `ead4447b20`.	2025-08-28 22:50:42 +08:00
linqunAMD	4a49dac7c6	[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728) * fix copy basic build error * fix other ck tile test build error	2025-08-28 20:30:30 +08:00
Yi DING	ead4447b20	[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741) * 16x192 * Use buffer_load_lds for lse/d * Dispatch & cleanup * Avoid zeroing dq & fix * fix	2025-08-28 18:54:18 +08:00
Linjun-AMD	bf7b458e6e	use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711) * add k_lds padding and iglp to improve dim256 fmha fwd * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp * clang format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * use same naming style --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-28 11:39:39 +08:00
Aviral Goel	f5f795c4d6	feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691) * feat(check_err): add a variable to adjust number of incorrect values to print * feat(host_tensor): add printing capability for fp8 bf8 int8 int4 * fix(gemm_utils): update acceptable data type * fix(host_tensor): print both 4 bit ints in pk_int4_t * refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp * feat(host_tensor): add print first n elements functions	2025-08-27 18:17:24 -07:00
Cong Ma	cd53e2e57e	[CK TILE GEMM] Fix a merge conflict (#2753) * Fixed a merge conflict in `245467f3` * Format the code	2025-08-27 11:08:09 -07:00
Bartłomiej Kocot	cfe5e448db	Fix splitk autodeduce for grouped conv bwd weight (#2742)	2025-08-27 12:35:42 +02:00


1035 Commits