composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 12:17:00 +00:00

Author	SHA1	Message	Date
Ding, Yi	85ac576109	Merge gemm_mx_common.hpp	2025-05-27 06:13:03 +00:00
Ding, Yi	123053b685	Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4	2025-05-27 03:36:38 +00:00
Ding, Yi	91eb136937	Fix v1; use M padding	2025-05-26 10:32:26 +00:00
Andriy Roshchenko	f03da29b65	Merge branch origin/wip-f4 into andriy/wip-f4	2025-05-23 22:14:30 +00:00
aska-0096	574d65efed	temp save	2025-05-23 14:51:24 +00:00
aska-0096	7f7c4d35c7	lds conflict free + buffer load lds	2025-05-22 08:04:52 +00:00
Andriy Roshchenko	e302ab8f0c	Merge branch origin/develop into wip-fp4	2025-05-22 06:31:47 +00:00
Ding, Yi	352542c49e	Better kernel selection in device classes	2025-05-22 06:05:10 +00:00
aska-0096	e1084fe7d6	tempsave. compile pass, function wrong	2025-05-20 10:57:26 +00:00
aska-0096	f3a296bad4	lds conflict free + buffer load lds	2025-05-19 09:40:39 +00:00
aska-0096	e2c8f98fef	generalize the pipeline scheduling.	2025-05-19 02:29:02 +00:00
aska-0096	3e8b07ef58	tempsave; modify the way we represent fp4	2025-05-19 02:28:23 +00:00
arai713	5b3430b868	Narrowing error fix for codegen compilation (#2194 ) * removed comment with special characters * fix for arg/template change after merge from develop --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-05-16 11:11:54 -07:00
aska-0096	248e287866	generalize the pipeline scheduling.	2025-05-16 10:41:59 +00:00
aska-0096	a0379d81e7	modify the way we represent fp4	2025-05-16 09:44:04 +00:00
Mateusz Ozga	fa3c6811d8	Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186 )	2025-05-16 10:18:47 +02:00
aska-0096	a1bec7670a	tempsave	2025-05-16 08:14:56 +00:00
Ding, Yi	c04d44b5f6	Merge remote-tracking branch 'origin/develop' into wip-f4	2025-05-16 07:11:26 +00:00
Bartłomiej Kocot	c53b7bd22e	Switch to v2 pipeline for grouped conv bwd data (#2181 ) * Change to old pipeline for grouped conv bwd data * fix * fix * fix * fix * fix * fix * Fix	2025-05-13 10:14:30 +02:00
aska-0096	79246e6cb8	function pass with inline asm hacky	2025-05-12 16:54:44 +00:00
Ding, Yi	4b19b934e8	fix fp8; fix even/odd	2025-05-12 07:31:28 +00:00
aska-0096	41ea1066ac	implement shuffled scale mxfp4gemm, blocker: opsel not effect	2025-05-11 05:54:13 +00:00
aska-0096	6c761bf9b8	tempsave; buggy at passed 4 e8m0 to scaled mfma	2025-05-10 09:57:49 +00:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179 ) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
aska-0096	087b20dc1d	clang format	2025-05-09 16:15:10 +00:00
jefyang1	6b1a339b6f	Fix grouped conv bwd data tests on gfx950 (#2173 )	2025-05-09 09:01:06 -07:00
aska-0096	86bff9c46e	Fix pipe v3 correctness issue	2025-05-09 15:54:43 +00:00
aska-0096	7bde4b8d34	Add pipeline v3. Have some runtime issue and register spill	2025-05-09 09:47:22 +00:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171 ) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Bartłomiej Kocot	4094ad158a	Integrate universal gemm with conv bwd data and add SplitK (#1315 ) * Integrate universal gemm with conv bwd data * Fix multi d kernel * Add splitK support * instances refactor * instances refactor * refactor * fixeS * fixes * 16x16 instnaces * Fixes * Fix * Fix * Fix * Fix * Fix * Fixes * fix * fix	2025-04-28 23:54:49 +02:00
Anton Gorenko	edd92fc546	DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096 ) * Prepare files for DeviceGemm_Wmma_CShuffleV3 * Implement main part of CShuffleV3 with block pipeline v3 for WMMA * Remove unused functions and template params for A/B descriptors * Support both gfx11 and gfx12 * Enable SplitK for gfx12 and disable for gfx11 * Added RowColRow layout for DeviceGemmV2 fp16 * Added more instances for Row, Col, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout * Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout * Fix formatting * Add documentation Based on `e5ad48a784` * Enable gemm_universal profiling for gfx11/12 * Add WMMA intrinsics for F8/BF8 * Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances * Add BF16 instances and tests * Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8 --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com>	2025-04-28 10:14:21 +05:00
lalala-sh	39ba03f25d	Moe gemm activation (#2026 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * impl int64 but result not correct * int64 index ok now * input output all ok * fix uint32 * revert v1 test * use uint32 * mork to support 13w tokens * moe sorting fix moebuf * fix merge * update moe api fix aiter build * fix buid * fuse silu * silu ok * acale ok * add silu * change code * gemm2 ok * gufusion compatible ok, fix warnings * gu fusion for m32 m64 ok * support bf16 cshuffle * i4 gemm2 ok * i4 gemm2 ok and i4 gemm1 build * 16x16 run ok * change flops; change cshuffle dtype * fuse gelu silu act in moe gemm1 * fp8 with act ready * int4 act ready * remove useless changes * remove useless code change * fix clang format * add the arch limit of int4 moe gemm * fuse moe activation * fix fp8 16x16 * fix no quant case * fix bugs * fix fp8 gufusion bug * remove useless comments * refine activation code & complete moe example * fix int8 bugs * merge tkw1 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: root <root@hjbog-srdc-51.amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-04-23 10:35:34 +08:00
lalala-sh	bcf5bb41be	enable do top k weights in moe stage1 gemm (#2094 ) * add switch for mul topk weights * fix bf16/f16 bugs * complete	2025-04-18 10:45:49 +08:00
Andriy Roshchenko	213b203a3c	MX GEMM - Parameterized Test Template (#2088 ) * Tests for MX FP8 GEMM * Improve documentation	2025-04-16 19:56:00 -06:00
Andriy Roshchenko	da54464cce	MX GEMM - Add MX BF8 example (#2071 ) * Add MX GEMM example for MX BF8 * Verified MX FP8 with 16x16x128 scale builtin * Verify MX BF8 GEMM with BF16 output	2025-04-16 15:25:02 -06:00
Andriy Roshchenko	7106976a72	MX GEMM - New GEMM pipeline for MX data types (#2059 ) * Allow selection of mfma_scale instructions * Read B tensor from LDS to VGPR in chunks of 16 in MFMA order * Add constexpr and synchronize return type for `get_exponent_value` * Pass scales by reference and add comments to `mfma_scale_f32_32x32x64` * Add support for microscaling instructions in `XdlopsGemm` * Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper * Remove software implementation of MX GEMM * Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction * Update README * Updated CHANGELOG * Remove unused static methods	2025-04-15 17:17:07 -06:00
Mingtao Gu	56378f810f	CK pk_i4_t test failures fix (SWDEV-518629) (#2075 ) * fix pk_i4_v3 tests failures in Unbuntu env. * fix pk_i4_t tests failure on Unbuntu issues. * some fixed. --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-04-14 16:58:57 +08:00
Illia Silin	0d4f145078	Fix build issues for multiple targets. (#2077 ) * build for multiple targets on gfx942 * add missing ignore statements	2025-04-11 12:12:53 -07:00
Illia Silin	572cd820ce	Split env.hpp header from the ck.hpp header. (#2049 ) * split env.hpp out of main headers * fix namespace logic	2025-04-03 15:30:21 -07:00
aledudek	9329432f6c	Post-merge changes for fully async args copy in ck grouped gemm (#1991 ) * Post-merge changes for fully async args copy in ck grouped gemm * Post-merge documentation and naming changes * Build fix and updated changelog * Revised comments	2025-04-03 13:35:43 +02:00
Bartłomiej Kocot	2ccf914888	Add support for GKCYX grouped conv weight (#2023 ) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix	2025-04-02 23:59:49 +02:00
Adam Osewski	e5ad48a784	Basic docs for universal gemm & ck-tile gemm. (#2014 ) * Basic docs for universal gemm & ck-tile gemm. * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Reviewers suggestions. * Align tparam names in doc with class tparams. * More reviewers fine tuning ;) --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-04-02 11:03:40 +02:00
Bartłomiej Kocot	8c0ab61ece	Grouped conv backward data GKCYX support (#2029 ) * Grouped conv backward data GKCYX support * profiler * Converter * split instances	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	ec742908bd	Grouped conv fwd v3 fix for SplitN an G > 1 (#2038 ) * Grouped conv fwd v3 fix for SplitN an G > 1 * Remove int8 large test * Retore int8 test	2025-04-01 13:19:35 -07:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
Andriy Roshchenko	6660dc6b8e	Introduce MX GEMM for FP8 data type (#2000 )	2025-03-24 15:41:07 -06:00
aledudek	5095906975	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00
Illia Silin	4c97cc511e	use old instrinsics with staging compiler (#1970 )	2025-03-12 07:29:09 -07:00
Haocong WANG	cbd74c2d12	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-11 10:11:21 -07:00

302 Commits