composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Wen.Yang	0264cc4c55	dQ NaN fix and hd192 group mode	2025-07-10 11:21:56 +08:00
ipanfilo	389005af64	Remove usage of 'warpSize' variable as it has been deprecated (#2295) (#2484) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>	2025-07-10 08:52:49 +08:00
yaomingamd	4f187e1009	Merge pull request #2426 from yaomingamd/ck_tile/fa_bwd_v3 remove restriction of group mode hd192 no lse( cherry-pick `57f49745` to support deepseek MLA) (#2252)	2025-07-01 15:30:24 -05:00
slippedJim	2639a6bc62	remove restriction of group mode hd192 no lse (#2252) Co-authored-by: Jim <jimguo12@amd.com>	2025-06-30 16:20:39 +00:00
Ye Wang	5d246c0b3c	fix the buffer intrinsic names for clang >=20 (cherry-pick `8146e47`) (#2412) Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>	2025-06-27 12:23:38 +08:00
amd-ruitang3	efdf31e26f	revert incorrect operations in bwd generation	2025-06-10 08:25:07 +00:00
Wen.Yang	160788cdf4	update benchmark script	2025-05-29 15:17:41 +08:00
Wen.Yang	efaa64e492	update bwd v3 kernels for timing error fixed	2025-05-29 15:11:16 +08:00
slippedJim	e4f0c4a549	[WIP] enable hd128 swa (#2137) * enable hd128 swa	2025-05-08 21:00:37 +08:00
slippedJim	d0028193fc	[WIP] update hd64 group hex & explicit hd128 in kernel name (#2118) * remove all pssk kernels * update: update pssk kernels * update * update * fix kernel name * fix: update initial version hex * add shell scripts of smoke test and benchmark test for group mode * fix typo --------- Co-authored-by: Wen.Yang <Wen.Yang@example.com>	2025-04-24 10:16:13 +08:00
slippedJim	2710b61b78	enable causal/no_causal case (#2111)	2025-04-22 09:11:27 +08:00
wen-des	a115ab14ed	group mode attetnion for generic 64<hd<=128 (#2092) * group mode attetnion for generic 64<hd<=128 * fix loop_idx overridden before kvBase in causal mode * align format --------- Co-authored-by: Wen.Yang <Wen.Yang@example.com> Co-authored-by: danyao12 <danyao12@amd.com>	2025-04-20 11:11:11 +08:00
danyao12	20a250440a	atomic_add exec	2025-04-14 17:01:08 +08:00
danyao12	cf964e9982	fix group seqlen_k >> seqlen_q/causal bug	2025-04-12 12:10:21 +08:00
danyao12	9b7c18bec5	fix seqlen_k >> seqlen_q/causal bug	2025-04-12 10:31:04 +08:00
danyao12	03b75b3f97	hd192 causal temp version	2025-04-11 15:22:32 +08:00
danyao12	4c07813b46	hd192 bf16 rtne&rtz	2025-04-08 15:21:13 +08:00
danyao12	7949a5adf8	hd192 temp version	2025-04-05 14:50:43 +08:00
danyao12	ab5f40cfaf	enable SBHD	2025-03-24 17:58:01 +08:00
danyao12	bbb526b9d2	support for more layouts	2025-03-24 15:10:33 +08:00
danyao12	5e0ebbe8b6	fix hd64 group mode causal api	2025-03-23 12:43:25 +08:00
danyao12	917f7e55f8	Merge branch 'develop' into ck_tile/fa_bwd_v3	2025-03-22 20:38:04 +08:00
danyao12	6758b77802	fix random memory access fault	2025-03-22 20:03:28 +08:00
carlushuang	6c08c5c46d	add mask support in hdim=192/128 (#1999)	2025-03-21 18:28:43 +08:00
BingYuan.Zhou	5a0d693b86	fix ck_tile/basic_gemm build error (#1988)	2025-03-20 22:01:14 -07:00
wen-des	8a25aa2669	support group mode for hd=64 of fa bwd v3 (#1990) * support group mode for hd=64 of fa bwd v3 * bugfixed for causal mask kernels when using kernel balence * tiny align --------- Co-authored-by: Wen.Yang <Wen.Yang@example.com> Co-authored-by: danyao12 <danyao12@amd.com>	2025-03-21 11:42:58 +08:00
felix	902dbe89ad	change cmake (#2006) Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-20 19:25:11 -07:00
Attila T. Áfra	c79bf11148	Fix compile errors on Windows and Linux (#2002) * Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous) * Fix compile error (no matching function for call to 'cast_to_f32_from_f8')	2025-03-20 12:37:25 -07:00
carlushuang	e3c9886cdf	[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982) * return value with macro and revert the return value * [CK-TILE] no-macro launch api solution (#1992) * no-macro solution * address -Wcomma --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-03-20 11:00:29 -07:00
jakpiase	0e91d32c61	[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919) * switch to universal gemm for batched and grouped gemms * added reviewer comments * fixed grouped gemm tests	2025-03-20 11:17:04 +01:00
rocking	b819c217e4	Sync the kname with instance name (#1989) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-20 00:06:45 +08:00
felix	7eaedeb36c	Ck moe hot fix (#1979) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * clang format * rm useless comments --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-03-19 22:58:27 +08:00
Bartłomiej Kocot	fdaff5603e	Add grouped conv bwd wei merged grouped instance for larger filter (#1984) * Add grouped conv bwd wei merged grouped instance for larger filter * Update readme	2025-03-18 16:16:24 +01:00
danyao12	e80ff1acbb	tiny fix	2025-03-18 12:08:34 +08:00
Illia Silin	1342ecf7fb	Add a daily CI build on gfx908. (#1987) * add one daily ci build on gfx908 * add redis invocation tag for gfx908 * make ci build for gfx908 conditional * fix groovy logic * add option to run perf tests for gfx908 * disable a few tests on mi100	2025-03-17 18:08:53 -07:00
Illia Silin	07f25186b2	disable ck_tile basic gemm (#1986)	2025-03-17 15:26:43 -07:00
aledudek	5095906975	Async grouped gemm v3 (#1940) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00
valarLip	52b1cd7780	hotfix fmoe build issue (#1976)	2025-03-13 15:11:59 +08:00
dependabot[bot]	de7a745ca6	Bump rocm-docs-core from 1.17.1 to 1.18.1 in /docs/sphinx (#1977) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.1...v1.18.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-03-12 23:36:36 -07:00
carlushuang	3e81279d26	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `8cbcd3e0d0`. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-13 11:41:39 +08:00
Illia Silin	d4a6d69643	disable tests that take too long to build for gfx90a (#1975)	2025-03-12 17:54:03 -07:00
feli	251afab3b7	ck_moe: fix useless code and remove usless oob (#1972) * fix useless code and remove usless oob * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-12 09:22:42 -07:00
Illia Silin	4c97cc511e	use old instrinsics with staging compiler (#1970)	2025-03-12 07:29:09 -07:00
Illia Silin	8cbcd3e0d0	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `7a93b16ff6`.	2025-03-11 10:40:18 -07:00
Haocong WANG	cbd74c2d12	[Block Scale GEMM] Optimized block scale gemm (#1950) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-11 10:11:21 -07:00
Haocong WANG	ba209b9dab	reduce test size to avoid timeout on specific silicon (#1966)	2025-03-11 09:15:26 -07:00
Illia Silin	aa42c3db06	disable example_moe_gemm2_xdl_pk_i4 on gfx950 (#1968)	2025-03-11 08:34:47 -07:00
carlushuang	7a93b16ff6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961) * support hdim=192/128 pair * remove useless print * update	2025-03-11 21:07:40 +08:00
wenchenvincent	489602f9a8	Enabled bwd support for hdim_qk != hdim_v for TE integration. (#1965)	2025-03-11 11:35:27 +08:00


1838 Commits