lalala-sh
2d0b5aba13
enable top-k weights in moe stage1 gemm ( #2094 )
...
* add switch for mul topk weights
* fix bf16/f16 bugs
* complete
[ROCm/composable_kernel commit: bcf5bb41be ]
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
6d0890b6f4
MX GEMM - Parameterized Test Template ( #2088 )
...
* Tests for MX FP8 GEMM
* Improve documentation
[ROCm/composable_kernel commit: 213b203a3c ]
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
348760d56e
MX GEMM - Add MX BF8 example ( #2071 )
...
* Add MX GEMM example for MX BF8
* Verified MX FP8 with 16x16x128 scale builtin
* Verify MX BF8 GEMM with BF16 output
[ROCm/composable_kernel commit: da54464cce ]
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
4ec293cb4b
[flatmm] implement basic fp16 flatmm ( #2089 )
...
* [flatmm] implement basic fp16 flatmm
* fix CI build fail
---------
Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
[ROCm/composable_kernel commit: eaf1f0bf3b ]
2025-04-16 16:51:17 +08:00
Andriy Roshchenko
d9c9f17c3d
MX GEMM - New GEMM pipeline for MX data types ( #2059 )
...
* Allow selection of mfma_scale instructions
* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order
* Add constexpr and synchronize return type for `get_exponent_value`
* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`
* Add support for microscaling instructions in `XdlopsGemm`
* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper
* Remove software implementation of MX GEMM
* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction
* Update README
* Updated CHANGELOG
* Remove unused static methods
[ROCm/composable_kernel commit: 7106976a72 ]
2025-04-15 17:17:07 -06:00
Mingtao Gu
e8db9f0220
CK pk_i4_t test failures fix (SWDEV-518629) ( #2075 )
...
* fix pk_i4_v3 test failures in Ubuntu env.
* fix pk_i4_t test failures on Ubuntu.
* some fixes.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
[ROCm/composable_kernel commit: 56378f810f ]
2025-04-14 16:58:57 +08:00
Thomas Ning
1b61d3a0ed
Solve the Static Encoding Pattern compile error when the tile size is too small ( #2079 )
...
[ROCm/composable_kernel commit: 269f4f6af5 ]
2025-04-13 20:09:30 -07:00
Illia Silin
90612d0e37
Fix build issues for multiple targets. ( #2077 )
...
* build for multiple targets on gfx942
* add missing ignore statements
[ROCm/composable_kernel commit: 0d4f145078 ]
2025-04-11 12:12:53 -07:00
jakpiase
d76ebf9795
[CK_TILE] Add 2:4 structured sparsity support for fp16 gemm ( #1957 )
...
* add structured sparsity fp16 support for gemm
* added reviewer suggestions
* update changelog
* update changelog
* add reviewers suggestions
* Minor fix
* clang fix
* fix doxygen
[ROCm/composable_kernel commit: 6c61f4d237 ]
2025-04-11 12:18:26 +02:00
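PR #1957 above adds 2:4 structured sparsity, where each group of four elements keeps at most two non-zeros. A minimal host-side sketch of compressing one group (hypothetical helper names, not the CK_TILE kernel code):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical illustration of 2:4 structured sparsity: each group of four
// values keeps only its two largest-magnitude elements plus their 2-bit
// positions. Host-side sketch only; the real kernels operate on packed
// device-side metadata.
struct Compressed2of4
{
    std::array<float, 2> values;  // the two kept elements
    std::array<uint8_t, 2> index; // their positions within the group (0..3)
};

inline Compressed2of4 compress_group(const std::array<float, 4>& g)
{
    // Select the two largest-magnitude positions.
    uint8_t first = 0, second = 1;
    if(std::abs(g[second]) > std::abs(g[first]))
        std::swap(first, second);
    for(uint8_t i = 2; i < 4; ++i)
    {
        if(std::abs(g[i]) > std::abs(g[first]))
        {
            second = first;
            first  = i;
        }
        else if(std::abs(g[i]) > std::abs(g[second]))
        {
            second = i;
        }
    }
    if(first > second)
        std::swap(first, second); // keep indices in ascending order
    return {{g[first], g[second]}, {first, second}};
}
```

The GEMM then reads only the compressed operand plus the index metadata, halving memory traffic for the sparse matrix.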
slippedJim
959225947a
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
[ROCm/composable_kernel commit: 5f885d2b7a ]
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
7a42b06988
Replace inline assembly with builtins in FMHA ( #2067 )
...
* Replace inline assembly with builtins in FMHA
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: f14e648e7c ]
2025-04-10 09:48:37 +02:00
Illia Silin
7546e4bafe
enable gfx115x support ( #2065 )
...
[ROCm/composable_kernel commit: 3e6d21adeb ]
2025-04-09 10:06:42 -07:00
MHYang-gh
62ce5b906b
Make buffer coherence configurable in tensor view ( #2041 )
...
* Make buffer coherence configurable in tensor view
* Fix clang-format for tensor_view.hpp
[ROCm/composable_kernel commit: 03ce8729fd ]
2025-04-08 15:34:11 -07:00
valarLip
c1d067be5c
add passthrough for int32->float32 ( #2062 )
...
[ROCm/composable_kernel commit: 2c563fecf7 ]
2025-04-08 15:16:30 -07:00
Max Podkorytov
26724086f3
simplify generate_tuple ( #2043 )
...
[ROCm/composable_kernel commit: 6ce0797dad ]
2025-04-08 09:00:51 -07:00
aledudek
6dbaeb5fe8
[CK_TILE] Fix GEMM Memory Pipeline ( #2034 )
...
* [CK_TILE] Fix GEMM Memory Pipeline
* Fix transpose tile
* Add comments
[ROCm/composable_kernel commit: 80aae6119b ]
2025-04-08 12:40:04 +02:00
Illia Silin
32879114dc
fix codegen issues ( #2052 )
...
[ROCm/composable_kernel commit: 1793228422 ]
2025-04-07 07:08:39 -07:00
Illia Silin
ada1b5f341
Split env.hpp header from the ck.hpp header. ( #2049 )
...
* split env.hpp out of main headers
* fix namespace logic
[ROCm/composable_kernel commit: 572cd820ce ]
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
7fbc128e83
Add FP16/BF16<->FP8/BF8 conversions ( #2035 )
...
* Move conversion functions and add missing conversions
* Add tests
* Add missing conversions
* Add missing conversions
* Add bf8 tests
* Update clipping for vectors
* Add missing conversions
* Add bf16 fp8 tests
* Add bf16 bf8 tests
* Fix device conversion
* Fix conversions
* Fix vector use
* Minor fix
* Add a workaround flag
* Add a workaround flag for bf16 conversion
* Add another workaround
* Add a workaround for fp16 to bf8 conversion
* Update type alias
* Add docstrings and missing wrappers
* Fix if defined macros
* Fix more if defined macros
* Add comments
* Remove __host__ specifier
* Add a gfx950 guard
* Update function naming
[ROCm/composable_kernel commit: 265af71a71 ]
2025-04-03 12:42:03 -05:00
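The clipping bullets in #2035 imply a saturation step before narrowing to 8 bits; a sketch under the assumption of OCP E4M3 (largest finite value 448) — the helper name is illustrative, not CK's API:

```cpp
#include <algorithm>
#include <cassert>

// Illustrative saturation before an FP8 (E4M3) downcast: clamp to the
// finite E4M3 range so over-range FP16/BF16 inputs map to +/-448 rather
// than overflowing. Not the CK implementation.
constexpr float e4m3_max = 448.0f;

inline float saturate_to_e4m3_range(float x)
{
    return std::min(std::max(x, -e4m3_max), e4m3_max);
}
```

Vector variants apply the same clamp elementwise before the packed conversion intrinsic.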
aledudek
b7359bcfac
Post-merge changes for fully async args copy in ck grouped gemm ( #1991 )
...
* Post-merge changes for fully async args copy in ck grouped gemm
* Post-merge documentation and naming changes
* Build fix and updated changelog
* Revised comments
[ROCm/composable_kernel commit: 9329432f6c ]
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
169e3cb4f8
Add support for GKCYX grouped conv weight ( #2023 )
...
* Grouped conv bwd weight GKCYX support
* fix and changelog
* fix
* fix
* fixes
* comments
* fix
[ROCm/composable_kernel commit: 2ccf914888 ]
2025-04-02 23:59:49 +02:00
Adam Osewski
5585c3121e
Basic docs for universal gemm & ck-tile gemm. ( #2014 )
...
* Basic docs for universal gemm & ck-tile gemm.
* Apply review suggestions to include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp and include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
* Reviewers suggestions.
* Align tparam names in doc with class tparams.
* More reviewers fine tuning ;)
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
[ROCm/composable_kernel commit: e5ad48a784 ]
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
ca7ae808d4
Grouped conv backward data GKCYX support ( #2029 )
...
* Grouped conv backward data GKCYX support
* profiler
* Converter
* split instances
[ROCm/composable_kernel commit: 8c0ab61ece ]
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
67c3bcfce1
Grouped conv fwd v3 fix for SplitN and G > 1 ( #2038 )
...
* Grouped conv fwd v3 fix for SplitN and G > 1
* Remove int8 large test
* Restore int8 test
[ROCm/composable_kernel commit: ec742908bd ]
2025-04-01 13:19:35 -07:00
Seunghoon Lee
345ab65612
Fix Windows build. ( #2012 )
...
* Remove duplicate using uint64_t.
* Cast before shift.
[ROCm/composable_kernel commit: df32020f93 ]
2025-04-01 12:22:10 -07:00
Max Podkorytov
cf08db04a6
add a fast compilation path for static for (0..N) ( #2005 )
...
* add a fast compilation path for static for (0..N)
* Update functional2.hpp
add comment and put range applier into detail namespace
* Update functional.hpp
ditto for ck-tile
* prettify
* prettify more
* add comment
* clang-format
[ROCm/composable_kernel commit: c59a8bb206 ]
2025-04-01 12:06:25 -07:00
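The fast path in #2005 replaces a recursive compile-time loop; a sketch (hypothetical names, C++17, not the actual `ck::static_for` code) of expanding 0..N with a single index sequence and a fold expression:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical sketch of a non-recursive static_for(0..N): one
// index_sequence expansion and a fold expression call f once per index,
// avoiding N nested template instantiations.
template <typename F, std::size_t... Is>
constexpr void static_for_impl(F&& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t N, typename F>
constexpr void static_for_fast(F&& f)
{
    static_for_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}
```

Each call site still receives a distinct `integral_constant`, so indices remain usable in constant expressions inside the functor.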
rocking
01ea8aa249
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 8a20b62e91 ]
2025-03-28 21:58:06 +08:00
felix
20ffa0f474
hotfix fix sorting int64 ( #2025 )
...
* fix sorting int64
* clang format
* fix example issue
* update WA issue #
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
[ROCm/composable_kernel commit: a82f338fb9 ]
2025-03-28 11:31:52 +08:00
Rostyslav Geyyer
23ad59e1fd
Add MX FP4 device conversion tests ( #1889 )
...
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add printf to fix intrinsic
* Add a sw-based workaround
[ROCm/composable_kernel commit: 441343a23d ]
2025-03-26 19:23:01 -05:00
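Several bullets in #1889 deal with how two 4-bit `f4_t` codes share a byte; a sketch of one packing convention (first element in the low nibble — an assumption for illustration, not necessarily CK's ordering):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative packing of two 4-bit codes into one byte, first element in
// the low nibble. Whether the low or high nibble holds the first element
// is exactly the kind of ordering issue the "Permute packed f4_t values"
// commits above address.
inline uint8_t pack_f4x2(uint8_t lo, uint8_t hi)
{
    return static_cast<uint8_t>((lo & 0xFu) | ((hi & 0xFu) << 4));
}

inline uint8_t unpack_f4_lo(uint8_t packed) { return packed & 0xFu; }
inline uint8_t unpack_f4_hi(uint8_t packed) { return packed >> 4; }
```

A mismatch between the pack and unpack nibble order silently swaps adjacent elements, which is why the tests permute vectors until both sides agree.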
Bartłomiej Kocot
6ccfb817e4
Add support for GKCYX grouped conv fwd ( #2015 )
...
* Add support for GKCYX grouped conv fwd
* fixes
* fix
* changelog
* Fixes
[ROCm/composable_kernel commit: 54c81a1fcf ]
2025-03-26 21:13:38 +01:00
Andriy Roshchenko
75ef4c83bf
MX GEMM examples with FP8, FP16, and E8M0 scales ( #2016 )
...
* Add `scalar_type` specification for E8M0 exponent
* Specialize `nnvb_data_t_selector` for E8M0 exponent
* Remove partial specializations for `scalar_type` of `non_native_vector_base` template
* Reword command line helper string
* Create MX GEMM examples for different scales
[ROCm/composable_kernel commit: 72d888821c ]
2025-03-25 15:33:03 -06:00
Max Podkorytov
58789d03d3
use fast path for sequence generation in old CK ( #1993 )
...
[ROCm/composable_kernel commit: 1a58522f01 ]
2025-03-25 11:28:44 -07:00
ruanjm
ce1d20c2c6
[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance ( #1861 )
...
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug fix
[ROCm/composable_kernel commit: d49abdaa87 ]
2025-03-25 20:09:45 +08:00
Illia Silin
b9e0e7d93e
Split up data_type header. ( #1996 )
...
* split fp64 vector data type
* add missing header
* move e8m0 structs
* split off numeric_utils header
* fix typo
* split off numeric limits header
* update data_type header
* fix clang format
* split off vector type header
* fix clang format
* fix typo for binary_inf
[ROCm/composable_kernel commit: d2eab23958 ]
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
bbdd7f6d57
Introduce MX GEMM for FP8 data type ( #2000 )
...
[ROCm/composable_kernel commit: 6660dc6b8e ]
2025-03-24 15:41:07 -06:00
MHYang-gh
fd151c05d9
Fix A/B lds transform ( #2007 )
...
[ROCm/composable_kernel commit: c027637a8f ]
2025-03-22 23:13:50 -07:00
Bartłomiej Kocot
ceb078163f
Fix split N for large images in grouped conv fwd ( #2004 )
...
* Fix split N for large images in grouped conv fwd
* Fix comments
[ROCm/composable_kernel commit: 5b0873c31a ]
2025-03-22 23:19:49 +01:00
BingYuan.Zhou
c245d569d5
fix ck_tile/basic_gemm build error ( #1988 )
...
[ROCm/composable_kernel commit: 5a0d693b86 ]
2025-03-20 22:01:14 -07:00
Attila T. Áfra
081e3c7880
Fix compile errors on Windows and Linux ( #2002 )
...
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)
* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')
[ROCm/composable_kernel commit: c79bf11148 ]
2025-03-20 12:37:25 -07:00
carlushuang
23340c5dd5
[CK_TILE] return value with macro in ck_tile::kernel_launch API ( #1982 )
...
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992 )
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
[ROCm/composable_kernel commit: e3c9886cdf ]
2025-03-20 11:00:29 -07:00
jakpiase
f1262b783a
[CK_TILE] Switch to universal gemm for batched and grouped gemms ( #1919 )
...
* switch to universal gemm for batched and grouped gemms
* added reviewer comments
* fixed grouped gemm tests
[ROCm/composable_kernel commit: 0e91d32c61 ]
2025-03-20 11:17:04 +01:00
rocking
b0f323c4ec
Sync the kname with instance name ( #1989 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: b819c217e4 ]
2025-03-20 00:06:45 +08:00
felix
c2948a0634
Ck moe hot fix ( #1979 )
...
* fix useless code and remove useless oob
* clang format
* fix coredump in e2e test
* fix2
* fix clang format
* fix output oob
* clang format
* rm useless comments
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: 7eaedeb36c ]
2025-03-19 22:58:27 +08:00
aledudek
73d207bd4e
Async grouped gemm v3 ( #1940 )
...
* Fully async grouped gemm
* Remove commented code
* Remove maybe_unused
* host kernel args
* Checkpoint segfault debugging...
* Working part1
* Working part2
* Remove comments...
* Use void ptr for gemm kernel host args
* Fix device_grouped_gemm_multiple_d_dl build issue
* Fix device_grouped_gemm_xdl build issue
[ROCm/composable_kernel commit: 5095906975 ]
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
b8f58a234e
Grouped conv bwd data NGCHW ( #1967 )
...
* Grouped conv bwd data NGCHW
* fixes
* fix
* Improvements
* Fix
* Fix
* add client example
[ROCm/composable_kernel commit: c2e4898b4b ]
2025-03-17 13:32:00 +01:00
carlushuang
f2dd57b76f
Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" … ( #1971 )
...
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 )
This reverts commit b92caa3d84.
* fix codegen problem
* Update config.hpp
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: 3e81279d26 ]
2025-03-13 11:41:39 +08:00
feli
e3c5b2ae80
ck_moe: fix useless code and remove useless oob ( #1972 )
...
* fix useless code and remove useless oob
* clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
[ROCm/composable_kernel commit: 251afab3b7 ]
2025-03-12 09:22:42 -07:00
Illia Silin
a7614ad594
use old intrinsics with staging compiler ( #1970 )
...
[ROCm/composable_kernel commit: 4c97cc511e ]
2025-03-12 07:29:09 -07:00
Illia Silin
b92caa3d84
Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" ( #1969 )
...
This reverts commit 45fbd9210a.
[ROCm/composable_kernel commit: 8cbcd3e0d0 ]
2025-03-11 10:40:18 -07:00
Haocong WANG
1ed0b74c43
[Block Scale GEMM] Optimized block scale gemm ( #1950 )
...
* Added two kernel for M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2. v1 works for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented codes.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads when building libs for all targets to 50
* use ninja to build for all targets
* reduce the number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets to 50
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: cbd74c2d12 ]
2025-03-11 10:11:21 -07:00