composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-06 15:54:31 +00:00

Author	SHA1	Message	Date
lalala-sh	def952a178	use mem_op::set when topk=1	2025-05-08 09:49:16 +00:00
lalala-sh	960b2bce1c	update	2025-05-08 09:48:23 +00:00
lalala-sh	0ab978584d	fix bugs	2025-05-06 07:36:59 +00:00
lalala-sh	9c06c3817a	[fix] align v3 gufusion pipeline	2025-04-30 02:27:39 +00:00
lalala-sh	b8427b812e	align v3 gufusion pipeline	2025-04-30 01:35:49 +00:00
aska-0096	bc9c819aa4	generalized bpreshuffle pipeline optimization	2025-04-27 11:50:30 +00:00
aska-0096	49338edb1b	tempsave	2025-04-27 08:05:20 +00:00
aska-0096	1637dd5297	Merge branch 'swdev_528812_moe' of https://github.com/ROCm/composable_kernel into swdev_528812	2025-04-25 05:14:01 +00:00
aska-0096	946a2119cd	temp save	2025-04-25 05:12:47 +00:00
coderfeli	f9c29b5ec7	set 16x16	2025-04-25 03:09:53 +00:00
coderfeli	cd7955ce8b	Merge branch 'swdev_528812' into dev/moe_opt	2025-04-25 02:51:33 +00:00
coderfeli	ddb5f36eeb	add missing file	2025-04-24 11:10:31 +00:00
coderfeli	ceaa5a984b	gu fusion v3	2025-04-24 11:05:08 +00:00
coderfeli	2054e165bc	fix moe pipeline and change to compute tile	2025-04-24 06:26:12 +00:00
aska-0096	d6e2dd92fe	enable f8 new mfma for preshuffle gemm. found some vectorizer issue even with slp flag	2025-04-23 15:53:15 +00:00
lalala-sh	39ba03f25d	Moe gemm activation (#2026 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * impl int64 but result not correct * int64 index ok now * input output all ok * fix uint32 * revert v1 test * use uint32 * mork to support 13w tokens * moe sorting fix moebuf * fix merge * update moe api fix aiter build * fix buid * fuse silu * silu ok * acale ok * add silu * change code * gemm2 ok * gufusion compatible ok, fix warnings * gu fusion for m32 m64 ok * support bf16 cshuffle * i4 gemm2 ok * i4 gemm2 ok and i4 gemm1 build * 16x16 run ok * change flops; change cshuffle dtype * fuse gelu silu act in moe gemm1 * fp8 with act ready * int4 act ready * remove useless changes * remove useless code change * fix clang format * add the arch limit of int4 moe gemm * fuse moe activation * fix fp8 16x16 * fix no quant case * fix bugs * fix fp8 gufusion bug * remove useless comments * refine activation code & complete moe example * fix int8 bugs * merge tkw1 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: root <root@hjbog-srdc-51.amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-04-23 10:35:34 +08:00
Rostyslav Geyyer	416e851584	Temporarily disable MX FP4 device tests (#2112 )	2025-04-22 16:08:48 -05:00
aska-0096	5366d3415b	f8 mfma issue	2025-04-22 10:59:03 +00:00
Thomas Ning	0cca8fa28f	GEMM Multiply Multiply Fix (#2102 ) * fix the type convert and increase the BF16 conversion + the profile comment * fix the CI	2025-04-22 01:13:22 -07:00
Muhammed Emin Ozturk	b092c18da7	MI308 fix for streamk 1-Tile floating point exception (#2101 )	2025-04-21 11:44:07 -07:00
lalala-sh	bcf5bb41be	enable do top k weights in moe stage1 gemm (#2094 ) * add switch for mul topk weights * fix bf16/f16 bugs * complete	2025-04-18 10:45:49 +08:00
Andriy Roshchenko	213b203a3c	MX GEMM - Parameterized Test Template (#2088 ) * Tests for MX FP8 GEMM * Improve documentation	2025-04-16 19:56:00 -06:00
Andriy Roshchenko	da54464cce	MX GEMM - Add MX BF8 example (#2071 ) * Add MX GEMM example for MX BF8 * Verified MX FP8 with 16x16x128 scale builtin * Verify MX BF8 GEMM with BF16 output	2025-04-16 15:25:02 -06:00
Andriy Roshchenko	7106976a72	MX GEMM - New GEMM pipeline for MX data types (#2059 ) * Allow selection of mfma_scale instructions * Read B tensor from LDS to VGPR in chunks of 16 in MFMA order * Add constexpr and synchronize return type for `get_exponent_value` * Pass scales by reference and add comments to `mfma_scale_f32_32x32x64` * Add support for microscaling instructions in `XdlopsGemm` * Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper * Remove software implementation of MX GEMM * Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction * Update README * Updated CHANGELOG * Remove unused static methods	2025-04-15 17:17:07 -06:00
Mingtao Gu	56378f810f	CK pk_i4_t test failures fix (SWDEV-518629) (#2075 ) * fix pk_i4_v3 tests failures in Unbuntu env. * fix pk_i4_t tests failure on Unbuntu issues. * some fixed. --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-04-14 16:58:57 +08:00
Illia Silin	0d4f145078	Fix build issues for multiple targets. (#2077 ) * build for multiple targets on gfx942 * add missing ignore statements	2025-04-11 12:12:53 -07:00
Illia Silin	3e6d21adeb	enable gfx115x support (#2065 )	2025-04-09 10:06:42 -07:00
valarLip	2c563fecf7	add passthrough for int32->float32 (#2062 )	2025-04-08 15:16:30 -07:00
Max Podkorytov	6ce0797dad	simplify generate_tuple (#2043 )	2025-04-08 09:00:51 -07:00
Illia Silin	1793228422	fix codegen issues (#2052 )	2025-04-07 07:08:39 -07:00
Illia Silin	572cd820ce	Split env.hpp header from the ck.hpp header. (#2049 ) * split env.hpp out of main headers * fix namespace logic	2025-04-03 15:30:21 -07:00
Rostyslav Geyyer	265af71a71	Add FP16/BF16<->FP8/BF8 conversions (#2035 ) * Move conversion functions and add missing conversions * Add tests * Add missing conversions * Add missing conversions * Add bf8 tests * Update clipping for vectors * Add missing conversions * Add bf16 fp8 tests * Add bf16 bf8 tests * Fix device conversion * Fix conversions * Fix vector use * Minor fix * Add a workaround flag * Add a workaround flag for bf16 conversion * Add another workaround * Add a workaround for fp16 to bf8 conversion * Update type alias * Add docstrings and missing wrappers * Fix if defined macros * Fix more if defined macros * Add comments * Remove __host__ specifier * Add a gfx950 guard * Update function naming	2025-04-03 12:42:03 -05:00
aledudek	9329432f6c	Post-merge changes for fully async args copy in ck grouped gemm (#1991 ) * Post-merge changes for fully async args copy in ck grouped gemm * Post-merge documentation and naming changes * Build fix and updated changelog * Revised comments	2025-04-03 13:35:43 +02:00
Bartłomiej Kocot	2ccf914888	Add support for GKCYX grouped conv weight (#2023 ) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix	2025-04-02 23:59:49 +02:00
Adam Osewski	e5ad48a784	Basic docs for universal gemm & ck-tile gemm. (#2014 ) * Basic docs for universal gemm & ck-tile gemm. * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Reviewers suggestions. * Align tparam names in doc with class tparams. * More reviewers fine tuning ;) --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-04-02 11:03:40 +02:00
Bartłomiej Kocot	8c0ab61ece	Grouped conv backward data GKCYX support (#2029 ) * Grouped conv backward data GKCYX support * profiler * Converter * split instances	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	ec742908bd	Grouped conv fwd v3 fix for SplitN an G > 1 (#2038 ) * Grouped conv fwd v3 fix for SplitN an G > 1 * Remove int8 large test * Retore int8 test	2025-04-01 13:19:35 -07:00
Seunghoon Lee	df32020f93	Fix Windows build. (#2012 ) * Remove duplicate using uint64_t. * Cast before shift.	2025-04-01 12:22:10 -07:00
Max Podkorytov	c59a8bb206	add a fast compilation path for static for (0..N) (#2005 ) * add a fast compilation path for static for (0..N) * Update functional2.hpp add comment and put range applier into detail namespace * Update functional.hpp ditto for ck-tile * prettify * prettify more * add comment * clang-format	2025-04-01 12:06:25 -07:00
Rostyslav Geyyer	441343a23d	Add MX FP4 device conversion tests (#1889 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add printf to fix intrinsic * Add a sw-based workaround	2025-03-26 19:23:01 -05:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
Andriy Roshchenko	72d888821c	MX GEMM examples with FP8, FP16, and E8M0 scales (#2016 ) * Add `scalar_type` specification for E8M0 exponent * Specialize `nnvb_data_t_selector` for E8M0 exponent * Remove partial specializations for `scalar_type` of `non_native_vector_base` template * Reword command line helper string * Create MX GEMM examples for different scales	2025-03-25 15:33:03 -06:00
Max Podkorytov	1a58522f01	use fast path for sequence generation in old CK (#1993 )	2025-03-25 11:28:44 -07:00
Illia Silin	d2eab23958	Split up data_type header. (#1996 ) * split fp64 vector data type * add missing header * move e8m0 structs * split off numeric_utils header * fix typo * split off numeric limits header * update data_type header * fix clang format * split off vector type header * fix clang format * fix typo for binary_inf	2025-03-24 15:08:54 -07:00
Andriy Roshchenko	6660dc6b8e	Introduce MX GEMM for FP8 data type (#2000 )	2025-03-24 15:41:07 -06:00
Bartłomiej Kocot	5b0873c31a	Fix split N for large images in groupd conv fwd (#2004 ) * Fix split N for large images in groupd conv fwd * Fix comments	2025-03-22 23:19:49 +01:00
Attila T. Áfra	c79bf11148	Fix compile errors on Windows and Linux (#2002 ) * Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous) * Fix compile error (no matching function for call to 'cast_to_f32_from_f8')	2025-03-20 12:37:25 -07:00
felix	7eaedeb36c	Ck moe hot fix (#1979 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * clang format * rm useless comments --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-03-19 22:58:27 +08:00
aledudek	5095906975	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00

603 Commits