composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
lalala-sh	01344f2e19	add switch for topk weights in each stages	2025-04-12 15:47:14 +00:00
lalala-sh	0353337447	refine activation code & complete moe example	2025-04-11 02:18:16 +00:00
lalala-sh	d2a82f56e2	remove useless comments	2025-04-10 02:23:08 +00:00
lalala-sh	4ff3ca6a0f	fix fp8 gufusion bug	2025-04-10 02:03:38 +00:00
lalala-sh	8da84640a4	Merge branch 'develop' into moe_gemm_activation	2025-04-10 09:35:03 +08:00
Illia Silin	3e6d21adeb	enable gfx115x support (#2065 )	2025-04-09 10:06:42 -07:00
lalala-sh	4c2abb376a	Merge branch 'develop' into moe_gemm_activation	2025-04-09 15:15:00 +08:00
MHYang-gh	03ce8729fd	Make buffer coherence configurable in tensor view (#2041 ) * Make buffer coherence configurable in tensor view * Fix clang-format for tensor_view.hpp	2025-04-08 15:34:11 -07:00
valarLip	2c563fecf7	add passthrough for int32->float32 (#2062 )	2025-04-08 15:16:30 -07:00
Max Podkorytov	6ce0797dad	simplify generate_tuple (#2043 )	2025-04-08 09:00:51 -07:00
aledudek	80aae6119b	[CK_TILE] Fix GEMM Memory Pipeline (#2034 ) * [CK_TILE] Fix GEMM Memory Pipeline * Fix transpose tile * Add comments	2025-04-08 12:40:04 +02:00
root	fbf91ada78	merge from testx	2025-04-08 06:55:35 +00:00
lalala-sh	62f99e5cca	Merge branch 'develop' into moe_gemm_activation	2025-04-08 14:43:41 +08:00
root	c0c1c04b50	fix bugs	2025-04-08 06:26:57 +00:00
Illia Silin	1793228422	fix codegen issues (#2052 )	2025-04-07 07:08:39 -07:00
Illia Silin	572cd820ce	Split env.hpp header from the ck.hpp header. (#2049 ) * split env.hpp out of main headers * fix namespace logic	2025-04-03 15:30:21 -07:00
Rostyslav Geyyer	265af71a71	Add FP16/BF16<->FP8/BF8 conversions (#2035 ) * Move conversion functions and add missing conversions * Add tests * Add missing conversions * Add missing conversions * Add bf8 tests * Update clipping for vectors * Add missing conversions * Add bf16 fp8 tests * Add bf16 bf8 tests * Fix device conversion * Fix conversions * Fix vector use * Minor fix * Add a workaround flag * Add a workaround flag for bf16 conversion * Add another workaround * Add a workaround for fp16 to bf8 conversion * Update type alias * Add docstrings and missing wrappers * Fix if defined macros * Fix more if defined macros * Add comments * Remove __host__ specifier * Add a gfx950 guard * Update function naming	2025-04-03 12:42:03 -05:00
aledudek	9329432f6c	Post-merge changes for fully async args copy in ck grouped gemm (#1991 ) * Post-merge changes for fully async args copy in ck grouped gemm * Post-merge documentation and naming changes * Build fix and updated changelog * Revised comments	2025-04-03 13:35:43 +02:00
root	20f6674bf6	fix no quant case	2025-04-03 02:46:01 +00:00
Bartłomiej Kocot	2ccf914888	Add support for GKCYX grouped conv weight (#2023 ) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix	2025-04-02 23:59:49 +02:00
root	b2b34fffbb	fix fp8 16x16	2025-04-02 16:27:52 +00:00
Adam Osewski	e5ad48a784	Basic docs for universal gemm & ck-tile gemm. (#2014 ) * Basic docs for universal gemm & ck-tile gemm. * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Reviewers suggestions. * Align tparam names in doc with class tparams. * More reviewers fine tuning ;) --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-04-02 11:03:40 +02:00
root	85f83330b5	fuse moe activation	2025-04-02 07:02:09 +00:00
Bartłomiej Kocot	8c0ab61ece	Grouped conv backward data GKCYX support (#2029 ) * Grouped conv backward data GKCYX support * profiler * Converter * split instances	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	ec742908bd	Grouped conv fwd v3 fix for SplitN an G > 1 (#2038 ) * Grouped conv fwd v3 fix for SplitN an G > 1 * Remove int8 large test * Retore int8 test	2025-04-01 13:19:35 -07:00
Illia Silin	dcfec66bc4	Merge branch 'develop' into moe_gemm_activation	2025-04-01 12:28:10 -07:00
Seunghoon Lee	df32020f93	Fix Windows build. (#2012 ) * Remove duplicate using uint64_t. * Cast before shift.	2025-04-01 12:22:10 -07:00
Max Podkorytov	c59a8bb206	add a fast compilation path for static for (0..N) (#2005 ) * add a fast compilation path for static for (0..N) * Update functional2.hpp add comment and put range applier into detail namespace * Update functional.hpp ditto for ck-tile * prettify * prettify more * add comment * clang-format	2025-04-01 12:06:25 -07:00
illsilin	d3fb5a9b8d	fix clang format	2025-03-28 11:52:40 -07:00
rocking	8a20b62e91	Reduce redundant space in bias tensor (#2024 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-28 21:58:06 +08:00
lalala-sh	656a3657cb	remove useless code change	2025-03-28 16:25:00 +08:00
root	7054d45579	remove useless changes	2025-03-28 08:21:42 +00:00
root	c9ce26837c	Merge remote-tracking branch 'origin/develop' into moe_gemm_activation	2025-03-28 07:57:53 +00:00
felix	a82f338fb9	hotfix fix sorting int64 (#2025 ) * fix sorting int64 * clang format * fix example issue * update WA issue # --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2025-03-28 11:31:52 +08:00
root	529a1732cd	fp8 with act ready	2025-03-28 02:34:38 +00:00
lalala-sh	d0f3e87129	Merge remote-tracking branch 'origin/develop' into moe_gemm_activation	2025-03-28 10:02:27 +08:00
root	4f4bce30cf	fuse gelu silu act in moe gemm1	2025-03-27 16:59:53 +08:00
coderfeli	bd4a8c71d4	i4 gemm2 ok and i4 gemm1 build	2025-03-27 06:41:31 +00:00
Rostyslav Geyyer	441343a23d	Add MX FP4 device conversion tests (#1889 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add printf to fix intrinsic * Add a sw-based workaround	2025-03-26 19:23:01 -05:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
coderfeli	9729c9e3f7	i4 gemm2 ok	2025-03-26 11:41:21 +00:00
coderfeli	6a0cc4aad1	gu fusion for m32 m64 ok	2025-03-26 05:58:22 +00:00
coderfeli	74d8ac608f	gufusion compatible ok, fix warnings	2025-03-26 02:20:30 +00:00
Andriy Roshchenko	72d888821c	MX GEMM examples with FP8, FP16, and E8M0 scales (#2016 ) * Add `scalar_type` specification for E8M0 exponent * Specialize `nnvb_data_t_selector` for E8M0 exponent * Remove partial specializations for `scalar_type` of `non_native_vector_base` template * Reword command line helper string * Create MX GEMM examples for different scales	2025-03-25 15:33:03 -06:00
Max Podkorytov	1a58522f01	use fast path for sequence generation in old CK (#1993 )	2025-03-25 11:28:44 -07:00
coderfeli	6ca5892256	gemm2 ok	2025-03-25 15:01:10 +00:00
ruanjm	d49abdaa87	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx	2025-03-25 20:09:45 +08:00
coderfeli	234b8d415c	change code	2025-03-25 09:44:32 +00:00
coderfeli	0d266bfd65	add silu	2025-03-25 03:01:27 +00:00
coderfeli	2b15b67b3f	acale ok	2025-03-25 02:52:04 +00:00

781 Commits