composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-16 19:09:59 +00:00

Author	SHA1	Message	Date
lalala-sh	8f426b1216	Moe gemm activation (#2026 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * impl int64 but result not correct * int64 index ok now * input output all ok * fix uint32 * revert v1 test * use uint32 * mork to support 13w tokens * moe sorting fix moebuf * fix merge * update moe api fix aiter build * fix buid * fuse silu * silu ok * acale ok * add silu * change code * gemm2 ok * gufusion compatible ok, fix warnings * gu fusion for m32 m64 ok * support bf16 cshuffle * i4 gemm2 ok * i4 gemm2 ok and i4 gemm1 build * 16x16 run ok * change flops; change cshuffle dtype * fuse gelu silu act in moe gemm1 * fp8 with act ready * int4 act ready * remove useless changes * remove useless code change * fix clang format * add the arch limit of int4 moe gemm * fuse moe activation * fix fp8 16x16 * fix no quant case * fix bugs * fix fp8 gufusion bug * remove useless comments * refine activation code & complete moe example * fix int8 bugs * merge tkw1 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: root <root@hjbog-srdc-51.amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `39ba03f25d`]	2025-04-23 10:35:34 +08:00
Rostyslav Geyyer	dbcaae42bf	Temporarily disable MX FP4 device tests (#2112 ) [ROCm/composable_kernel commit: `416e851584`]	2025-04-22 16:08:48 -05:00
Thomas Ning	005e61ce63	GEMM Multiply Multiply Fix (#2102 ) * fix the type convert and increase the BF16 conversion + the profile comment * fix the CI [ROCm/composable_kernel commit: `0cca8fa28f`]	2025-04-22 01:13:22 -07:00
Muhammed Emin Ozturk	eb8306b5f0	MI308 fix for streamk 1-Tile floating point exception (#2101 ) [ROCm/composable_kernel commit: `b092c18da7`]	2025-04-21 11:44:07 -07:00
lalala-sh	dc58110b06	enable do top k weights in moe stage1 gemm (#2094 ) * add switch for mul topk weights * fix bf16/f16 bugs * complete [ROCm/composable_kernel commit: `bcf5bb41be`]	2025-04-18 10:45:49 +08:00
Andriy Roshchenko	7972a39081	MX GEMM - Parameterized Test Template (#2088 ) * Tests for MX FP8 GEMM * Improve documentation [ROCm/composable_kernel commit: `213b203a3c`]	2025-04-16 19:56:00 -06:00
Andriy Roshchenko	a96e96ab4e	MX GEMM - Add MX BF8 example (#2071 ) * Add MX GEMM example for MX BF8 * Verified MX FP8 with 16x16x128 scale builtin * Verify MX BF8 GEMM with BF16 output [ROCm/composable_kernel commit: `da54464cce`]	2025-04-16 15:25:02 -06:00
Andriy Roshchenko	5e2bd20672	MX GEMM - New GEMM pipeline for MX data types (#2059 ) * Allow selection of mfma_scale instructions * Read B tensor from LDS to VGPR in chunks of 16 in MFMA order * Add constexpr and synchronize return type for `get_exponent_value` * Pass scales by reference and add comments to `mfma_scale_f32_32x32x64` * Add support for microscaling instructions in `XdlopsGemm` * Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper * Remove software implementation of MX GEMM * Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction * Update README * Updated CHANGELOG * Remove unused static methods [ROCm/composable_kernel commit: `7106976a72`]	2025-04-15 17:17:07 -06:00
Mingtao Gu	3cbda9a11c	CK pk_i4_t test failures fix (SWDEV-518629) (#2075 ) * fix pk_i4_v3 tests failures in Unbuntu env. * fix pk_i4_t tests failure on Unbuntu issues. * some fixed. --------- Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `56378f810f`]	2025-04-14 16:58:57 +08:00
Illia Silin	c91d046350	Fix build issues for multiple targets. (#2077 ) * build for multiple targets on gfx942 * add missing ignore statements [ROCm/composable_kernel commit: `0d4f145078`]	2025-04-11 12:12:53 -07:00
Illia Silin	9cc561987f	enable gfx115x support (#2065 ) [ROCm/composable_kernel commit: `3e6d21adeb`]	2025-04-09 10:06:42 -07:00
valarLip	aa6eeb7f66	add passthrough for int32->float32 (#2062 ) [ROCm/composable_kernel commit: `2c563fecf7`]	2025-04-08 15:16:30 -07:00
Max Podkorytov	1eec5cc073	simplify generate_tuple (#2043 ) [ROCm/composable_kernel commit: `6ce0797dad`]	2025-04-08 09:00:51 -07:00
Illia Silin	ad65c947c9	fix codegen issues (#2052 ) [ROCm/composable_kernel commit: `1793228422`]	2025-04-07 07:08:39 -07:00
Illia Silin	6d90b2eb50	Split env.hpp header from the ck.hpp header. (#2049 ) * split env.hpp out of main headers * fix namespace logic [ROCm/composable_kernel commit: `572cd820ce`]	2025-04-03 15:30:21 -07:00
Rostyslav Geyyer	1716380358	Add FP16/BF16<->FP8/BF8 conversions (#2035 ) * Move conversion functions and add missing conversions * Add tests * Add missing conversions * Add missing conversions * Add bf8 tests * Update clipping for vectors * Add missing conversions * Add bf16 fp8 tests * Add bf16 bf8 tests * Fix device conversion * Fix conversions * Fix vector use * Minor fix * Add a workaround flag * Add a workaround flag for bf16 conversion * Add another workaround * Add a workaround for fp16 to bf8 conversion * Update type alias * Add docstrings and missing wrappers * Fix if defined macros * Fix more if defined macros * Add comments * Remove __host__ specifier * Add a gfx950 guard * Update function naming [ROCm/composable_kernel commit: `265af71a71`]	2025-04-03 12:42:03 -05:00
aledudek	7a78bc823a	Post-merge changes for fully async args copy in ck grouped gemm (#1991 ) * Post-merge changes for fully async args copy in ck grouped gemm * Post-merge documentation and naming changes * Build fix and updated changelog * Revised comments [ROCm/composable_kernel commit: `9329432f6c`]	2025-04-03 13:35:43 +02:00
Bartłomiej Kocot	49565538fe	Add support for GKCYX grouped conv weight (#2023 ) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix [ROCm/composable_kernel commit: `2ccf914888`]	2025-04-02 23:59:49 +02:00
Adam Osewski	0a607256cf	Basic docs for universal gemm & ck-tile gemm. (#2014 ) * Basic docs for universal gemm & ck-tile gemm. * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Reviewers suggestions. * Align tparam names in doc with class tparams. * More reviewers fine tuning ;) --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `e5ad48a784`]	2025-04-02 11:03:40 +02:00
Bartłomiej Kocot	50fb390d6f	Grouped conv backward data GKCYX support (#2029 ) * Grouped conv backward data GKCYX support * profiler * Converter * split instances [ROCm/composable_kernel commit: `8c0ab61ece`]	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	b417137276	Grouped conv fwd v3 fix for SplitN an G > 1 (#2038 ) * Grouped conv fwd v3 fix for SplitN an G > 1 * Remove int8 large test * Retore int8 test [ROCm/composable_kernel commit: `ec742908bd`]	2025-04-01 13:19:35 -07:00
Seunghoon Lee	0ac4c37028	Fix Windows build. (#2012 ) * Remove duplicate using uint64_t. * Cast before shift. [ROCm/composable_kernel commit: `df32020f93`]	2025-04-01 12:22:10 -07:00
Max Podkorytov	70ad8571c2	add a fast compilation path for static for (0..N) (#2005 ) * add a fast compilation path for static for (0..N) * Update functional2.hpp add comment and put range applier into detail namespace * Update functional.hpp ditto for ck-tile * prettify * prettify more * add comment * clang-format [ROCm/composable_kernel commit: `c59a8bb206`]	2025-04-01 12:06:25 -07:00
Rostyslav Geyyer	48fa126a9e	Add MX FP4 device conversion tests (#1889 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add printf to fix intrinsic * Add a sw-based workaround [ROCm/composable_kernel commit: `441343a23d`]	2025-03-26 19:23:01 -05:00
Bartłomiej Kocot	f967fd7296	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes [ROCm/composable_kernel commit: `54c81a1fcf`]	2025-03-26 21:13:38 +01:00
Andriy Roshchenko	3f06d019ba	MX GEMM examples with FP8, FP16, and E8M0 scales (#2016 ) * Add `scalar_type` specification for E8M0 exponent * Specialize `nnvb_data_t_selector` for E8M0 exponent * Remove partial specializations for `scalar_type` of `non_native_vector_base` template * Reword command line helper string * Create MX GEMM examples for different scales [ROCm/composable_kernel commit: `72d888821c`]	2025-03-25 15:33:03 -06:00
Max Podkorytov	a70c642a78	use fast path for sequence generation in old CK (#1993 ) [ROCm/composable_kernel commit: `1a58522f01`]	2025-03-25 11:28:44 -07:00
Illia Silin	9dcfd79dcb	Split up data_type header. (#1996 ) * split fp64 vector data type * add missing header * move e8m0 structs * split off numeric_utils header * fix typo * split off numeric limits header * update data_type header * fix clang format * split off vector type header * fix clang format * fix typo for binary_inf [ROCm/composable_kernel commit: `d2eab23958`]	2025-03-24 15:08:54 -07:00
Andriy Roshchenko	6b2b228eb4	Introduce MX GEMM for FP8 data type (#2000 ) [ROCm/composable_kernel commit: `6660dc6b8e`]	2025-03-24 15:41:07 -06:00
Bartłomiej Kocot	61b5e8318a	Fix split N for large images in groupd conv fwd (#2004 ) * Fix split N for large images in groupd conv fwd * Fix comments [ROCm/composable_kernel commit: `5b0873c31a`]	2025-03-22 23:19:49 +01:00
Attila T. Áfra	8cf49a44a9	Fix compile errors on Windows and Linux (#2002 ) * Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous) * Fix compile error (no matching function for call to 'cast_to_f32_from_f8') [ROCm/composable_kernel commit: `c79bf11148`]	2025-03-20 12:37:25 -07:00
felix	7e4c8a56ed	Ck moe hot fix (#1979 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * clang format * rm useless comments --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `7eaedeb36c`]	2025-03-19 22:58:27 +08:00
aledudek	94a88e2ecc	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue [ROCm/composable_kernel commit: `5095906975`]	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	36f9cc5fb0	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example [ROCm/composable_kernel commit: `c2e4898b4b`]	2025-03-17 13:32:00 +01:00
feli	4865ff49f3	ck_moe: fix useless code and remove usless oob (#1972 ) * fix useless code and remove usless oob * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `251afab3b7`]	2025-03-12 09:22:42 -07:00
Illia Silin	73a1f03e58	use old instrinsics with staging compiler (#1970 ) [ROCm/composable_kernel commit: `4c97cc511e`]	2025-03-12 07:29:09 -07:00
Haocong WANG	b31ac9d14d	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `cbd74c2d12`]	2025-03-11 10:11:21 -07:00
Mingtao Gu	68987f96d4	Ck int4 moe develop (#1949 ) * Add Gemm fp8xint4 example and kernel, function pass. * Init Gemm_fp8xint4 Bpreshuffle * Added gemm_fp8xint4_Bpreshuffle files, function not checked yet * General fix. * fp8xint4 bpreshuffle function pass * fix. * init b preshuffle dequant in VGPR. * fix bug, function pass. * move b thread dequant copy to blockwise. * fix bug, function now passes. * modified the tile size to 256, 128x128x128. * fixed a bug. * Initial int4 moe, compile pass, function not check. * fix bug in moe_gemm1.cpp, now function pass. * test expert = 8 and function pass. * Added moe_pk_i4_gemm2, function pass. * Added b preshuffle pipeline v3 support. * fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass. * Split the blockwise pipeline for fp8xint4. * commit missing files * opt gemm2 to 2x2 wave * fix swizzle = false * update int4 moe with latest input changes. * update tile size. * enable pipeline v3. * fix nswizzle = true * commit a version for compiler debug. * Updated transfer_v3r1_gather to support pk_i4_t type. * for int4 moe2 for type_convert support. * remove some values between mfma instructions. * fix int4 moe * Updated transfer_v3r1_gather to support pk_i4_t type. * i4 support lds multiple shuffle * fixed int4 moe tflops calculation. * Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle * updated gemm2. * change int4 moe example names * fix and format code. * format. * format codes. * update fp8xint4 example tile size. * add <unordered_map> header * fixed. * format. * Added conditional compilation for int4 -> fp8 conversion kernels --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `0db7c8f0b2`]	2025-03-10 11:16:44 +08:00
Thomas Ning	26ef4eed97	Fix on the error (#1956 ) [ROCm/composable_kernel commit: `9d51d17dd0`]	2025-03-07 13:43:52 -08:00
carlushuang	710aa99819	[CK_TILE][HOTFIX] WA for address space by disable it completely (#1947 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format * WA for address space by disable it completely * hot fix moe gemm2 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> [ROCm/composable_kernel commit: `c12fb0a624`]	2025-03-06 12:01:25 +08:00
Illia Silin	9673ebcd71	Replace buffer load/store intrinsics with builtins (#1876 ) * replace buffer load/store intrinsics with builtins * fix clang format * replace buffer load/store intrinsics with built-ins in ck_tile * fix clang format * add switch between buffer intrinsics and built-ins * change the builtins threshold to clang20 * fix clang format * fix some compilation errors * revert changes in ck_tile * revert changes in ck_tile * delete all root files and folders when CI completes * try changing the username in CI * fix groovy syntax * add user and group id info to ci dockers * change ownership of all files in CI to jenkins at the end * update changelog [ROCm/composable_kernel commit: `a88bf76ecc`]	2025-03-05 14:33:28 -08:00
Illia Silin	2bbbd3c6e0	remove support for gfx940 and gfx941 targets (#1944 ) * remove support for gfx940 and gfx941 targets * update changelog [ROCm/composable_kernel commit: `9b51c08bf7`]	2025-03-05 11:07:33 -08:00
feli	dc656b04d8	ck moe gemm implement (#1936 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `3786e16375`]	2025-03-05 15:56:55 +08:00
jefyang1	80059eaa3b	Remove CK_USE_AMD_MFMA_GFX950 (#1935 ) * Add runtime check in example_gemm_xdl_streamk for gfx950 * Add runtime check in grouped conv fwd examples for gfx950 * Disable CK_USE_AMD_MFMA_GFX950 * Add new instances for gfx950 * Fix test_gemm_universal on gfx950 [ROCm/composable_kernel commit: `c95bda93ba`]	2025-03-04 10:32:25 -08:00
arai713	1b65b09009	MIGraphX hipRTC fix (#1923 ) * fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file) * fixed device op error: misplaced header guard * restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation * add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation * fixed conditional error in amd_ck_fp8.hpp * Add MaskOutUpperTriangle as a problem parameter to BatchedGemmSoftmaxGemm and disable tests with MaskOutUpperTriangle==True. Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> --------- Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com> [ROCm/composable_kernel commit: `fd06ed926c`]	2025-03-03 07:55:05 -08:00
asleepzzz	de1a2544c6	Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 )" (#1933 ) This reverts commit 06e1eee9bbb737f0ee3b2374f1857838c2b8ef3f. [ROCm/composable_kernel commit: `ef16010273`]	2025-03-03 07:17:39 -08:00
Haocong WANG	f43d87b09c	[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> [ROCm/composable_kernel commit: `020148d0f7`]	2025-02-25 15:42:20 +08:00
coconutruben	828eaf2e5f	device_prop.hpp - replace map with compile time hash and switch (#1898 ) * device_prop.hpp - replace map with compile time hash and switch Summary: We replace a static const map with a compile time hash function and a switch statement to achieve the same goal: translate names to architectures. Most of these are very old, however the function needs to continue to work. Why? because the static map can cause issues when compiling into libraries that get dynamically loaded/unloaded, leading to memory corruption Test Plan: Running pytorch `torch.compile()` with CK enabled, and seeing it not segfault on the 2nd kernel (1st reload of the library) Reviewers: Subscribers: Tasks: Tags: * clang-format [ROCm/composable_kernel commit: `fcd4a6f3d1`]	2025-02-24 09:57:55 -08:00
Andriy Roshchenko	c3175995ba	MX FP GEMM - Test MX FP8 MFMA Instructions (#1902 ) * Refactored `load_A_row_major` to follow scale mapping * Refactored `load_A_col_major` to follow scale mapping * Refactored `load_B_col_major` to follow scale mapping * Verified non-scaled test * Verified scaled tests * Used ReferenceMXGemm for verification * Updated license headers [ROCm/composable_kernel commit: `ffa13455a2`]	2025-02-21 13:35:54 -07:00
Illia Silin	379249cce3	Rebase the PR #1520 to ROCm repo. (#1574 ) * Implement hiprtc for codegen tests * Introduce gemm_softmax_gemm to codegen. * Fix codegen build issues. * Address PR comments. * Separate ck_host lib and gemm_softmax_gemm into different PR. * Fix cmake. * Replace ENV variable with CMake option for toggling hipRTC in codegen tests. * Address PR comments. * fix clang format * Add missing header in magic_division.hpp * - Workaround for hipRTC content wrapper - Move descriptor for gemm_softmax_gemm to different branch * Fix formatting. * Revert "Fix formatting." This reverts commit `b5209eaef4`. * formatting fix * fixed header guard issues * updated header guards * updated data_type for new types * fixed redefinition error * Add codegen test for batched_gemm_softmax_gemm. Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> * formatting fix --------- Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> Co-authored-by: Dino Musić <dino.music@htecgroup.com> Co-authored-by: Mirza Halilcevic <mirza.halilcevic@htecgroup.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com> Co-authored-by: Astha Rai <astha.rai713@gmail.com> Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com> [ROCm/composable_kernel commit: `68a08c872e`]	2025-02-20 18:58:14 -08:00

1 2 3 4 5 ...

587 Commits