* Prepare files for DeviceGemm_Wmma_CShuffleV3
* Implement main part of CShuffleV3 with block pipeline v3 for WMMA
* Remove unused functions and template params for A/B descriptors
* Support both gfx11 and gfx12
* Enable SplitK for gfx12 and disable it for gfx11 (see the sketch after this entry)
* Added RowColRow layout for DeviceGemmV2 fp16
* Added more instances for Row, Col, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout
* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout
* Fix formatting
* Add documentation
Based on 5585c3121e
* Enable gemm_universal profiling for gfx11/12
* Add WMMA intrinsics for F8/BF8
* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances
* Add BF16 instances and tests
* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
[ROCm/composable_kernel commit: edd92fc546]
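The SplitK item above is a per-architecture, compile-time decision. A minimal sketch of one way to express such a gate, assuming HIP's per-target processor macros; `CK_WMMA_CSHUFFLEV3_SPLITK` is a hypothetical name, not CK's actual macro:

```cpp
// Hypothetical gate (not CK's actual macro): SplitK for the WMMA
// CShuffleV3 kernel is enabled on gfx12 targets and disabled on gfx11.
#if defined(__gfx1200__) || defined(__gfx1201__)
#define CK_WMMA_CSHUFFLEV3_SPLITK 1
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
#define CK_WMMA_CSHUFFLEV3_SPLITK 0
#endif
```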
* Allow selection of mfma_scale instructions
* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order
* Make `get_exponent_value` constexpr and unify its return type (see the sketch after this entry)
* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`
* Add support for microscaling instructions in `XdlopsGemm`
* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper
* Remove software implementation of MX GEMM
* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction
* Update README
* Updated CHANGELOG
* Remove unused static methods
[ROCm/composable_kernel commit: 7106976a72]
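For context on the `get_exponent_value` item above, a minimal sketch of a constexpr exponent extractor with a single return type; this only illustrates the constexpr/return-type cleanup and is not CK's implementation, which covers its own floating-point types:

```cpp
#include <bit>
#include <cstdint>

// Sketch only: extract the biased exponent field of an IEEE-754 binary32
// value at compile time (requires C++20 for constexpr std::bit_cast).
constexpr std::int32_t get_exponent_value(float x)
{
    const auto bits = std::bit_cast<std::uint32_t>(x);
    return static_cast<std::int32_t>((bits >> 23) & 0xFF);
}

static_assert(get_exponent_value(1.0f) == 127);
```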
* add a fast compilation path for `static_for` (0..N) (see the sketch after this entry)
* Update functional2.hpp: add comment and put range applier into detail namespace
* Update functional.hpp: ditto for ck-tile
* prettify
* prettify more
* add comment
* clang-format
[ROCm/composable_kernel commit: c59a8bb206]
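The fast path referenced above replaces one recursive template instantiation per iteration with a single variadic expansion over an index sequence. A minimal sketch of that technique; the names are illustrative, not CK's actual "range applier":

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

namespace detail {
// One variadic instantiation replaces N recursive ones: the fold
// expression applies f to each compile-time index in order.
template <typename F, std::size_t... Is>
constexpr void static_for_impl(F&& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}
} // namespace detail

// static_for<N>(f) calls f(0), f(1), ..., f(N-1) with compile-time indices.
template <std::size_t N, typename F>
constexpr void static_for(F&& f)
{
    detail::static_for_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}
```

Usage: `static_for<4>([&](auto i) { buf[i.value] = i.value; });` unrolls fully at compile time while keeping each index a constant expression.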
* Add conversion tests
* Fix ctor
* Fix NaN logic
* Fix conversion logic
* Permute packed `f4_t` values (see the sketch after this entry)
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector SR (stochastic-rounding) conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add printf to fix intrinsic
* Add a software-based workaround
[ROCm/composable_kernel commit: 441343a23d]
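The `f4_t` work above revolves around two 4-bit values sharing one byte, where nibble order matters for conversion and permutation. A hypothetical layout to illustrate; CK's real packed type and its element order may differ:

```cpp
#include <cstdint>

// Hypothetical packed pair of 4-bit floats: element 0 in the low nibble,
// element 1 in the high nibble. Getting this order wrong is the kind of
// bug the "permute"/"repack" commits above address.
struct f4x2_pk
{
    std::uint8_t data;

    constexpr std::uint8_t unpack(int i) const
    {
        return (i == 0) ? (data & 0x0Fu) : (data >> 4);
    }

    static constexpr f4x2_pk pack(std::uint8_t lo, std::uint8_t hi)
    {
        return {static_cast<std::uint8_t>((lo & 0x0Fu) | (hi << 4))};
    }
};

static_assert(f4x2_pk::pack(0x3, 0xA).unpack(1) == 0xA);
```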
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)
* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')
[ROCm/composable_kernel commit: c79bf11148]
* Added two kernels for the M=32 problem
* Comment out the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2; v1 works for small K
* add instance for gemm_ab_scale
* fix CMakeLists of ckProfiler
* optimize blockscale GEMM. TODO: reduce VGPR usage
* fix a correctness bug
* sanity checked
* revert ckProfiler CMake changes
* clang format
* revert unnecessary changes.
* remove commented code.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads when building libs for all targets to 50
* use ninja to build for all targets
* reduce the number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: cbd74c2d12]
* port all moe changes from ck_moe_gemm branch
* refine code in the PR
* fix odd tail handling
* fix clang format
* fix clang format2
* make hot loop scheduler compatible with 16x16 and 32x32
* clang format
* fix per-token quant
* rename moe example
* clang format
* Work around the address-space issue by disabling it completely
* hotfix for moe gemm2
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
[ROCm/composable_kernel commit: c12fb0a624]
* replace buffer load/store intrinsics with builtins
* fix clang format
* replace buffer load/store intrinsics with builtins in ck_tile
* fix clang format
* add switch between buffer intrinsics and builtins (see the sketch after this entry)
* change the builtins threshold to Clang 20
* fix clang format
* fix some compilation errors
* revert changes in ck_tile
* revert changes in ck_tile
* delete all root files and folders when CI completes
* try changing the username in CI
* fix groovy syntax
* add user and group id info to ci dockers
* change ownership of all files in CI to jenkins at the end
* update changelog
[ROCm/composable_kernel commit: a88bf76ecc]
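On the intrinsics-to-builtins switch above: the pattern is a compile-time toggle keyed on the Clang version, falling back to the hand-declared LLVM intrinsic on older compilers. A sketch of the shape of such a switch; the builtin name and signature shown are an assumption based on Clang's AMDGPU builtins and may not match CK's exact usage:

```cpp
using int32x4_t = int __attribute__((ext_vector_type(4)));

#if defined(__clang_major__) && __clang_major__ >= 20
// Assumed Clang builtin; verify the signature against your toolchain.
__device__ int amd_buffer_load_i32(int32x4_t rsrc, int voffset, int soffset)
{
    return __builtin_amdgcn_raw_buffer_load_b32(rsrc, voffset, soffset, 0);
}
#else
// Legacy path: declare the LLVM intrinsic directly by its mangled name.
__device__ int
llvm_amdgcn_raw_buffer_load_i32(int32x4_t rsrc, int voffset, int soffset,
                                int aux) __asm("llvm.amdgcn.raw.buffer.load.i32");

__device__ int amd_buffer_load_i32(int32x4_t rsrc, int voffset, int soffset)
{
    return llvm_amdgcn_raw_buffer_load_i32(rsrc, voffset, soffset, 0);
}
#endif
```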
* port all moe changes from ck_moe_gemm branch
* refine code in the PR
* fix odd tail handling
* fix clang format
* fix clang format2
* make hot loop scheduler compatible with 16x16 and 32x32
* clang format
* fix per-token quant
* rename moe example
* clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
[ROCm/composable_kernel commit: 3786e16375]
* fixed hiprtc compilation issues from new additions; removed clashing mixed-precision functionality from codegen (ignore the whole file)
* fixed device op error: misplaced header guard
* restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation
* add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation (see the sketch after this entry)
* fixed conditional error in amd_ck_fp8.hpp
* Add MaskOutUpperTriangle as a problem parameter to BatchedGemmSoftmaxGemm and disable tests with MaskOutUpperTriangle==True.
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
---------
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
[ROCm/composable_kernel commit: fd06ed926c]
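On the CK_CODE_GEN_RTC item above: hiprtc compiles device code only, so host-side polymorphic interfaces must be compiled out. A sketch of the guard pattern that entry describes; the struct and method here are illustrative, not the exact CK declarations:

```cpp
// When CK_CODE_GEN_RTC is defined (hiprtc path), the virtual host-side
// interface is excluded from the translation unit entirely.
#ifndef CK_CODE_GEN_RTC
struct BaseOperator
{
    virtual bool IsSupportedArgument(const void* p_arg) const = 0;
    virtual ~BaseOperator()                                    = default;
};
#endif
```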
* Added two kernels for the M=32 problem
* Comment out the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2; v1 works for small K
* add instance for gemm_ab_scale
* fix CMakeLists of ckProfiler
* optimize blockscale GEMM. TODO: reduce VGPR usage
* fix a correctness bug
* sanity checked
* revert ckProfiler CMake changes
* clang format
* revert unnecessary changes.
* remove commented code.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
[ROCm/composable_kernel commit: 020148d0f7]
* device_prop.hpp - replace map with compile-time hash and switch
Summary:
We replace a static const map with a compile-time hash function and a switch statement that achieve the same goal: translating device names to architectures. Most of these names are very old; however, the function needs to continue to work. Why? Because a static map can cause issues when compiled into libraries that get dynamically loaded and unloaded, leading to memory corruption. (A sketch of the pattern follows this entry.)
Test Plan:
Run PyTorch `torch.compile()` with CK enabled and confirm it no longer segfaults on the 2nd kernel (1st reload of the library).
* clang-format
[ROCm/composable_kernel commit: fcd4a6f3d1]
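A minimal sketch of the map-to-switch replacement described above: a constexpr string hash lets the case labels be computed at compile time, so no static map object exists to be corrupted across dlopen/dlclose cycles. The hash choice and architecture list are illustrative, not CK's exact code:

```cpp
#include <cstdint>
#include <string_view>

// constexpr FNV-1a: usable in switch case labels.
constexpr std::uint32_t fnv1a(std::string_view s)
{
    std::uint32_t h = 2166136261u;
    for(char c : s)
    {
        h ^= static_cast<std::uint8_t>(c);
        h *= 16777619u;
    }
    return h;
}

enum class Arch { gfx908, gfx90a, gfx942, unknown };

// Pure code, no static storage: repeated library load/unload cycles
// have nothing to corrupt.
constexpr Arch arch_from_name(std::string_view name)
{
    switch(fnv1a(name))
    {
        case fnv1a("gfx908"): return Arch::gfx908;
        case fnv1a("gfx90a"): return Arch::gfx90a;
        case fnv1a("gfx942"): return Arch::gfx942;
        default: return Arch::unknown;
    }
}

static_assert(arch_from_name("gfx90a") == Arch::gfx90a);
```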