composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 00:58:44 +00:00

Author	SHA1	Message	Date
jakpiase	9bf1c41338	[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919 ) * switch to universal gemm for batched and grouped gemms * added reviewer comments * fixed grouped gemm tests [ROCm/composable_kernel commit: `0e91d32c61`]	2025-03-20 11:17:04 +01:00
rocking	8be61cfc9d	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `b819c217e4`]	2025-03-20 00:06:45 +08:00
felix	7e4c8a56ed	Ck moe hot fix (#1979 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * clang format * rm useless comments --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `7eaedeb36c`]	2025-03-19 22:58:27 +08:00
Bartłomiej Kocot	71c6106947	Add grouped conv bwd wei merged grouped instance for larger filter (#1984 ) * Add grouped conv bwd wei merged grouped instance for larger filter * Update readme [ROCm/composable_kernel commit: `fdaff5603e`]	2025-03-18 16:16:24 +01:00
Illia Silin	8df9260b3e	Add a daily CI build on gfx908. (#1987 ) * add one daily ci build on gfx908 * add redis invocation tag for gfx908 * make ci build for gfx908 conditional * fix groovy logic * add option to run perf tests for gfx908 * disable a few tests on mi100 [ROCm/composable_kernel commit: `1342ecf7fb`]	2025-03-17 18:08:53 -07:00
Illia Silin	56ad0cdd62	disable ck_tile basic gemm (#1986 ) [ROCm/composable_kernel commit: `07f25186b2`]	2025-03-17 15:26:43 -07:00
aledudek	94a88e2ecc	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue [ROCm/composable_kernel commit: `5095906975`]	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	36f9cc5fb0	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example [ROCm/composable_kernel commit: `c2e4898b4b`]	2025-03-17 13:32:00 +01:00
valarLip	3d06952a2b	hotfix fmoe build issue (#1976 ) [ROCm/composable_kernel commit: `52b1cd7780`]	2025-03-13 15:11:59 +08:00
dependabot[bot]	4643738f4d	Bump rocm-docs-core from 1.17.1 to 1.18.1 in /docs/sphinx (#1977 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.1...v1.18.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `de7a745ca6`]	2025-03-12 23:36:36 -07:00
carlushuang	5293517d0a	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit a222573537b139c8a4f6b870910bdd487935efa8. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3e81279d26`]	2025-03-13 11:41:39 +08:00
Illia Silin	2559223c72	disable tests that take too long to build for gfx90a (#1975 ) [ROCm/composable_kernel commit: `d4a6d69643`]	2025-03-12 17:54:03 -07:00
feli	4865ff49f3	ck_moe: fix useless code and remove usless oob (#1972 ) * fix useless code and remove usless oob * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `251afab3b7`]	2025-03-12 09:22:42 -07:00
Illia Silin	73a1f03e58	use old instrinsics with staging compiler (#1970 ) [ROCm/composable_kernel commit: `4c97cc511e`]	2025-03-12 07:29:09 -07:00
Illia Silin	c62042b022	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit 4399ad790293bb76e4cf89f333017cf06a6d3e82. [ROCm/composable_kernel commit: `8cbcd3e0d0`]	2025-03-11 10:40:18 -07:00
Haocong WANG	b31ac9d14d	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `cbd74c2d12`]	2025-03-11 10:11:21 -07:00
Haocong WANG	6357ab039d	reduce test size to avoid timeout on specific silicon (#1966 ) [ROCm/composable_kernel commit: `ba209b9dab`]	2025-03-11 09:15:26 -07:00
Illia Silin	fa7f480e20	disable example_moe_gemm2_xdl_pk_i4 on gfx950 (#1968 ) [ROCm/composable_kernel commit: `aa42c3db06`]	2025-03-11 08:34:47 -07:00
carlushuang	569e9892a6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update [ROCm/composable_kernel commit: `7a93b16ff6`]	2025-03-11 21:07:40 +08:00
Mingtao Gu	68987f96d4	Ck int4 moe develop (#1949 ) * Add Gemm fp8xint4 example and kernel, function pass. * Init Gemm_fp8xint4 Bpreshuffle * Added gemm_fp8xint4_Bpreshuffle files, function not checked yet * General fix. * fp8xint4 bpreshuffle function pass * fix. * init b preshuffle dequant in VGPR. * fix bug, function pass. * move b thread dequant copy to blockwise. * fix bug, function now passes. * modified the tile size to 256, 128x128x128. * fixed a bug. * Initial int4 moe, compile pass, function not check. * fix bug in moe_gemm1.cpp, now function pass. * test expert = 8 and function pass. * Added moe_pk_i4_gemm2, function pass. * Added b preshuffle pipeline v3 support. * fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass. * Split the blockwise pipeline for fp8xint4. * commit missing files * opt gemm2 to 2x2 wave * fix swizzle = false * update int4 moe with latest input changes. * update tile size. * enable pipeline v3. * fix nswizzle = true * commit a version for compiler debug. * Updated transfer_v3r1_gather to support pk_i4_t type. * for int4 moe2 for type_convert support. * remove some values between mfma instructions. * fix int4 moe * Updated transfer_v3r1_gather to support pk_i4_t type. * i4 support lds multiple shuffle * fixed int4 moe tflops calculation. * Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle * updated gemm2. * change int4 moe example names * fix and format code. * format. * format codes. * update fp8xint4 example tile size. * add <unordered_map> header * fixed. * format. * Added conditional compilation for int4 -> fp8 conversion kernels --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `0db7c8f0b2`]	2025-03-10 11:16:44 +08:00
Thomas Ning	66663bf2d9	Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955 ) * tempsave, not selected * finish the feature and merge with develop --------- Co-authored-by: aska-0096 <haocwang@amd.com> [ROCm/composable_kernel commit: `c954bd0cfa`]	2025-03-07 13:44:06 -08:00
Thomas Ning	26ef4eed97	Fix on the error (#1956 ) [ROCm/composable_kernel commit: `9d51d17dd0`]	2025-03-07 13:43:52 -08:00
Illia Silin	fb928a315f	add missing headers (#1959 ) [ROCm/composable_kernel commit: `0e8e711ec8`]	2025-03-07 11:11:30 -08:00
Max Podkorytov	4517b3a8da	refactor ck-tile kernel launch (#1925 ) [ROCm/composable_kernel commit: `9e132eb77c`]	2025-03-07 08:29:40 -08:00
Qianfeng	7771db8ecf	Ck tile/complete k prefetch (#1941 ) * Re-implement qr_ks_vs_async pipeline by using kLoadOnce * Remove last block_sync_lds() in the loop * Tiny adjustment in qr_ks_vs_async pipeline for better performance * Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline * Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline * Use un-rolled gemm for Gemm-0 * Use k0_loops small tile load/store to replace the big tile load/store for K * Remove the commented lines in qx_ks_vs_custom_policy.hpp * Tune the prefetching of V in qr_ks_vs_async pipeline * Move the codes for storing the first v_lds tile some later * Let BlockDropout reuse LDS with V * Switch to separate code blocks according to iteration index * Interleave code blocks for better performance * Move clear_tile(s_acc) for better interleaving * Move code interleaving * Use MakeQDramTileDistribution for q_dram_window * Roll-back to load Q directly from global memory instead of using LDS as intermediary stop * Let V reuse the LDS of K * Use array of tiles to represent Q in vgprs * Use QLoadOnce == false for qr_ks_vs_async pipeline * Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline * Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile() * Move the definition of v_tiles out from the loop * Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile() * Remove using KLoadOnce in qx_ks_vs_custom_policy * Remove un-used get_slice_tile() call * Move the code line of clear_tile(s_acc) * Tune the lines of codes to make them more tidy * Re-arrange the codes before the main-loop * Add comments * Unify the alignment to be 8 for Q/K/V Lds decriptors * Tuning to K pre-loading * Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false * Adjust the pipeline codes * Use NumPrefetchV to separate from NumVLdsBuffers * Tune the location of a scheduler barrier code line * Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths * Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel * Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline * Move the location for store_tile() of first v_tile * Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline * Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<> * Try to fix old bugs in qx_ks_vs_custom_policy * Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler * Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy * Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV) * Tiny Fix (#1888) * Ck tile/paged attention workaround (#1894) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers * Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline * Remove one call of move_tile_window for q_dram_window * Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers() * Tiny fix in qr_ks_vs_whole_k_prefetch pipeline * Adjust the location of codes for storing the first V tile to LDS * Tiny fix and add comments * Change GetSmemKPackK size to improve performance * Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy * Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access * Adjustment in intermediate iteration codes for tiny performance improvement * Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation * Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue * Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer() * Remove useless AsyncopyV * Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used * Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy * Recover the qr_ks_vs_async pipeline * Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline * Revert "Try to fix old bugs in qx_ks_vs_custom_policy" This reverts commit `39b82ca194`. * Tiny fix with regard to whole_k_prefetch pipeline compiling * Update kPadSeqLenK setting in fmha_fwd_kernel * Use q_element_func and k_element_func * Use single q_tile rather than multiple sliced q_tiles * Codes refine according to the comments * Re-format one file * Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true [ROCm/composable_kernel commit: `4f54fa3058`]	2025-03-07 14:19:51 +08:00
Illia Silin	a03f2b367a	RE-enable DL and DPP instances by default. (#1954 ) * enable DL and DPP instances by default * fix cmake logic [ROCm/composable_kernel commit: `43c90b5234`]	2025-03-06 21:45:31 -08:00
Max Podkorytov	133ad1a3d8	Update CODEOWNERS (#1953 ) Add @tenpercent [ROCm/composable_kernel commit: `7a4a5d6c08`]	2025-03-06 17:38:29 -08:00
Juan Manuel Martinez Caamaño	74a4af680a	Fix typo: v_offset used in initialization of v_offset (#1951 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `0f62dd9928`]	2025-03-06 17:37:29 -08:00
kylasa	7fbcd06a62	Addressing (Post Merge) code review comments for PR 1845 (#1883 ) * Addressing code review comments. * Addressing code review comments. * Reorganized code for better readability. * add ck_tile gemms for new types in CI * fix jenkins syntax * fix script syntax * Add the test cases back * Address the review comments * Address review comments * clang format * Solve the merging issues * Addressed the comments * clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `66c5f5b0b6`]	2025-03-06 11:40:30 -08:00
carlushuang	710aa99819	[CK_TILE][HOTFIX] WA for address space by disable it completely (#1947 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format * WA for address space by disable it completely * hot fix moe gemm2 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> [ROCm/composable_kernel commit: `c12fb0a624`]	2025-03-06 12:01:25 +08:00
Illia Silin	9673ebcd71	Replace buffer load/store intrinsics with builtins (#1876 ) * replace buffer load/store intrinsics with builtins * fix clang format * replace buffer load/store intrinsics with built-ins in ck_tile * fix clang format * add switch between buffer intrinsics and built-ins * change the builtins threshold to clang20 * fix clang format * fix some compilation errors * revert changes in ck_tile * revert changes in ck_tile * delete all root files and folders when CI completes * try changing the username in CI * fix groovy syntax * add user and group id info to ci dockers * change ownership of all files in CI to jenkins at the end * update changelog [ROCm/composable_kernel commit: `a88bf76ecc`]	2025-03-05 14:33:28 -08:00
Adam Osewski	6d331b8641	[CK TILE] Fix KIterPerInnerLoop for block gemm. (#1934 ) * Fix KIterPerInnerLoop * Fix Kpack and KPerInnerLoop for block universal gemm. * Fix overlooked spelling bugs. [ROCm/composable_kernel commit: `4814db3905`]	2025-03-05 14:17:44 -08:00
Illia Silin	2bbbd3c6e0	remove support for gfx940 and gfx941 targets (#1944 ) * remove support for gfx940 and gfx941 targets * update changelog [ROCm/composable_kernel commit: `9b51c08bf7`]	2025-03-05 11:07:33 -08:00
asleepzzz	d0af93583a	Update CODEOWNERS (#1945 ) * Update CODEOWNERS Add new code owners * Update CODEOWNERS --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `d378233924`]	2025-03-05 10:29:51 -08:00
feli	dc656b04d8	ck moe gemm implement (#1936 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `3786e16375`]	2025-03-05 15:56:55 +08:00
jefyang1	80059eaa3b	Remove CK_USE_AMD_MFMA_GFX950 (#1935 ) * Add runtime check in example_gemm_xdl_streamk for gfx950 * Add runtime check in grouped conv fwd examples for gfx950 * Disable CK_USE_AMD_MFMA_GFX950 * Add new instances for gfx950 * Fix test_gemm_universal on gfx950 [ROCm/composable_kernel commit: `c95bda93ba`]	2025-03-04 10:32:25 -08:00
dependabot[bot]	5376fb1cf4	Bump rocm-docs-core from 1.17.0 to 1.17.1 in /docs/sphinx (#1937 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.17.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.17.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `540a6da40b`]	2025-03-03 22:37:30 -08:00
Juan Manuel Martinez Caamaño	96d4da0a5c	Missing _ in __HIPCC__ (#1930 ) [ROCm/composable_kernel commit: `57bb0e96a4`]	2025-03-03 08:19:47 -08:00
arai713	1b65b09009	MIGraphX hipRTC fix (#1923 ) * fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file) * fixed device op error: misplaced header guard * restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation * add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation * fixed conditional error in amd_ck_fp8.hpp * Add MaskOutUpperTriangle as a problem parameter to BatchedGemmSoftmaxGemm and disable tests with MaskOutUpperTriangle==True. Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> --------- Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com> [ROCm/composable_kernel commit: `fd06ed926c`]	2025-03-03 07:55:05 -08:00
asleepzzz	de1a2544c6	Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 )" (#1933 ) This reverts commit 06e1eee9bbb737f0ee3b2374f1857838c2b8ef3f. [ROCm/composable_kernel commit: `ef16010273`]	2025-03-03 07:17:39 -08:00
Bartłomiej Kocot	a81cf05757	[CK TILE] Fix double lds in ck tile gemm (#1924 ) [ROCm/composable_kernel commit: `1bf29478cd`]	2025-02-28 08:07:53 -08:00
rocking	4fbab6cbee	explicit show no feature in kernel name (#1920 ) [ROCm/composable_kernel commit: `faa2235dad`]	2025-02-28 14:23:30 +08:00
slippedJim	ddc3ff9878	make fmha bwd api template for v2 & v3 (#1918 ) * use template fmha_bwd function * update --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `a9bcd3c98d`]	2025-02-27 19:26:19 +08:00
Bartłomiej Kocot	c5acb522de	[CK TILE] Gemm pk_int4_t permute B (#1907 ) * [CK TILE] Gemm pk_int4_t permute B * Fixes [ROCm/composable_kernel commit: `0356ee069e`]	2025-02-27 11:01:14 +01:00
Bartłomiej Kocot	de16fbb133	[CK TILE] Block universal gemm lds<->vgpr optimizations (#1906 ) * [CK TILE] Block universal gemm lds<->vgpr optimizations * Rebase * Fixes [ROCm/composable_kernel commit: `bf1e17007e`]	2025-02-27 10:36:28 +01:00
rocking	482b340222	Apply filter to every kernel in the codgen of FMHA (#1911 ) * add receipt for fwd * Add receipt for bwd * Use kernel name to avoid more receipt * apply filter to every kernel [ROCm/composable_kernel commit: `e9ee568681`]	2025-02-26 20:20:29 +08:00
aledudek	185e48e164	[CK_TILE] Add EnvLogging and missing gemm args checks (#1896 ) * [CK_TILE] Add EnvLogging - refactor IsSupported error messages * [CK_TILE] Add EnvLogging - wrap gemm kernel error messages * [CK_TILE] Add EnvLogging - Add missing k_batch args check * [CK_TILE] Add EnvLogging - remove debug log * Add one check * [CK_TILE] EnvLogging - add CK_TILE_ERROR logs * [CK_TILE] EnvLogging quotes fix * [CK_TILE] EngLogging use function instead of macro for err logs * [CK_TILE] EnvLogging - refactor checking env var [ROCm/composable_kernel commit: `c9bcfd755e`]	2025-02-25 11:48:38 +01:00
carlushuang	581c75f3b7	[CK_TILE] add moe-sorting MP kernel (#1910 ) * moe sorting ex * fix bug for race condition * fix bug and optimze large expert * fix * optimize with sub_token_oneshot * support skip empty tokens for expert sorting * update moe_sorting * tidy code * support mp kernel * hint mp * remove use less code * porting to example 15 --------- Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `353a612b44`]	2025-02-25 17:56:55 +08:00
Haocong WANG	f43d87b09c	[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> [ROCm/composable_kernel commit: `020148d0f7`]	2025-02-25 15:42:20 +08:00
coconutruben	828eaf2e5f	device_prop.hpp - replace map with compile time hash and switch (#1898 ) * device_prop.hpp - replace map with compile time hash and switch Summary: We replace a static const map with a compile time hash function and a switch statement to achieve the same goal: translate names to architectures. Most of these are very old, however the function needs to continue to work. Why? because the static map can cause issues when compiling into libraries that get dynamically loaded/unloaded, leading to memory corruption Test Plan: Running pytorch `torch.compile()` with CK enabled, and seeing it not segfault on the 2nd kernel (1st reload of the library) Reviewers: Subscribers: Tasks: Tags: * clang-format [ROCm/composable_kernel commit: `fcd4a6f3d1`]	2025-02-24 09:57:55 -08:00
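
Note on the last entry above (#1898): its message describes replacing a static const map with a compile-time hash function and a switch statement to translate device names to architectures. The following is only a minimal sketch of that general technique, not the code from device_prop.hpp. The function names `fnv1a` and `normalize_arch`, the choice of the FNV-1a hash, and the two name pairs are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper: 32-bit FNV-1a hash, usable in constant expressions (C++17).
// The hash actually used in device_prop.hpp may differ.
constexpr std::uint32_t fnv1a(std::string_view s)
{
    std::uint32_t h = 2166136261u; // FNV offset basis
    for(char c : s)
    {
        h ^= static_cast<unsigned char>(c);
        h *= 16777619u; // FNV prime; unsigned overflow wraps by definition
    }
    return h;
}

// Hypothetical translation function. The case labels are hashed at compile
// time, so no static container has to be constructed or destroyed when the
// library is loaded or unloaded. Two labels with the same hash would be a
// duplicate-case compile error, so collisions among known names are caught
// at build time.
inline std::string_view normalize_arch(std::string_view raw)
{
    switch(fnv1a(raw))
    {
    // Illustrative entries only. The string comparison guards against an
    // unknown runtime name that happens to hash to a known label.
    case fnv1a("Vega10"): return raw == "Vega10" ? "gfx900" : raw;
    case fnv1a("Fiji"): return raw == "Fiji" ? "gfx803" : raw;
    default: return raw; // unknown names pass through unchanged
    }
}
```

In this sketch, `normalize_arch("Vega10")` yields `"gfx900"`, while a name with no entry, such as `"gfx90a"`, is returned unchanged.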


1748 Commits