composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 18:42:06 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	fd25f5df05	[ck_tile] Merge get_partition_index() and get_partition_index_v2() to get_partition_index() with bool_constant parameter	2025-08-08 06:22:11 +00:00
Qianfeng Zhang	40261225e8	[ck_tile] Add get_partition_index_v2 which uses warp_id in vgpr and to be used by tile_windows on lds-based tensor_view	2025-08-06 15:05:08 +00:00
Qianfeng Zhang	ae05715998	[ck_tile] Remove useless code lines in make_wave_buffer_resource	2025-08-04 09:40:22 +00:00
Qianfeng Zhang	4ae9acd712	Revert "Update exp() in ck_tile/core/numeric/math.hpp to use __expf" This reverts commit `7316a44ff3`.	2025-04-25 14:42:33 +00:00
Qianfeng Zhang	7316a44ff3	Update exp() in ck_tile/core/numeric/math.hpp to use __expf	2025-04-22 14:46:53 +00:00
Rostyslav Geyyer	441343a23d	Add MX FP4 device conversion tests (#1889 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add printf to fix intrinsic * Add a sw-based workaround	2025-03-26 19:23:01 -05:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
Andriy Roshchenko	72d888821c	MX GEMM examples with FP8, FP16, and E8M0 scales (#2016 ) * Add `scalar_type` specification for E8M0 exponent * Specialize `nnvb_data_t_selector` for E8M0 exponent * Remove partial specializations for `scalar_type` of `non_native_vector_base` template * Reword command line helper string * Create MX GEMM examples for different scales	2025-03-25 15:33:03 -06:00
Max Podkorytov	1a58522f01	use fast path for sequence generation in old CK (#1993 )	2025-03-25 11:28:44 -07:00
ruanjm	d49abdaa87	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx	2025-03-25 20:09:45 +08:00
Illia Silin	d2eab23958	Split up data_type header. (#1996 ) * split fp64 vector data type * add missing header * move e8m0 structs * split off numeric_utils header * fix typo * split off numeric limits header * update data_type header * fix clang format * split off vector type header * fix clang format * fix typo for binary_inf	2025-03-24 15:08:54 -07:00
Andriy Roshchenko	6660dc6b8e	Introduce MX GEMM for FP8 data type (#2000 )	2025-03-24 15:41:07 -06:00
MHYang-gh	c027637a8f	Fix A/B lds transform (#2007 )	2025-03-22 23:13:50 -07:00
Bartłomiej Kocot	5b0873c31a	Fix split N for large images in groupd conv fwd (#2004 ) * Fix split N for large images in groupd conv fwd * Fix comments	2025-03-22 23:19:49 +01:00
BingYuan.Zhou	5a0d693b86	fix ck_tile/basic_gemm build error (#1988 )	2025-03-20 22:01:14 -07:00
Attila T. Áfra	c79bf11148	Fix compile errors on Windows and Linux (#2002 ) * Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous) * Fix compile error (no matching function for call to 'cast_to_f32_from_f8')	2025-03-20 12:37:25 -07:00
carlushuang	e3c9886cdf	[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982 ) * return value with macro and revert the return value * [CK-TILE] no-macro launch api solution (#1992) * no-macro solution * address -Wcomma --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-03-20 11:00:29 -07:00
jakpiase	0e91d32c61	[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919 ) * switch to universal gemm for batched and grouped gemms * added reviewer comments * fixed grouped gemm tests	2025-03-20 11:17:04 +01:00
rocking	b819c217e4	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-20 00:06:45 +08:00
felix	7eaedeb36c	Ck moe hot fix (#1979 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * clang format * rm useless comments --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-03-19 22:58:27 +08:00
aledudek	5095906975	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue	2025-03-17 16:42:43 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00
carlushuang	3e81279d26	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `8cbcd3e0d0`. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-13 11:41:39 +08:00
feli	251afab3b7	ck_moe: fix useless code and remove usless oob (#1972 ) * fix useless code and remove usless oob * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-12 09:22:42 -07:00
Illia Silin	4c97cc511e	use old instrinsics with staging compiler (#1970 )	2025-03-12 07:29:09 -07:00
Illia Silin	8cbcd3e0d0	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit `7a93b16ff6`.	2025-03-11 10:40:18 -07:00
Haocong WANG	cbd74c2d12	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-11 10:11:21 -07:00
carlushuang	7a93b16ff6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update	2025-03-11 21:07:40 +08:00
Mingtao Gu	0db7c8f0b2	Ck int4 moe develop (#1949 ) * Add Gemm fp8xint4 example and kernel, function pass. * Init Gemm_fp8xint4 Bpreshuffle * Added gemm_fp8xint4_Bpreshuffle files, function not checked yet * General fix. * fp8xint4 bpreshuffle function pass * fix. * init b preshuffle dequant in VGPR. * fix bug, function pass. * move b thread dequant copy to blockwise. * fix bug, function now passes. * modified the tile size to 256, 128x128x128. * fixed a bug. * Initial int4 moe, compile pass, function not check. * fix bug in moe_gemm1.cpp, now function pass. * test expert = 8 and function pass. * Added moe_pk_i4_gemm2, function pass. * Added b preshuffle pipeline v3 support. * fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass. * Split the blockwise pipeline for fp8xint4. * commit missing files * opt gemm2 to 2x2 wave * fix swizzle = false * update int4 moe with latest input changes. * update tile size. * enable pipeline v3. * fix nswizzle = true * commit a version for compiler debug. * Updated transfer_v3r1_gather to support pk_i4_t type. * for int4 moe2 for type_convert support. * remove some values between mfma instructions. * fix int4 moe * Updated transfer_v3r1_gather to support pk_i4_t type. * i4 support lds multiple shuffle * fixed int4 moe tflops calculation. * Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle * updated gemm2. * change int4 moe example names * fix and format code. * format. * format codes. * update fp8xint4 example tile size. * add <unordered_map> header * fixed. * format. * Added conditional compilation for int4 -> fp8 conversion kernels --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-10 11:16:44 +08:00
Thomas Ning	9d51d17dd0	Fix on the error (#1956 )	2025-03-07 13:43:52 -08:00
Max Podkorytov	9e132eb77c	refactor ck-tile kernel launch (#1925 )	2025-03-07 08:29:40 -08:00
Qianfeng	4f54fa3058	Ck tile/complete k prefetch (#1941 ) * Re-implement qr_ks_vs_async pipeline by using kLoadOnce * Remove last block_sync_lds() in the loop * Tiny adjustment in qr_ks_vs_async pipeline for better performance * Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline * Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline * Use un-rolled gemm for Gemm-0 * Use k0_loops small tile load/store to replace the big tile load/store for K * Remove the commented lines in qx_ks_vs_custom_policy.hpp * Tune the prefetching of V in qr_ks_vs_async pipeline * Move the codes for storing the first v_lds tile some later * Let BlockDropout reuse LDS with V * Switch to separate code blocks according to iteration index * Interleave code blocks for better performance * Move clear_tile(s_acc) for better interleaving * Move code interleaving * Use MakeQDramTileDistribution for q_dram_window * Roll-back to load Q directly from global memory instead of using LDS as intermediary stop * Let V reuse the LDS of K * Use array of tiles to represent Q in vgprs * Use QLoadOnce == false for qr_ks_vs_async pipeline * Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline * Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile() * Move the definition of v_tiles out from the loop * Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile() * Remove using KLoadOnce in qx_ks_vs_custom_policy * Remove un-used get_slice_tile() call * Move the code line of clear_tile(s_acc) * Tune the lines of codes to make them more tidy * Re-arrange the codes before the main-loop * Add comments * Unify the alignment to be 8 for Q/K/V Lds decriptors * Tuning to K pre-loading * Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false * Adjust the pipeline codes * Use NumPrefetchV to separate from NumVLdsBuffers * Tune the location of a scheduler barrier code line * Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths * Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel * Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline * Move the location for store_tile() of first v_tile * Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline * Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<> * Try to fix old bugs in qx_ks_vs_custom_policy * Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler * Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy * Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV) * Tiny Fix (#1888) * Ck tile/paged attention workaround (#1894) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers * Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline * Remove one call of move_tile_window for q_dram_window * Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers() * Tiny fix in qr_ks_vs_whole_k_prefetch pipeline * Adjust the location of codes for storing the first V tile to LDS * Tiny fix and add comments * Change GetSmemKPackK size to improve performance * Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy * Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access * Adjustment in intermediate iteration codes for tiny performance improvement * Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation * Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue * Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer() * Remove useless AsyncopyV * Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used * Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy * Recover the qr_ks_vs_async pipeline * Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline * Revert "Try to fix old bugs in qx_ks_vs_custom_policy" This reverts commit `39b82ca194`. * Tiny fix with regard to whole_k_prefetch pipeline compiling * Update kPadSeqLenK setting in fmha_fwd_kernel * Use q_element_func and k_element_func * Use single q_tile rather than multiple sliced q_tiles * Codes refine according to the comments * Re-format one file * Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true	2025-03-07 14:19:51 +08:00
Juan Manuel Martinez Caamaño	0f62dd9928	Fix typo: v_offset used in initialization of v_offset (#1951 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-06 17:37:29 -08:00
kylasa	66c5f5b0b6	Addressing (Post Merge) code review comments for PR 1845 (#1883 ) * Addressing code review comments. * Addressing code review comments. * Reorganized code for better readability. * add ck_tile gemms for new types in CI * fix jenkins syntax * fix script syntax * Add the test cases back * Address the review comments * Address review comments * clang format * Solve the merging issues * Addressed the comments * clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-03-06 11:40:30 -08:00
carlushuang	c12fb0a624	[CK_TILE][HOTFIX] WA for address space by disable it completely (#1947 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format * WA for address space by disable it completely * hot fix moe gemm2 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com>	2025-03-06 12:01:25 +08:00
Illia Silin	a88bf76ecc	Replace buffer load/store intrinsics with builtins (#1876 ) * replace buffer load/store intrinsics with builtins * fix clang format * replace buffer load/store intrinsics with built-ins in ck_tile * fix clang format * add switch between buffer intrinsics and built-ins * change the builtins threshold to clang20 * fix clang format * fix some compilation errors * revert changes in ck_tile * revert changes in ck_tile * delete all root files and folders when CI completes * try changing the username in CI * fix groovy syntax * add user and group id info to ci dockers * change ownership of all files in CI to jenkins at the end * update changelog	2025-03-05 14:33:28 -08:00
Adam Osewski	4814db3905	[CK TILE] Fix KIterPerInnerLoop for block gemm. (#1934 ) * Fix KIterPerInnerLoop * Fix Kpack and KPerInnerLoop for block universal gemm. * Fix overlooked spelling bugs.	2025-03-05 14:17:44 -08:00
Illia Silin	9b51c08bf7	remove support for gfx940 and gfx941 targets (#1944 ) * remove support for gfx940 and gfx941 targets * update changelog	2025-03-05 11:07:33 -08:00
feli	3786e16375	ck moe gemm implement (#1936 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-05 15:56:55 +08:00
jefyang1	c95bda93ba	Remove CK_USE_AMD_MFMA_GFX950 (#1935 ) * Add runtime check in example_gemm_xdl_streamk for gfx950 * Add runtime check in grouped conv fwd examples for gfx950 * Disable CK_USE_AMD_MFMA_GFX950 * Add new instances for gfx950 * Fix test_gemm_universal on gfx950	2025-03-04 10:32:25 -08:00
Juan Manuel Martinez Caamaño	57bb0e96a4	Missing _ in __HIPCC__ (#1930 )	2025-03-03 08:19:47 -08:00
arai713	fd06ed926c	MIGraphX hipRTC fix (#1923 ) * fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file) * fixed device op error: misplaced header guard * restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation * add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation * fixed conditional error in amd_ck_fp8.hpp * Add MaskOutUpperTriangle as a problem parameter to BatchedGemmSoftmaxGemm and disable tests with MaskOutUpperTriangle==True. Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> --------- Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>	2025-03-03 07:55:05 -08:00
asleepzzz	ef16010273	Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 )" (#1933 ) This reverts commit `020148d0f7`.	2025-03-03 07:17:39 -08:00
Bartłomiej Kocot	1bf29478cd	[CK TILE] Fix double lds in ck tile gemm (#1924 )	2025-02-28 08:07:53 -08:00
Bartłomiej Kocot	0356ee069e	[CK TILE] Gemm pk_int4_t permute B (#1907 ) * [CK TILE] Gemm pk_int4_t permute B * Fixes	2025-02-27 11:01:14 +01:00
Bartłomiej Kocot	bf1e17007e	[CK TILE] Block universal gemm lds<->vgpr optimizations (#1906 ) * [CK TILE] Block universal gemm lds<->vgpr optimizations * Rebase * Fixes	2025-02-27 10:36:28 +01:00
aledudek	c9bcfd755e	[CK_TILE] Add EnvLogging and missing gemm args checks (#1896 ) * [CK_TILE] Add EnvLogging - refactor IsSupported error messages * [CK_TILE] Add EnvLogging - wrap gemm kernel error messages * [CK_TILE] Add EnvLogging - Add missing k_batch args check * [CK_TILE] Add EnvLogging - remove debug log * Add one check * [CK_TILE] EnvLogging - add CK_TILE_ERROR logs * [CK_TILE] EnvLogging quotes fix * [CK_TILE] EngLogging use function instead of macro for err logs * [CK_TILE] EnvLogging - refactor checking env var	2025-02-25 11:48:38 +01:00
carlushuang	353a612b44	[CK_TILE] add moe-sorting MP kernel (#1910 ) * moe sorting ex * fix bug for race condition * fix bug and optimze large expert * fix * optimize with sub_token_oneshot * support skip empty tokens for expert sorting * update moe_sorting * tidy code * support mp kernel * hint mp * remove use less code * porting to example 15 --------- Co-authored-by: valarLip <340077269@qq.com>	2025-02-25 17:56:55 +08:00
Haocong WANG	020148d0f7	[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com>	2025-02-25 15:42:20 +08:00
coconutruben	fcd4a6f3d1	device_prop.hpp - replace map with compile time hash and switch (#1898 ) * device_prop.hpp - replace map with compile time hash and switch Summary: We replace a static const map with a compile time hash function and a switch statement to achieve the same goal: translate names to architectures. Most of these are very old, however the function needs to continue to work. Why? because the static map can cause issues when compiling into libraries that get dynamically loaded/unloaded, leading to memory corruption Test Plan: Running pytorch `torch.compile()` with CK enabled, and seeing it not segfault on the 2nd kernel (1st reload of the library) Reviewers: Subscribers: Tasks: Tags: * clang-format	2025-02-24 09:57:55 -08:00
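
Note on fcd4a6f3d1: the commit message describes replacing a static const map with a compile-time hash function and a switch statement for translating names to architectures. The following is a minimal sketch of that general technique, not the code from `device_prop.hpp`. The `Arch` enum, the choice of FNV-1a as the hash, and the names `fnv1a` and `arch_from_name` are all assumptions made for illustration. It assumes C++17.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical architecture tags, for illustration only.
enum class Arch { Unknown, Gfx908, Gfx90a, Gfx942 };

// 32-bit FNV-1a hash, evaluable at compile time.
// (Assumed hash; the commit does not say which one it uses.)
constexpr std::uint32_t fnv1a(std::string_view s)
{
    std::uint32_t h = 2166136261u; // FNV offset basis
    for(char c : s)
    {
        h ^= static_cast<std::uint8_t>(c);
        h *= 16777619u; // FNV prime; unsigned arithmetic wraps
    }
    return h;
}

// Translate a name to an architecture without any static container.
// Each case label is a compile-time constant, so two labels that hashed
// to the same value would be rejected by the compiler as duplicates.
// The string comparison in each case guards against an unlisted name
// that happens to collide with a listed one.
constexpr Arch arch_from_name(std::string_view name)
{
    switch(fnv1a(name))
    {
    case fnv1a("gfx908"): return name == "gfx908" ? Arch::Gfx908 : Arch::Unknown;
    case fnv1a("gfx90a"): return name == "gfx90a" ? Arch::Gfx90a : Arch::Unknown;
    case fnv1a("gfx942"): return name == "gfx942" ? Arch::Gfx942 : Arch::Unknown;
    default: return Arch::Unknown;
    }
}
```

Because the lookup is a plain function with no global object, nothing is constructed when a library is loaded or destroyed when it is unloaded, which is the failure mode the commit message attributes to the static map.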

722 Commits