composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
aska-0096	80e89ebdd5	minimum reproducable example for warpspecialized scheduling	2025-01-24 08:44:29 +00:00
Haocong WANG	af30d6b614	Merge pull request #1838 from ROCm/cka8w8_uc_newpipe Cka8w8 uc newpipe	2025-01-23 18:28:14 +08:00
aska-0096	add0b2225d	clean the code	2025-01-23 10:26:09 +00:00
aska-0096	115c7505e5	fix Odd Mrepeat number pipelinev3; Add v3 instances to ckProfiler	2025-01-23 08:34:22 +00:00
aska-0096	d47461d727	Add compute-friendly pipeline for bpreshuffle case; remove enable-post-misched=0 flag.	2025-01-22 09:21:53 +00:00
aska-0096	cee23c47ff	tempsave	2025-01-17 06:37:50 +00:00
aska-0096	35ba08646f	fp8 add_rmsnorm_dynamic_dequant	2025-01-10 11:12:16 +00:00
aska-0096	487a05d612	refine blockgemm pipeline version as base struct.	2025-01-08 14:27:42 +00:00
aska-0096	22fe522d0c	optimize software pipeline	2025-01-08 09:28:32 +00:00
aska-0096	0dbe537032	refine weight preshuffle format.	2025-01-02 13:59:58 +00:00
aska-0096	72c1ddacb9	Merge branch 'add_a8w8_preshuffle_ckprofiler' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc	2024-12-31 07:23:50 +00:00
aska-0096	6f24c2d814	disable N, K Padding, splitk enabled	2024-12-31 06:31:06 +00:00
aska-0096	f60f9d5917	sanity pass, most tile size enabled. TODO: NWave!=4	2024-12-30 18:22:08 +00:00
aska-0096	482ca684ba	Merge branch 'dev/a8w8_b_preshuffle' of https://github.com/ROCm/composable_kernel into add_a8w8_preshuffle_ckprofiler	2024-12-30 09:21:35 +00:00
aska-0096	74ef5021b6	tempsave	2024-12-30 09:20:25 +00:00
coderfeli	db84352941	fix warnings and revert cmake and fix clang format	2024-12-30 08:24:50 +00:00
coderfeli	5765ba51ce	auto calculate hard code params	2024-12-30 07:59:47 +00:00
coderfeli	3f9dbcac63	use new pipeline for b preshuffle, run ok; revert olds to fix ckprofiler	2024-12-30 06:52:10 +00:00
coderfeli	54f44e6232	fix brepeat, kloop and lds two buffer; works ok now	2024-12-30 00:25:46 +00:00
coderfeli	2c056624af	fix tail	2024-12-27 08:30:03 +00:00
coderfeli	174b46b04a	add cpu shuffle	2024-12-27 07:31:14 +00:00
coderfeli	c8d9660f3b	using develop branch timer	2024-12-27 06:47:36 +00:00
coderfeli	031ddf356d	fix performance regression on blockgemm v3 pipe	2024-12-27 06:40:43 +00:00
coderfeli	400cac2839	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8	2024-12-27 05:42:38 +00:00
aska-0096	7cec63a631	remove agpr usage when vgpr usage <256	2024-12-27 03:09:26 +00:00
coderfeli	e6f5a78b14	add double buffer scratch	2024-12-26 15:02:04 +00:00
coderfeli	3784329b68	can run	2024-12-26 13:01:07 +00:00
coderfeli	4a1ec81595	add bypass logic and build	2024-12-26 10:05:25 +00:00
aska-0096	1a089f6f88	sanity bug fix	2024-12-26 10:05:17 +00:00
carlushuang	3d15f364b3	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code	2024-12-23 10:59:02 +08:00
Illia Silin	07339c7383	fix typo for CK_USE_OCP_FP8 (#1769 )	2024-12-20 07:52:24 -08:00
carlushuang	1c45ca35dd	hot-fix (#1768 )	2024-12-20 16:40:45 +08:00
Po Yen Chen	37cdbf4f0e	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2024-12-20 14:41:01 +08:00
Mateusz Ozga	e758d006a5	Apply Ck-tile argument parser for vectors [I/O] (#1758 ) * Parser for a vector was added. Additionaly we valid correctnes of numbers * Remove unnecessary comments * Review part 1 * Review part 2 * Add const to variadic lambda * Rename C->K	2024-12-19 17:55:35 +01:00
aledudek	453ca37347	[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730 ) * Gemm Kernel Refactor part1 * Gemm Kernel Refactor common gemm pipeline part2 * [CK TILE] Refactor batched gemm to reuse GemmKernel * [CK TILE] Refactor GemmKernel - review changes part1 * [CK TILE] Refactor GemmKernel - references fix * [CK TILE] Refactor GemmKernel - naming changes, add problem * [CK_TILE] Refactor GemmKernel - update tests * [CK_TILE] Refactor GemmKernel - review changes * [CK_TILE] Refactor GemmKernel - update test * [CK_TILE] Refactor GemmKernel - constness fixes * [CK_TILE] Refactor GemmKernel - update tests	2024-12-18 17:52:46 +01:00
Xiaodong Wang	1c1b336371	Disambiguate bit_cast (#1749 ) Adding namespace to disambiguate with std::bit_cast Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-12-18 18:32:38 +08:00
aledudek	f6c4d614e3	[CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743 ) * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix	2024-12-18 09:45:58 +01:00
Illia Silin	689a5ae45b	Pass build flags to config.h (#1760 ) * pass the build flags to config.h * fix clang format	2024-12-17 10:17:29 -08:00
Max Podkorytov	6ef8d3c295	refactor conditional usage; fix build on rocm6.1 where the reference didn't exist	2024-12-17 08:40:18 -08:00
Adam Osewski	d46196f291	Enhance printing functionality (#1751 ) * Added object print with all template parameters * fix clang format --------- Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-12-17 09:19:44 +01:00
Xu, Shengnan	f57d720c67	added moe interleaving pipeline (#1712 ) * added moe interleaving pipeline * remove redundant code * formater --------- Co-authored-by: root <root@hjbog-srdc-14.amd.com>	2024-12-15 20:13:10 +08:00
Bartłomiej Kocot	4d8fce33dd	Add SplitK support into Batched GEMM V3 (#1729 ) * add bmm api * add bf16 multi_d * add ckProfiler for bf16 * add ckProfiler files * add more instance; fixed 64bit index issue * fixed naming * enabled batched Ds * use long_index for ds offsets * clean * add bmm fp8 ckProfiler * Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/src/profile_gemm_universal_batched.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * clean * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * refactor batch offset func * add splitk suppport into bmm_v3 * clean * clean * format * fixed * fix --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2024-12-13 21:08:35 +01:00
aska-0096	c8c016ddad	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8	2024-12-13 09:18:50 +00:00
chenjun	4e73177684	Ck tile/smoothquant out stride (#1742 ) * add ck_tile/smoothquant out stride parameter * Remove the default stride value --------- Co-authored-by: so <a.com>	2024-12-13 11:53:52 +08:00
aska-0096	e8ca3daf4e	update instances	2024-12-13 03:29:15 +00:00
carlushuang	77a38e0211	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err	2024-12-12 11:54:03 +08:00
Jatin Chaudhary	67497a044d	Make sure we call __hneg with half to remove ambigios error (#1736 )	2024-12-10 08:47:36 -08:00
Bartłomiej Kocot	261f1759de	Support large batch tensors in grouped conv bwd data (#1711 ) * Support large batch tensors in grouped conv bwd data * Fix multiD * fixes * fixes * fixes	2024-12-06 10:55:23 +01:00
Po Yen Chen	58e7f37fc8	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 )	2024-12-06 12:59:58 +08:00
jakpiase	feb9a2bd9b	Add IsSupportedArgument() to gemm_kernel (#1698 ) * add IsSupportedArgument to gemm_kernel * add ut and do some refactoring * switched to ck_tile's integral_constant	2024-12-05 09:02:13 +01:00

1 2 3 4 5 ...

661 Commits