composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Max Podkorytov	66c5b715c9	remove dead code	2025-01-08 10:09:54 -08:00
Max Podkorytov	edb78a4729	clang-format and remove dead code	2025-01-08 10:09:54 -08:00
Max Podkorytov	60113859fa	roll back splitkv	2025-01-08 10:09:54 -08:00
Max Podkorytov	bfc997a7e6	update qsksvs pipeline	2025-01-08 10:09:54 -08:00
Max Podkorytov	f7942b993c	qsksvs pipeline changes to mirror qrksvs	2025-01-08 10:09:54 -08:00
AMD-dteng	d5c8a334ca	enable bias feature that add bias before adding residual (for rtpllm project) (#1741 ) * 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16 * delete comment * 1.remove fmha change 2.change buffer name from bias to xbias * Now bias can be used independently from fadd * change kbias to kxbias --------- Co-authored-by: feli <felix.li@amd.com>	2025-01-08 17:51:06 +08:00
Po Yen Chen	24b12d04af	[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789 ) * Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag Kernel user can use this switch to turn on/off optimization for some problem sizes * Add missing flag declarations * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when need * Fix group mode block skip logics * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2025-01-07 18:49:24 +08:00
carlushuang	6df5fe2ad8	[CK_TILE]naive attn support FP8 KVCache quant (#1747 ) * quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-01-03 18:43:07 +08:00
feli	4bc610416a	Ck tile/layernorm: implement naive reduce, opt performance (#1784 ) * add no welford * enable output raw * raw of int8 * fix build * fix smoke test err * [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate * [cktile]layernorm, fix err commit files and remove uselss * fix quant 8192 err & change norm_reduce class and file name --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2025-01-03 14:28:59 +08:00
Qianfeng	4e076909b6	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex	2024-12-29 14:29:56 +08:00
Bartłomiej Kocot	af66494880	[CK TILE] GEMM and Batched GEMM SplitK support (#1724 ) * [CK TILE] Add split K support in GEMM * Updates * Fixes * rebase * fix * Fix * fixes * support for batched gemm	2024-12-28 14:40:17 +01:00
carlushuang	3d15f364b3	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code	2024-12-23 10:59:02 +08:00
carlushuang	1c45ca35dd	hot-fix (#1768 )	2024-12-20 16:40:45 +08:00
Po Yen Chen	37cdbf4f0e	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2024-12-20 14:41:01 +08:00
Mateusz Ozga	e758d006a5	Apply Ck-tile argument parser for vectors [I/O] (#1758 ) * Parser for a vector was added. Additionaly we valid correctnes of numbers * Remove unnecessary comments * Review part 1 * Review part 2 * Add const to variadic lambda * Rename C->K	2024-12-19 17:55:35 +01:00
aledudek	453ca37347	[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730 ) * Gemm Kernel Refactor part1 * Gemm Kernel Refactor common gemm pipeline part2 * [CK TILE] Refactor batched gemm to reuse GemmKernel * [CK TILE] Refactor GemmKernel - review changes part1 * [CK TILE] Refactor GemmKernel - references fix * [CK TILE] Refactor GemmKernel - naming changes, add problem * [CK_TILE] Refactor GemmKernel - update tests * [CK_TILE] Refactor GemmKernel - review changes * [CK_TILE] Refactor GemmKernel - update test * [CK_TILE] Refactor GemmKernel - constness fixes * [CK_TILE] Refactor GemmKernel - update tests	2024-12-18 17:52:46 +01:00
Xiaodong Wang	1c1b336371	Disambiguate bit_cast (#1749 ) Adding namespace to disambiguate with std::bit_cast Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-12-18 18:32:38 +08:00
aledudek	f6c4d614e3	[CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743 ) * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix	2024-12-18 09:45:58 +01:00
Adam Osewski	d46196f291	Enhance printing functionality (#1751 ) * Added object print with all template parameters * fix clang format --------- Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-12-17 09:19:44 +01:00
Xu, Shengnan	f57d720c67	added moe interleaving pipeline (#1712 ) * added moe interleaving pipeline * remove redundant code * formater --------- Co-authored-by: root <root@hjbog-srdc-14.amd.com>	2024-12-15 20:13:10 +08:00
chenjun	4e73177684	Ck tile/smoothquant out stride (#1742 ) * add ck_tile/smoothquant out stride parameter * Remove the default stride value --------- Co-authored-by: so <a.com>	2024-12-13 11:53:52 +08:00
carlushuang	77a38e0211	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err	2024-12-12 11:54:03 +08:00
Po Yen Chen	58e7f37fc8	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 )	2024-12-06 12:59:58 +08:00
jakpiase	feb9a2bd9b	Add IsSupportedArgument() to gemm_kernel (#1698 ) * add IsSupportedArgument to gemm_kernel * add ut and do some refactoring * switched to ck_tile's integral_constant	2024-12-05 09:02:13 +01:00
Mateusz Ozga	4cb3d7d7ea	Ck tile grouped GEMM example (#1713 ) * Ck-tile, impl. grouped gemm * Workspace is allocated by user, and is passed to the function * Prepare test to new api design * Unify GemTransKernelArgs, removing N0 param * Add 1 to dim3 in paritioner * Typo: gem - > gemm --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-12-04 21:40:01 +01:00
Po Yen Chen	126ce85aa1	[CK_TILE] Use 'false' for highest dimension padding flags (#1716 ) * Use 'false' for highest dimension padding flags * Update padding flag of bias	2024-12-04 15:59:58 +08:00
Bartłomiej Kocot	5affda819d	Add basic documentation structure (#1715 ) * Add basic documentation structure * Add terminology placeholder * Add codegen placeholder * Create template for each page	2024-12-04 00:46:47 +01:00
Bartłomiej Kocot	cff7fab798	[CK TILE] Fix universal gemm template keywords (#1704 )	2024-11-29 20:51:09 -08:00
aledudek	78f0fea08e	Ck tile batched gemm example (#1615 ) * [CK Tile] Batched GEMM Example * [CK Tile] Batched GEMM Example - minor refactor * [CK Tile] Batched GEMM Example - README update * [CK Tile] Batched Gemm Example - review changes - Added tensor data layours as input parameters - Changed structure of Host and Kernel args - Removed bug with invalid vector read on non-contiguous memory * [CK Tile] Batched Gemm Example - remove comment * [CK Tile] Batched Gemm Example - Add GTests part1 * [CK Tile] Batched Gemm Example - GTests part2 + review changes * [CK TILE] Batched GEMM post merge fixes * [CK Tile] Batched GEMM Example - fix pad views	2024-11-29 11:52:18 +01:00
Bartłomiej Kocot	f49b595dc0	[CK TILE] Add gemm compute pipeline v3 (#1661 ) * [CK TILE] Add gemm compute pipeline v3 * Enable universal gemm compute pipeline. * Rename example and add compute pipeline. * Introduce ag bg cr pipeline impl base. * Refactor to reuse code. * Cleaning * Formatting. --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-11-28 17:51:49 +01:00
jakpiase	e7b6286441	Add interwave scheduler for gemm mem pipeline (#1647 ) * add interwave scheduler for gemm mem pipeline * Fix merge artifacts. * Refactor unit tests. * Switch to interwave scheduler for mem example --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-11-27 18:25:07 +01:00
rocking	abae2afc72	support max3 in smoothquant and add+ rmsnorm + rdquant (#1654 ) * Fix cmake example build * Support max3 in smoothquant one pass * support max3 in two pass * support max3 in add_rmsnorm_rdquant	2024-11-27 05:01:15 +08:00
Po Yen Chen	cf2d635ea2	[CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688 ) * Allow getting batch size from splitkv tile partitioner * Fix wrong paged-kvcache impl for group mode * Fix wrong example code for page-kvcache * Undo changes in fmha_fwd.cpp * Always use 2D block table * Add is_gappy kernel argument for paged-kvcache The is_gappy argument is used for differentiating seqstart_k_ptr usage in flash-attention & xformers * Remove out-of-date comments * Remove no-longer used method * Fix wrong # page-block calculation * Fix wrong comment --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2024-11-26 20:37:54 +08:00
Adam Osewski	b6bcd76d88	CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676 ) * Block universal gemm. * Universal block gemm with interwave scheduler - draft. * Refactoring * Move a/b_warp_tiles into BlockGemmImpl * set BlockGemmImpl as a class member * Change tile size for more suitable to memory bound cases. * Introduce kKPerThread to WarpGemm * Add documentation comment. * Fix Interwave scheduler block gemm. * Add compute/memory friendly tile configuration. * Clean * New tile configurations in gemm mem example. * Add more static checks and fix loop order in block gemm. * Add more static checks and use warp gemm mfma dispatcher. * Add default scheduler block gemm. * Remove logging in example.	2024-11-26 08:45:14 +01:00
carlushuang	440e28b08f	[CK_TILE] fused-moe first version (#1634 ) * moe pipeline * update code * compile OK * update * update cpu reference * update pipeline_gemm0 * compiler ok * update pipeline * rename to ex pipeline * block-asm * update * update * update first gemm ok * compute correct * update file structure * update README * update * update * update code * update API * return unsupport case * add comment * update readme * update * uncomment * update * fix build err --------- Co-authored-by: valarLip <340077269@qq.com>	2024-11-26 11:14:56 +08:00
Po Yen Chen	645fe812f6	[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689 ) * Fix mis-matched tuple<> elem types * Rename MakeKargs() as MakeKargsImpl() --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2024-11-25 15:30:35 +08:00
carlushuang	36c7ce4e0e	[CK_TILE]Moe update index (#1672 ) * update MOCK_ID for moe-sorting * add moe-smoothquant * update a comment * fix format * hot fix * update topk in overflow case * update comments * update bf16 cvt --------- Co-authored-by: valarLip <340077269@qq.com>	2024-11-25 13:12:35 +08:00
Qianfeng	ce2bdf42a9	Change in fwd-splitkv kernel to support num_splits=1 case (#1690 ) * Change in fwd-splitkv kernel to support num_splits=1 case * Update in codegen fwd-splitkv to make num_splits > 1 cases pass * Specify instance traits in dispatch * Fix link error for fp8 kernels --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-25 12:31:38 +08:00
schung-amd	ff92222f93	[CK_TILE] MakeKargs overloads for backward compatibility (#1681 ) * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Re-format files using ck_tile remod.py --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-23 06:51:35 +08:00
Po Yen Chen	fb1ccfa9df	[CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678 ) * Generate group mode paged-attn kernel * Enable paged-kvcache + group mode support * Add missing header: fused_moe.hpp * Add comment to explain kernel arg usage * Make error message more clear * Add comment for confusing data member names * Add more comment for confusing variable names * Fix typo in option description	2024-11-21 14:53:10 +08:00
feli	c1f8d53ce8	[Ck_tile] hot fix, fix rpcf param setting err (#1657 ) Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-11-14 14:06:36 +08:00
Bartłomiej Kocot	d20735691c	[CK TILE] Update gemm universal pipeline (#1644 ) * [CK TILE] Update gemm universal pipeline * Fixes * fix * Rebase	2024-11-13 11:46:18 +01:00
Thomas Ning	2b6458ddf2	[CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651 ) * Finished the feature * Modified the test file * Test case update * addresss comment * Addressed the review comment * Fixed the CI error	2024-11-11 18:08:25 -08:00
valarLip	8ef8a994e7	[CK_TILE] add more stride for layernorm to support un-continuous Tensor (#1650 ) * [CK_TILE] add more stride for layernorm to support un-continuous Tensor * align CK coding style * extend strides to layernrom expample * clang-format...	2024-11-11 16:02:28 +08:00
Po Yen Chen	13332998a4	Return nullptr when block index is invalid (#1649 )	2024-11-11 09:28:32 +08:00
dummycoderfe	bec6fbc65f	Ck tile/moe sorting (#1624 ) * add moe_sorting & check ok * fix comments & typo * Run remod.py under include/ck_tile & example/ck_tile directories * format codes * fix output ci check bug * fix moe sorting readme and error commit file * use magiv div to accelerate compute * add an loop unroll for moe lds ops * add extblocksnel to set zeros for moebufs * [Ck_tile] moe set zero run ok, add size check and fix ref check * [Ck_tile]fix moe_sorting fuse set_zero remod * [Ck_tile] change name style, fix zero buffer size err, change folder * [Ck_tile] moe_sorting: fix name style * [Ck_tile] moe_sorting, remove useless params in traits * [Ck_tile] change outputtile cnt * unit_size; change output buf alloc --------- Co-authored-by: dummycoderfe <noplydummmycoder@163.com> Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-11-09 17:57:27 +08:00
dummycoderfe	686a58a912	[Ck tile] layernorm2d fwd optimize (#1637 ) * optimze small N case using vec io and using rcp div * [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass * [Ck_tile] fix blockSize compute in Generic2dBlockShape * [Ck_tile]fix kfastfdiv template style * [Ck_tile] layernorm, fix stype in review --------- Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-11-08 12:28:23 +08:00
Illia Silin	75c5bfa364	enable compilation for generic navi targets (#1645 )	2024-11-07 14:14:42 -08:00
carlushuang	cb6c5d39dc	[CK_TILE] layernorm have more accurate residual (#1623 ) * more accurate residual * modify comment * Fix literal case in README.md --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-02 13:30:16 +08:00
rocking	fbd654545a	[Ck_tile] smoothquant (#1617 ) * fix compile error * fix typo of padding * Add smoothquant op * Add smoothquant instance library * refine type * add test script * Re-generate smoothquant.hpp * Always use 'current year' in copyright * use Generic2dBlockShape instead * Add vector = 8 instance back * Find exe path automatically * Simplify the api condition * Remove debugging code * update year * Add blank line between function declaration * explicitly cast return value to dim3 * refine return value * Fix default warmup and repeat value * Add comment * refactor sommthquant cmake * Add README * Fix typo --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>	2024-11-01 13:51:56 +08:00

109 Commits