composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 13:11:25 +00:00

Author	SHA1	Message	Date
carlushuang	1c45ca35dd	hot-fix (#1768 )	2024-12-20 16:40:45 +08:00
Po Yen Chen	37cdbf4f0e	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 
4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove 
necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2024-12-20 14:41:01 +08:00
aledudek	453ca37347	[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730 ) * Gemm Kernel Refactor part1 * Gemm Kernel Refactor common gemm pipeline part2 * [CK TILE] Refactor batched gemm to reuse GemmKernel * [CK TILE] Refactor GemmKernel - review changes part1 * [CK TILE] Refactor GemmKernel - references fix * [CK TILE] Refactor GemmKernel - naming changes, add problem * [CK_TILE] Refactor GemmKernel - update tests * [CK_TILE] Refactor GemmKernel - review changes * [CK_TILE] Refactor GemmKernel - update test * [CK_TILE] Refactor GemmKernel - constness fixes * [CK_TILE] Refactor GemmKernel - update tests	2024-12-18 17:52:46 +01:00
Adam Osewski	d46196f291	Enhance printing functionality (#1751 ) * Added object print with all template parameters * fix clang format --------- Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-12-17 09:19:44 +01:00
Xu, Shengnan	f57d720c67	added moe interleaving pipeline (#1712 ) * added moe interleaving pipeline * remove redundant code * formater --------- Co-authored-by: root <root@hjbog-srdc-14.amd.com>	2024-12-15 20:13:10 +08:00
chenjun	4e73177684	Ck tile/smoothquant out stride (#1742 ) * add ck_tile/smoothquant out stride parameter * Remove the default stride value --------- Co-authored-by: so <a.com>	2024-12-13 11:53:52 +08:00
carlushuang	77a38e0211	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err	2024-12-12 11:54:03 +08:00
Po Yen Chen	58e7f37fc8	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 )	2024-12-06 12:59:58 +08:00
jakpiase	feb9a2bd9b	Add IsSupportedArgument() to gemm_kernel (#1698 ) * add IsSupportedArgument to gemm_kernel * add ut and do some refactoring * switched to ck_tile's integral_constant	2024-12-05 09:02:13 +01:00
Mateusz Ozga	4cb3d7d7ea	Ck tile grouped GEMM example (#1713 ) * Ck-tile, impl. grouped gemm * Workspace is allocated by user, and is passed to the function * Prepare test to new api design * Unify GemTransKernelArgs, removing N0 param * Add 1 to dim3 in partitioner * Typo: gem - > gemm --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-12-04 21:40:01 +01:00
Po Yen Chen	126ce85aa1	[CK_TILE] Use 'false' for highest dimension padding flags (#1716 ) * Use 'false' for highest dimension padding flags * Update padding flag of bias	2024-12-04 15:59:58 +08:00
Bartłomiej Kocot	cff7fab798	[CK TILE] Fix universal gemm template keywords (#1704 )	2024-11-29 20:51:09 -08:00
aledudek	78f0fea08e	Ck tile batched gemm example (#1615 ) * [CK Tile] Batched GEMM Example * [CK Tile] Batched GEMM Example - minor refactor * [CK Tile] Batched GEMM Example - README update * [CK Tile] Batched Gemm Example - review changes - Added tensor data layouts as input parameters - Changed structure of Host and Kernel args - Removed bug with invalid vector read on non-contiguous memory * [CK Tile] Batched Gemm Example - remove comment * [CK Tile] Batched Gemm Example - Add GTests part1 * [CK Tile] Batched Gemm Example - GTests part2 + review changes * [CK TILE] Batched GEMM post merge fixes * [CK Tile] Batched GEMM Example - fix pad views	2024-11-29 11:52:18 +01:00
Bartłomiej Kocot	f49b595dc0	[CK TILE] Add gemm compute pipeline v3 (#1661 ) * [CK TILE] Add gemm compute pipeline v3 * Enable universal gemm compute pipeline. * Rename example and add compute pipeline. * Introduce ag bg cr pipeline impl base. * Refactor to reuse code. * Cleaning * Formatting. --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-11-28 17:51:49 +01:00
jakpiase	e7b6286441	Add interwave scheduler for gemm mem pipeline (#1647 ) * add interwave scheduler for gemm mem pipeline * Fix merge artifacts. * Refactor unit tests. * Switch to interwave scheduler for mem example --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-11-27 18:25:07 +01:00
rocking	abae2afc72	support max3 in smoothquant and add+ rmsnorm + rdquant (#1654 ) * Fix cmake example build * Support max3 in smoothquant one pass * support max3 in two pass * support max3 in add_rmsnorm_rdquant	2024-11-27 05:01:15 +08:00
Po Yen Chen	cf2d635ea2	[CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688 ) * Allow getting batch size from splitkv tile partitioner * Fix wrong paged-kvcache impl for group mode * Fix wrong example code for page-kvcache * Undo changes in fmha_fwd.cpp * Always use 2D block table * Add is_gappy kernel argument for paged-kvcache The is_gappy argument is used for differentiating seqstart_k_ptr usage in flash-attention & xformers * Remove out-of-date comments * Remove no-longer used method * Fix wrong # page-block calculation * Fix wrong comment --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2024-11-26 20:37:54 +08:00
Adam Osewski	b6bcd76d88	CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676 ) * Block universal gemm. * Universal block gemm with interwave scheduler - draft. * Refactoring * Move a/b_warp_tiles into BlockGemmImpl * set BlockGemmImpl as a class member * Change tile size for more suitable to memory bound cases. * Introduce kKPerThread to WarpGemm * Add documentation comment. * Fix Interwave scheduler block gemm. * Add compute/memory friendly tile configuration. * Clean * New tile configurations in gemm mem example. * Add more static checks and fix loop order in block gemm. * Add more static checks and use warp gemm mfma dispatcher. * Add default scheduler block gemm. * Remove logging in example.	2024-11-26 08:45:14 +01:00
carlushuang	440e28b08f	[CK_TILE] fused-moe first version (#1634 ) * moe pipeline * update code * compile OK * update * update cpu reference * update pipeline_gemm0 * compiler ok * update pipeline * rename to ex pipeline * block-asm * update * update * update first gemm ok * compute correct * update file structure * update README * update * update * update code * update API * return unsupport case * add comment * update readme * update * uncomment * update * fix build err --------- Co-authored-by: valarLip <340077269@qq.com>	2024-11-26 11:14:56 +08:00
Po Yen Chen	645fe812f6	[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689 ) * Fix mis-matched tuple<> elem types * Rename MakeKargs() as MakeKargsImpl() --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2024-11-25 15:30:35 +08:00
carlushuang	36c7ce4e0e	[CK_TILE]Moe update index (#1672 ) * update MOCK_ID for moe-sorting * add moe-smoothquant * update a comment * fix format * hot fix * update topk in overflow case * update comments * update bf16 cvt --------- Co-authored-by: valarLip <340077269@qq.com>	2024-11-25 13:12:35 +08:00
Qianfeng	ce2bdf42a9	Change in fwd-splitkv kernel to support num_splits=1 case (#1690 ) * Change in fwd-splitkv kernel to support num_splits=1 case * Update in codegen fwd-splitkv to make num_splits > 1 cases pass * Specify instance traits in dispatch * Fix link error for fp8 kernels --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-25 12:31:38 +08:00
schung-amd	ff92222f93	[CK_TILE] MakeKargs overloads for backward compatibility (#1681 ) * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Re-format files using ck_tile remod.py --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-23 06:51:35 +08:00
Po Yen Chen	fb1ccfa9df	[CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678 ) * Generate group mode paged-attn kernel * Enable paged-kvcache + group mode support * Add missing header: fused_moe.hpp * Add comment to explain kernel arg usage * Make error message more clear * Add comment for confusing data member names * Add more comment for confusing variable names * Fix typo in option description	2024-11-21 14:53:10 +08:00
feli	c1f8d53ce8	[Ck_tile] hot fix, fix rpcf param setting err (#1657 ) Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-11-14 14:06:36 +08:00
Bartłomiej Kocot	d20735691c	[CK TILE] Update gemm universal pipeline (#1644 ) * [CK TILE] Update gemm universal pipeline * Fixes * fix * Rebase	2024-11-13 11:46:18 +01:00
Thomas Ning	2b6458ddf2	[CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651 ) * Finished the feature * Modified the test file * Test case update * address comment * Addressed the review comment * Fixed the CI error	2024-11-11 18:08:25 -08:00
valarLip	8ef8a994e7	[CK_TILE] add more stride for layernorm to support non-contiguous Tensor (#1650 ) * [CK_TILE] add more stride for layernorm to support non-contiguous Tensor * align CK coding style * extend strides to layernorm example * clang-format...	2024-11-11 16:02:28 +08:00
Po Yen Chen	13332998a4	Return nullptr when block index is invalid (#1649 )	2024-11-11 09:28:32 +08:00
dummycoderfe	bec6fbc65f	Ck tile/moe sorting (#1624 ) * add moe_sorting & check ok * fix comments & typo * Run remod.py under include/ck_tile & example/ck_tile directories * format codes * fix output ci check bug * fix moe sorting readme and error commit file * use magic div to accelerate compute * add a loop unroll for moe lds ops * add extblocksnel to set zeros for moebufs * [Ck_tile] moe set zero run ok, add size check and fix ref check * [Ck_tile]fix moe_sorting fuse set_zero remod * [Ck_tile] change name style, fix zero buffer size err, change folder * [Ck_tile] moe_sorting: fix name style * [Ck_tile] moe_sorting, remove useless params in traits * [Ck_tile] change outputtile cnt * unit_size; change output buf alloc --------- Co-authored-by: dummycoderfe <noplydummmycoder@163.com> Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-11-09 17:57:27 +08:00
dummycoderfe	686a58a912	[Ck tile] layernorm2d fwd optimize (#1637 ) * optimize small N case using vec io and using rcp div * [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass * [Ck_tile] fix blockSize compute in Generic2dBlockShape * [Ck_tile]fix kfastfdiv template style * [Ck_tile] layernorm, fix style in review --------- Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-11-08 12:28:23 +08:00
carlushuang	cb6c5d39dc	[CK_TILE] layernorm have more accurate residual (#1623 ) * more accurate residual * modify comment * Fix literal case in README.md --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-11-02 13:30:16 +08:00
rocking	fbd654545a	[Ck_tile] smoothquant (#1617 ) * fix compile error * fix typo of padding * Add smoothquant op * Add smoothquant instance library * refine type * add test script * Re-generate smoothquant.hpp * Always use 'current year' in copyright * use Generic2dBlockShape instead * Add vector = 8 instance back * Find exe path automatically * Simplify the api condition * Remove debugging code * update year * Add blank line between function declaration * explicitly cast return value to dim3 * refine return value * Fix default warmup and repeat value * Add comment * refactor smoothquant cmake * Add README * Fix typo --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>	2024-11-01 13:51:56 +08:00
carlushuang	550248deec	[layernorm] hot fix (#1620 ) * hot fix ln * some rename	2024-11-01 11:52:50 +08:00
carlushuang	c3a4800c5f	[CK_TILE] layernorm support fused-quant/fused-add (#1604 ) * add prenorm/postnorm support, refactor using generate.py * update README * update README * fix format * update some description and fix format * update format * format * use non-raw for loading * format and update n4096 * dynamic-quant ready * update readme * support fused dynamic-quant * update fused-quant, with smooth * update README * update args * update some based on comment	2024-10-31 14:54:53 +08:00
Bartłomiej Kocot	9a8a52130d	Remove virtual destructors from unary ops (#1610 ) * Remove virtual destructors from unary ops * Fixes * Fixes * clang format fixes	2024-10-30 17:42:50 +01:00
rocking	7d9111545f	clang-format (#1612 )	2024-10-30 08:13:30 -07:00
Adam Osewski	24d996aae1	[CK-Tile] Universal gemm memory bound pipeline (#1558 ) * CK-Tile GEMM with memory bound pipeline. * Memory bound gemm pipeline. * Fix not closed namespace. * Block gemm mem pipeline draft. * Do not use ck_tile:: within ck_tile namespace. * Refactoring & Move Layout info to pipeline problem. * Get hot loop and TailNum information before launching kernel. * Fixes in pipeline. * Add comment to load_tile_raw and change variable naming style. * Few small changes & formatting. * Do not use macro. * Add gtests. * Use AccDataType for Output of MFMA instruction. * Formatting. * Refactor gemm examples. * Switch over to current block gemm. * Use currently available pipeline policy. * Refactoring and review comments. * Fixes after merge. * Add missing include. * Add load tile overload which accepts output tensor as parameter. * This gives 8% perf boost at the cost of using more registers. * Rename example. * Small changes. * Fix compilation err and lower K. * Support different layouts for A/B * Fix vector size for different layouts. * Rename Alignment into VectorSize * Unblock tests.	2024-10-30 10:05:15 +01:00
rocking	3d60953477	[Ck tile] support rmsnorm and related fusion (#1605 ) * Add reduce2d new api * Prevent user use cross warp reduction * Fix bug of std calculation * Add rmsnorm2d * Add rmsnorm small example * Remove static assert to prevent compile fail * Add script to test performance and correctness * Add missing cmake change * refine naming * refine example of rmsnorm * Fix bug of rmsnorm * Refine naming * Fix cmake * clang format * Refine pipeline name * Add add_rmsnorm2d_rdquant kernel * Add reduce op * host verification * Fix bug of one pass pipeline * Refine tile size * Add two pass pipeline * Rename two pass to three pass * Fix bug of kSaveX == false * Add instance library * Add test script * Fix bug of x verification * Add save_x to trait * Add README * Move reduce2d into reduce folder * Fix bug of welford when number of m warp > 1 * remove redundant comment * 1. move 06_rmsnorm2d to 10_rmsnorm2d 2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant * clang format and add missing header * Add host validation of add + layernorm2d + rsquant * Revert "Add host validation of add + layernorm2d + rsquant" This reverts commit `936cb45797`. * Remove deprecated flag	2024-10-30 15:22:56 +08:00
Qianfeng	8632221814	[CK_TILE] Add fmha fwd headdim96 support (#1608 ) * Add ceil_to_qualified_tile_length() * Rename kK0BlockLength to kQKHeaddim * Add kSubQKHeaddim concept to support headdim96 * Fix in math.hpp to avoid using __half interfaces * Add LdsBufferSequence instance for headdim96 * Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing * Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time * Reformat one file * Fix text alignment in fmha_fwd_splitkv.py --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-30 14:03:16 +08:00
valarLip	9fbd72e97e	[CK_TILE] add generic_permute (#1607 )	2024-10-29 18:05:53 +08:00
carlushuang	b098b71b05	topk_softmax (#1592 ) * topk_softmax * remove some file * fix atomic linear_offset * address various comment, and change sfc get_index api to static(tuple)	2024-10-26 23:52:49 +08:00
Po Yen Chen	54f0e6f4bb	[CK_TILE] More fmha splitkv optimizations (#1588 ) * Use pre-defined constants for readability * Use vector write for o_acc tensor * Remove no-longer used policy method * Deprecate no-longer used policy/pipeline * Specify gemm0/gemm1 block warps separately in codegen * Fix wrong ps_idx creation logic * Add single-warp block gemm * Support single-warp gemm0 * Make MakeCBlockTile() as static method * Use MakeCBlockTile() to get underlying tile distribution * Use kNumGemm1Warps to compute # threads for gemm1 * Put normal case in the if clause * Refine fmha splitkv block mapping * Refine & fix the lse_acc/o_acc layout * Fix wrong LDS size for K tile * Use kK0=64 for hdim=128,256 fmha splitkv kernels * Use kK1=64 for hdim=32,64,128 fmha splitkv kernels * Undo kK0/kK1 changes * Use more reasonable GetAlignmentV() computation * Using store_tile() in fmha splitkv kernel epilogue	2024-10-26 18:35:45 +08:00
dummycoderfe	9183ce69ca	hot_fix epsilon pos (#1597 ) Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-10-25 11:17:45 +08:00
ltqin	0394f8a713	update layernorm (#1570 ) * port layernorm * change warp_welford.hpp * Update warpshuffle * 1. Add save mean and save std back 2. Move construction of tensor_view and tile_window to operator() * refine welford max count calculation * unify layernorm api * Rename file * Remove save mean and inv std * Revert "refine welford max count calculation" This reverts commit `022365802b`. * Fix order of parameter * refine welford max count calculation again * Remove fp32 instances * Fix bug of padding * refactor api * Support bf16 * Extract common function * Refine arg of operator() * Add kMThreadPerBlock to template parameter * clang format * Refine variable name * Refine file name * remove redundant line * refactor layernorm2d pipeline and add block-per-block utility * fix name * rename more * add more block-per-tile instance * remove duplicated define * update instance for 2048, 1024 case * support up to 2048 now * opt loading * add n1536 * Add two pass pipeline * format * Fix incorrect type * parallel compilation * Use smaller N * fix 2p pass * Support Repeat_M in distribution * Refine naming * Add reduce example --------- Co-authored-by: letaoqin <letaoqin@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-10-22 09:26:18 +08:00
Po Yen Chen	95e722a3b3	[CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577 ) * Use smaller width for lse_accum dist tensor * Update pipeline comment * Fix wrong distribution for lse_accum * Remove duplicate dim in lse_accum dist encoding * Decide fmha splitkv combine kernel kBlockSize by kM0 * Remove assumption of MPerThread=1 * Add log<4> & log<8> specialization * Enlarge occupancy array * Fix vector size for small tile * Add support for kMaxSplits=8 * Re-format gemm.hpp * Use 16x16x16 warp gemm for fwd_splitkv * Centralize policy code changes * Leave fp8/bf8 tile settings unchanged	2024-10-21 10:52:11 +08:00
Qianfeng	14c3cfb1c6	[CK_TILE] Improve headdim96 performance for fmha-bwd (#1573 ) * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96 * Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor * Fix in bwd_piple_default_policy * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim * Update to hd96 tile settings * Add smoke test scripts for fmha-bwd hd96 * Revert "Add smoke test scripts for fmha-bwd hd96" This reverts commit `7ca7e1a93d`. * Remove hd96 tile settings in fmha_bwd codegen to save compiling * Fix lost code line in bwd_pipeline_default_policy * Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor * tiny adjustments --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: danyao12 <Dan.Yao@amd.com>	2024-10-16 18:14:32 +08:00
Bartłomiej Kocot	d02a92cc0d	[CK_TILE] Add block universal gemm pipeline policy (#1557 ) * [CK_TILE] Add block universal gemm pipeline policy * Fixes * fixes2 * Fixes3 * fixeS	2024-10-15 13:53:41 +02:00
Thomas Ning	35c1777d59	decouple the calling from gemm_pipeline (#1571 ) * decouple the calling from gemm_pipeline * clang format	2024-10-14 13:59:26 +08:00
Thomas Ning	6f27bc9872	Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535 ) * Make the cshuffle compilable * modify the reference on gpu and cpu. Correct access of cshuffle * fix the cpu reference code * Complete the in tile shuffle logic * restructure the kernel template input * change the naming pattern of ck_tile gemm pipeline * Re-format files using remod.py * Solve the fmha conflict with gemm * Comment Addressed from Carlus --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>	2024-10-10 18:02:22 +08:00

279 Commits