composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
deepsek	dde428cdf9	Added bf16 instances grouped gemm fixed nk (#1825 ) * Feat: Add bf16 input instances * feat: Add BF16 profiler code * fix: reorder enum types * fix: CI fail due to clang-format * fix: clang script format issue * fix: clang format broke cmakelist file [ROCm/composable_kernel commit: `e7dce4d247`]	2025-01-20 09:13:09 -08:00
lucbruni-amd	c54cff82f0	Add CK_TIME_KERNEL as toggleable CMake Variable (#1794 ) * Disable CK_TIME_KERNEL by Default, Add as CMake Variable * Enable CK_TIME_KERNEL by Default, Maintaining CMake Variable Functionality. * Fix build error. [ROCm/composable_kernel commit: `3fb2f5acc7`]	2025-01-20 07:09:19 -08:00
Mingtao Gu	2488221b57	fix a bug for int4 scale weight only kernel (#1820 ) Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `86d1b46aa6`]	2025-01-19 11:18:18 +08:00
Bartłomiej Kocot	7e74e0071e	[CK_TILE] Add error threshold calculation for gemm examples (#1821 ) [ROCm/composable_kernel commit: `bdddf1eace`]	2025-01-18 01:01:52 +01:00
deepsek	7290b1a8dd	fix: preprocessor directives logic error if/else (#1764 ) * fix: preprocessors logic error if/else * fix: added macros as preferred by CK team [ROCm/composable_kernel commit: `0fcbb25f70`]	2025-01-16 20:31:15 -08:00
Aviral Goel	ebe7a75ab4	Implementing Test Filters for Smoke and Regression Tests (#1819 ) * smoke and regression targets working with tests * test filters work for both examples and test * removed uneccesary comments * added a missing comment * added a missing comment * fixed typo in the comments * updated README * Update PULL_REQUEST_TEMPLATE.md updating the template for future addition of test cases * Update PULL_REQUEST_TEMPLATE.md [ROCm/composable_kernel commit: `54de3e55e1`]	2025-01-16 16:40:08 -08:00
Bartłomiej Kocot	bddf6109f0	Fix and optimize dynamic unary elementwise (#1818 ) * Fix and optimize dynamic unary elementwise * fix [ROCm/composable_kernel commit: `1519ce91a3`]	2025-01-16 13:48:39 -08:00
carlushuang	21264b4e60	[CK_TILE] Fix mock token id, support g1u1/g1u0 through same inline code block (#1808 ) * fix mock token id * prepare host for g1u1 * reformat inline-asm * restructure uk_0 * restructure gate_up * done * change default to init=1 * update readme * fix a bug in interleave pipeline * rcp for silu [ROCm/composable_kernel commit: `1ff50e78c6`]	2025-01-16 17:51:10 +08:00
Illia Silin	11eab9ca17	disable inductor codegen tests on legacy OS (#1816 ) [ROCm/composable_kernel commit: `8c29e06f3c`]	2025-01-15 12:11:54 -08:00
Bartłomiej Kocot	dbdb10573e	Add rounding for float to bf16 conversion as default (#1812 ) * Add rounding for float to bf16 conversion * Add bhalf test * Add inf test bhalf * Refactor * update cmake * Fixes [ROCm/composable_kernel commit: `7790e8c3f7`]	2025-01-15 16:41:21 +01:00
ruanjm	e540af436e	[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802 ) * Add shortcut to RMSNorm * Modify test for adding shortcut for RMSNorm * Add fused parameter into tests * 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp * 1. Supports various stride and percisions. * Add support of Epilogue * Add fuse and epilogue support to rmsnorm ref * Modify rmsnorm example * Refactor tests/examples * Bug fix for newly added tests/examples * Bug fix for new tests 2 * Modify smoke test scripts remove dbg code * Supports non-smooth dyanmic quant * Update Rmsnorm2dFwd::GetName() * rename xscale and prec_sx to smoothscale and prec_sm Bug fix after rename Remove files * change example_rmsnorm2d_fwd.cpp * update performance calculator * Fix issue in two-pass when fuse add is enabled * Remove comment of beta --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `04dd314883`]	2025-01-15 10:23:48 +08:00
Max Podkorytov	971203e15a	fix parsing instances for pt inductor (#1796 ) add unit test for gen instances for gemms add unit tests for conv and batched gemms add unit test for preselected gemm instances apply ruff lint add license header for the unit test add inductor pytest to CI verbose pip install switch the directory before installing python packages move the inductor codegen test try yet another workdir Update Jenkinsfile The directory looks right, fixing pip module not found by invoking pip directly Update Jenkinsfile invoke pytest directly since the module is not found Update Dockerfile Install setuptools update package structure bump setuptools maybe fix data path for library sources fix library search path for conv instances fix path in pyproject definition compare path used in gen_instances with one in pyproject.toml; fix the difference Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `c0b90f130f`]	2025-01-13 13:51:08 -08:00
feli	7ef740e21b	Dev/merge u8w8 (#1774 ) * port tiles from a8w8 * rm debug used files * add instances * remove all non gemm in cmake * merge; impl fp16 * recover cmake from develop * add missed files; fix clang format --------- Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `53ab1b9047`]	2025-01-13 10:25:14 -08:00
Thomas Ning	c3533b8074	CK Tile GEMM CICD fixed & register block method refactor (#1776 ) * refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm * Finished the 2x2 warp gemm policy and the block selection mechanism * Clang format * address poyen's comment * Address feedbacks * Fixed the compilation issue * Change the function name [ROCm/composable_kernel commit: `5d671a5fc4`]	2025-01-13 13:10:44 +08:00
ClementLinCF	8549197d6e	[CK_TILE] Adjust kBlockSize of reduce example for better perf (#1779 ) * Observed a 2x perf improvement with kBlockSize = 256 * Using 512 threads may lead to redundant computations [ROCm/composable_kernel commit: `0b8f117f1a`]	2025-01-12 20:50:32 -08:00
Qianfeng	83456e0a46	Update for fmha_fwd qs_ks_vs pipeline (#1810 ) * Update for fmha_fwd qs_ks_vs pipeline * Remove _builtin_amdgcn_sched_barrier(0) * Move p_compute to p converting earlier for trying to increase vgprs re-using * Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation * Re-add __builtin_amdgcn_sched_barrier(0) --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `3d50f57f43`]	2025-01-13 12:43:05 +08:00
Bartłomiej Kocot	d63ce9c186	Grouped convolution backward weight special vector size loads (#1772 ) * Grouped convolution backward weight special vector size loads * Instnaces and tests * Fixes * Add 7 and 13 special cases * fix comments * Fix * Fix2 * fixes * fix atomic add bf16 [ROCm/composable_kernel commit: `fd46a01d8b`]	2025-01-10 22:02:30 +08:00
Thomas Ning	f273d0a699	Ck tile/gemm perf measure (#1750 ) * Finished adding the performance benchmark for ck tile gemm * Fix the executable rename problem * fix the executable name error * delete the unsupported layout combinations * Update run_full_test.sh * Update benchmark_mem_pipeline.sh * Update benchmark_basic.sh * change the executable of gemm_universal * change ck_tile_gemm script permissions * Addressed the comment * Addressed the comment * Fixed the comments * Fixed Comment * roll back the malfunctioned change * Fix the Typo * finalize the tile_gemm_fp16 performance monitoring * fix the stash names for ck_tile gemm logs * change the stashing logic * change stashing syntax --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `73a076eee1`]	2025-01-09 17:41:49 -08:00
darren-amd	1db3581b08	Disable building DPP kernels by default (#1804 ) * Disable building DPP kernels by default * Disable building dpp instances, examples, or tests if DPP_KERNELS is not set * Add new DPP_KERNELS flag to readme [ROCm/composable_kernel commit: `26b3829c02`]	2025-01-08 13:50:42 -05:00
Max Podkorytov	373f4a70d7	mark unused args [ROCm/composable_kernel commit: `ad697c78ac`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	b0723a590b	run clang-format -style=file [ROCm/composable_kernel commit: `a2e6ad62e2`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	ac14cbd12b	run clang-format==12 [ROCm/composable_kernel commit: `aa59ecaa22`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	0e72a941ab	update comment in the policy [ROCm/composable_kernel commit: `82fb3f84fb`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	fe8f764cf1	update qsksvs comment [ROCm/composable_kernel commit: `4daa82b451`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	cede3a6f86	remove dead code [ROCm/composable_kernel commit: `66c5b715c9`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	b5feb60a35	clang-format and remove dead code [ROCm/composable_kernel commit: `edb78a4729`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	62b1f2e830	roll back splitkv [ROCm/composable_kernel commit: `60113859fa`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	63fdc5c84e	update qsksvs pipeline [ROCm/composable_kernel commit: `bfc997a7e6`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	e74ae6664a	qsksvs pipeline changes to mirror qrksvs [ROCm/composable_kernel commit: `f7942b993c`]	2025-01-08 10:09:54 -08:00
AMD-dteng	723ccf741e	enable bias feature that add bias before adding residual (for rtpllm project) (#1741 ) * 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16 * delete comment * 1.remove fmha change 2.change buffer name from bias to xbias * Now bias can be used independently from fadd * change kbias to kxbias --------- Co-authored-by: feli <felix.li@amd.com> [ROCm/composable_kernel commit: `d5c8a334ca`]	2025-01-08 17:51:06 +08:00
spolifroni-amd	aae38551a9	Update LICENSE to 2025 (#1797 ) [ROCm/composable_kernel commit: `a6b761c39a`]	2025-01-07 08:29:40 -08:00
dependabot[bot]	509f31a21b	Bump rocm-docs-core from 1.12.1 to 1.13.0 in /docs/sphinx (#1798 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.1 to 1.13.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.1...v1.13.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `9f6bf9ab26`]	2025-01-07 08:26:31 -08:00
Po Yen Chen	9e1b6988b8	[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789 ) * Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag Kernel user can use this switch to turn on/off optimization for some problem sizes * Add missing flag declarations * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when need * Fix group mode block skip logics * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com> [ROCm/composable_kernel commit: `24b12d04af`]	2025-01-07 18:49:24 +08:00
Bartłomiej Kocot	f8edc6917f	Fix universal gemm profiler for pk_i4_t (#1790 ) * Fix universal gemm profiler for pk_i4_t * fix [ROCm/composable_kernel commit: `888317e698`]	2025-01-04 14:01:33 +01:00
dependabot[bot]	ea043b6632	Bump rocm-docs-core from 1.12.0 to 1.12.1 in /docs/sphinx (#1788 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.12.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.12.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `37b3514648`]	2025-01-03 17:47:48 -08:00
Illia Silin	c9cfd80732	terminology clean-up (#1792 ) [ROCm/composable_kernel commit: `8ea375bb58`]	2025-01-03 16:38:22 -08:00
carlushuang	7804fe97aa	[CK_TILE]naive attn support FP8 KVCache quant (#1747 ) * quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `6df5fe2ad8`]	2025-01-03 18:43:07 +08:00
Mingtao Gu	857c15e8d6	Implement the fp16xint4 scale weight only kernel for Ali (#1786 ) * enable int4 scale (weight only) kernel * format some files * Add unit test for int4 weight only * fixed and formatted code * fixed * formated * formated * fixed * fixed a bug in the ckProfiler, and formatted the code --------- Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `4f62f6e9b7`]	2025-01-03 18:35:21 +08:00
feli	b78e94cc81	Ck tile/layernorm: implement naive reduce, opt performance (#1784 ) * add no welford * enable output raw * raw of int8 * fix build * fix smoke test err * [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate * [cktile]layernorm, fix err commit files and remove uselss * fix quant 8192 err & change norm_reduce class and file name --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `4bc610416a`]	2025-01-03 14:28:59 +08:00
John Afaganis	fb0cc96d31	Add afagaj to CODEOWNERS (#1787 ) [ROCm/composable_kernel commit: `17e8efb573`]	2025-01-02 20:50:07 -06:00
Muhammed Emin Ozturk	6c518cbc76	BF16 GEMM Stream-K (#1541 ) * initial * Cmake file * successfull compilation but validation failed * Cmake * update * gpu validation * gemm universal * gemm universal sk update * sk bf16 universal instance * gemm_universal_streamk.hpp * only build for gfx94 * Cmakelist * profiler update, bf16 sk only works at gfx42 * clang * clang * clang all * no need flags * cmake script * delete comment * gemm universal sk fix * clang * profiler fix * clang * update * update * delete comment * code formatting * cmake * fix instance * clang * argument supported * argument supported and clang * update * fix * removing unnecessary comments * clang formatting * Update library/src/tensor_operation_instance/gpu/CMakeLists.txt Co-authored-by: afagaj <john.afaganis@gmail.com> * CopyRight Comment 2025 * clang reformatting * copy right 2025 --------- Co-authored-by: Emin Ozturk <ozturk.27@osu.edu> Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund> Co-authored-by: root <root@splinter-126-wr-d3.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund> Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu> Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund> Co-authored-by: afagaj <john.afaganis@gmail.com> [ROCm/composable_kernel commit: `9e95d54cd2`]	2025-01-02 10:30:04 -08:00
Adam Osewski	6f088a384c	Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 ) * add a prototype of int4 * clean * debug * clean * clean * move packed into dynamic_buffer * fixed coord reset * add fast pki4 to half conversion * fix * fixed reference and host_tensor * fixed tensor init * format * debug i4_to_f16_convert * format * fixed splitk * weight permute * add b tile permute * clean * weight permute with splitki * format * improve weight layout * add and_or_b32 * fixed splitk crush * add permute switch as a template * recover v3r1 * clean * failure with intrawave v2 * fixed * fixed * add ckProfiler * add bfp16 support * add bf16 example * fixed int4 to bhalf_t conversion * format * fixed int4 to bf16 conversion * clean * add instances for mem * clean * fixed host tensor size * fixed * debug * fixed * add pk_i4_t as a struct * fix * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * revert * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed comments * revert * clean * revert * revert * fixed * Update CMakeLists.txt * Update script/cmake-ck-dev.sh Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update CMakeLists.txt Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed * fixed * fixed * revert * revert * add comments * format * fixed assert * fixed * Fix I4 define in ckProfiler * Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `1d8e4ec2ce`]	2025-01-02 11:48:06 +08:00
Bartłomiej Kocot	d72d2475fd	Add NGCHW bf16 grouped conv fwd instances (#1783 ) * Add NGCHW bf16 grouped conv fwd instances * add missed cmake [ROCm/composable_kernel commit: `159fa31946`]	2025-01-01 18:00:06 +01:00
Qianfeng	d7de39fab3	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex [ROCm/composable_kernel commit: `4e076909b6`]	2024-12-29 14:29:56 +08:00
Bartłomiej Kocot	a5a7f2675f	[CK TILE] GEMM and Batched GEMM SplitK support (#1724 ) * [CK TILE] Add split K support in GEMM * Updates * Fixes * rebase * fix * Fix * fixes * support for batched gemm [ROCm/composable_kernel commit: `af66494880`]	2024-12-28 14:40:17 +01:00
Po Yen Chen	282f02cc66	Correct the dtype checking logics (#1775 ) [ROCm/composable_kernel commit: `4c2eff023a`]	2024-12-25 23:57:28 +08:00
carlushuang	edccbb3694	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code [ROCm/composable_kernel commit: `3d15f364b3`]	2024-12-23 10:59:02 +08:00
Illia Silin	75a6d8c472	fix typo for CK_USE_OCP_FP8 (#1769 ) [ROCm/composable_kernel commit: `07339c7383`]	2024-12-20 07:52:24 -08:00
carlushuang	534b95c05c	hot-fix (#1768 ) [ROCm/composable_kernel commit: `1c45ca35dd`]	2024-12-20 16:40:45 +08:00
Po Yen Chen	08975b5a8f	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com> [ROCm/composable_kernel commit: `37cdbf4f0e`]	2024-12-20 14:41:01 +08:00


1650 Commits