composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 18:17:58 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	a064792e96	Fix universal gemm profiler for pk_i4_t (#1790 ) * Fix universal gemm profiler for pk_i4_t * fix [ROCm/composable_kernel commit: `888317e698`]	2025-01-04 14:01:33 +01:00
dependabot[bot]	19603c0e45	Bump rocm-docs-core from 1.12.0 to 1.12.1 in /docs/sphinx (#1788 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.12.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.12.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `37b3514648`]	2025-01-03 17:47:48 -08:00
Illia Silin	68c7f53cb1	terminology clean-up (#1792 ) [ROCm/composable_kernel commit: `8ea375bb58`]	2025-01-03 16:38:22 -08:00
carlushuang	60e814a3ba	[CK_TILE]naive attn support FP8 KVCache quant (#1747 ) * quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `6df5fe2ad8`]	2025-01-03 18:43:07 +08:00
Mingtao Gu	d4a8c6c2ed	Implement the fp16xint4 scale weight only kernel for Ali (#1786 ) * enable int4 scale (weight only) kernel * format some files * Add unit test for int4 weight only * fixed and formatted code * fixed * formated * formated * fixed * fixed a bug in the ckProfiler, and formatted the code --------- Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `4f62f6e9b7`]	2025-01-03 18:35:21 +08:00
feli	5ce28a1d13	Ck tile/layernorm: implement naive reduce, opt performance (#1784 ) * add no welford * enable output raw * raw of int8 * fix build * fix smoke test err * [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate * [cktile]layernorm, fix err commit files and remove uselss * fix quant 8192 err & change norm_reduce class and file name --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `4bc610416a`]	2025-01-03 14:28:59 +08:00
John Afaganis	de674980aa	Add afagaj to CODEOWNERS (#1787 ) [ROCm/composable_kernel commit: `17e8efb573`]	2025-01-02 20:50:07 -06:00
Muhammed Emin Ozturk	222b1d6b48	BF16 GEMM Stream-K (#1541 ) * initial * Cmake file * successfull compilation but validation failed * Cmake * update * gpu validation * gemm universal * gemm universal sk update * sk bf16 universal instance * gemm_universal_streamk.hpp * only build for gfx94 * Cmakelist * profiler update, bf16 sk only works at gfx42 * clang * clang * clang all * no need flags * cmake script * delete comment * gemm universal sk fix * clang * profiler fix * clang * update * update * delete comment * code formatting * cmake * fix instance * clang * argument supported * argument supported and clang * update * fix * removing unnecessary comments * clang formatting * Update library/src/tensor_operation_instance/gpu/CMakeLists.txt Co-authored-by: afagaj <john.afaganis@gmail.com> * CopyRight Comment 2025 * clang reformatting * copy right 2025 --------- Co-authored-by: Emin Ozturk <ozturk.27@osu.edu> Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund> Co-authored-by: root <root@splinter-126-wr-d3.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund> Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu> Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund> Co-authored-by: afagaj <john.afaganis@gmail.com> [ROCm/composable_kernel commit: `9e95d54cd2`]	2025-01-02 10:30:04 -08:00
Adam Osewski	ac74520ff6	Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 ) * add a prototype of int4 * clean * debug * clean * clean * move packed into dynamic_buffer * fixed coord reset * add fast pki4 to half conversion * fix * fixed reference and host_tensor * fixed tensor init * format * debug i4_to_f16_convert * format * fixed splitk * weight permute * add b tile permute * clean * weight permute with splitki * format * improve weight layout * add and_or_b32 * fixed splitk crush * add permute switch as a template * recover v3r1 * clean * failure with intrawave v2 * fixed * fixed * add ckProfiler * add bfp16 support * add bf16 example * fixed int4 to bhalf_t conversion * format * fixed int4 to bf16 conversion * clean * add instances for mem * clean * fixed host tensor size * fixed * debug * fixed * add pk_i4_t as a struct * fix * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * revert * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed comments * revert * clean * revert * revert * fixed * Update CMakeLists.txt * Update script/cmake-ck-dev.sh Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update CMakeLists.txt Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed * fixed * fixed * revert * revert * add comments * format * fixed assert * fixed * Fix I4 define in ckProfiler * Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `1d8e4ec2ce`]	2025-01-02 11:48:06 +08:00
Bartłomiej Kocot	a860c20099	Add NGCHW bf16 grouped conv fwd instances (#1783 ) * Add NGCHW bf16 grouped conv fwd instances * add missed cmake [ROCm/composable_kernel commit: `159fa31946`]	2025-01-01 18:00:06 +01:00
Qianfeng	8c1883a424	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex [ROCm/composable_kernel commit: `4e076909b6`]	2024-12-29 14:29:56 +08:00
Bartłomiej Kocot	7fbc8a9ac1	[CK TILE] GEMM and Batched GEMM SplitK support (#1724 ) * [CK TILE] Add split K support in GEMM * Updates * Fixes * rebase * fix * Fix * fixes * support for batched gemm [ROCm/composable_kernel commit: `af66494880`]	2024-12-28 14:40:17 +01:00
Po Yen Chen	1e65b3ab35	Correct the dtype checking logics (#1775 ) [ROCm/composable_kernel commit: `4c2eff023a`]	2024-12-25 23:57:28 +08:00
carlushuang	4c4be7b14f	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code [ROCm/composable_kernel commit: `3d15f364b3`]	2024-12-23 10:59:02 +08:00
Illia Silin	c369965615	fix typo for CK_USE_OCP_FP8 (#1769 ) [ROCm/composable_kernel commit: `07339c7383`]	2024-12-20 07:52:24 -08:00
carlushuang	0d16f9b5c7	hot-fix (#1768 ) [ROCm/composable_kernel commit: `1c45ca35dd`]	2024-12-20 16:40:45 +08:00
Po Yen Chen	f5c4569acd	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com> [ROCm/composable_kernel commit: `37cdbf4f0e`]	2024-12-20 14:41:01 +08:00
Illia Silin	17e73266d5	fix profiler_grouped_gemm (#1766 ) [ROCm/composable_kernel commit: `2944c50894`]	2024-12-19 17:24:05 -08:00
Mateusz Ozga	e08b9f7cce	Apply Ck-tile argument parser for vectors [I/O] (#1758 ) * Parser for a vector was added. Additionaly we valid correctnes of numbers * Remove unnecessary comments * Review part 1 * Review part 2 * Add const to variadic lambda * Rename C->K [ROCm/composable_kernel commit: `e758d006a5`]	2024-12-19 17:55:35 +01:00
aledudek	d9025d054d	[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730 ) * Gemm Kernel Refactor part1 * Gemm Kernel Refactor common gemm pipeline part2 * [CK TILE] Refactor batched gemm to reuse GemmKernel * [CK TILE] Refactor GemmKernel - review changes part1 * [CK TILE] Refactor GemmKernel - references fix * [CK TILE] Refactor GemmKernel - naming changes, add problem * [CK_TILE] Refactor GemmKernel - update tests * [CK_TILE] Refactor GemmKernel - review changes * [CK_TILE] Refactor GemmKernel - update test * [CK_TILE] Refactor GemmKernel - constness fixes * [CK_TILE] Refactor GemmKernel - update tests [ROCm/composable_kernel commit: `453ca37347`]	2024-12-18 17:52:46 +01:00
Xiaodong Wang	eb09d3a572	Disambiguate bit_cast (#1749 ) Adding namespace to disambiguate with std::bit_cast Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `1c1b336371`]	2024-12-18 18:32:38 +08:00
aledudek	ce345cc50e	[CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743 ) * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix [ROCm/composable_kernel commit: `f6c4d614e3`]	2024-12-18 09:45:58 +01:00
Harisankar Sadasivan	4f35cc87fe	updated fp16 instances to be on parity with universal gemm instances (#1754 ) * updated fp16 instances to be on parity with universal gemm instances * corrected instance name to streamk instance [ROCm/composable_kernel commit: `d9e37c6874`]	2024-12-17 10:31:21 -08:00
Illia Silin	57d3525983	Pass build flags to config.h (#1760 ) * pass the build flags to config.h * fix clang format [ROCm/composable_kernel commit: `689a5ae45b`]	2024-12-17 10:17:29 -08:00
Max Podkorytov	477e028e58	refactor conditional usage; fix build on rocm6.1 where the reference didn't exist [ROCm/composable_kernel commit: `6ef8d3c295`]	2024-12-17 08:40:18 -08:00
dependabot[bot]	b0cd5fea51	Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `0e54d7ae5a`]	2024-12-17 06:57:55 -08:00
jakpiase	430eafe07d	Added unit tests for CK Tile compute bound gemm pipeline (#1728 ) [ROCm/composable_kernel commit: `627a27bda3`]	2024-12-17 14:25:22 +01:00
Adam Osewski	481555748d	Enhance printing functionality (#1751 ) * Added object print with all template parameters * fix clang format --------- Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com> Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `d46196f291`]	2024-12-17 09:19:44 +01:00
Max Podkorytov	4fb6f1c199	clarify release notes bullet point [ROCm/composable_kernel commit: `0fd6978d2a`]	2024-12-16 10:46:19 -08:00
Max Podkorytov	084e264f0a	add contributing placeholder [ROCm/composable_kernel commit: `1b75c77da4`]	2024-12-16 10:46:19 -08:00
Max Podkorytov	6107f3ab89	add pull request template placeholder [ROCm/composable_kernel commit: `30a37cac0e`]	2024-12-16 10:46:19 -08:00
Max Podkorytov	16166b49a8	add template placeholders [ROCm/composable_kernel commit: `a8ad7fcce9`]	2024-12-16 10:46:19 -08:00
Illia Silin	b75aad4943	upgrade sqlalchemy version (#1748 ) * upgrade sqlalchemy version * replace the connection with engine in to_sql call * change the hipTes=nsor ctest syntax [ROCm/composable_kernel commit: `fdfe210230`]	2024-12-15 16:25:21 -08:00
Xu, Shengnan	e63c346bf2	added moe interleaving pipeline (#1712 ) * added moe interleaving pipeline * remove redundant code * formater --------- Co-authored-by: root <root@hjbog-srdc-14.amd.com> [ROCm/composable_kernel commit: `f57d720c67`]	2024-12-15 20:13:10 +08:00
Illia Silin	6f4b8ba3fd	upgrade pandas package (#1746 ) [ROCm/composable_kernel commit: `d68974a5c6`]	2024-12-13 16:30:39 -08:00
Illia Silin	61dab707af	Add zstd lib for building hipTensor. (#1745 ) * add zstd library to CI docker * fix the libzstd name [ROCm/composable_kernel commit: `41ebf117a5`]	2024-12-13 16:30:22 -08:00
Bartłomiej Kocot	4111d2fbfd	Add SplitK support into Batched GEMM V3 (#1729 ) * add bmm api * add bf16 multi_d * add ckProfiler for bf16 * add ckProfiler files * add more instance; fixed 64bit index issue * fixed naming * enabled batched Ds * use long_index for ds offsets * clean * add bmm fp8 ckProfiler * Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/src/profile_gemm_universal_batched.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * clean * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * refactor batch offset func * add splitk suppport into bmm_v3 * clean * clean * format * fixed * fix --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `4d8fce33dd`]	2024-12-13 21:08:35 +01:00
chenjun	26839ac17b	Ck tile/smoothquant out stride (#1742 ) * add ck_tile/smoothquant out stride parameter * Remove the default stride value --------- Co-authored-by: so <a.com> [ROCm/composable_kernel commit: `4e73177684`]	2024-12-13 11:53:52 +08:00
carlushuang	b675a4d4ba	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err [ROCm/composable_kernel commit: `77a38e0211`]	2024-12-12 11:54:03 +08:00
Illia Silin	b0cb070311	add missing stdexcept header (#1740 ) [ROCm/composable_kernel commit: `357a0b1c57`]	2024-12-10 15:16:03 -08:00
Illia Silin	21ae5f75c3	Upgrade to Ubuntu22.04 as default OS. (#1738 ) * upgrade to ubuntu 22.04 * try adding -u roof docker options for ubuntu 22 [ROCm/composable_kernel commit: `90d8410d56`]	2024-12-10 08:48:51 -08:00
Jatin Chaudhary	09598a1337	Make sure we call __hneg with half to remove ambigios error (#1736 ) [ROCm/composable_kernel commit: `67497a044d`]	2024-12-10 08:47:36 -08:00
rocking	6e778cc529	[CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731 ) * Add data type config, Prepare to add mix precision in the future * Fix compile error [ROCm/composable_kernel commit: `94ae7113bd`]	2024-12-10 11:36:18 +08:00
Illia Silin	d65f9559ff	build CI for gfx12 by default (#1734 ) [ROCm/composable_kernel commit: `23cf2026b4`]	2024-12-09 14:11:20 -08:00
Illia Silin	f9d1334a29	update CI timeout limits (#1733 ) [ROCm/composable_kernel commit: `2f088b8707`]	2024-12-09 09:32:14 -08:00
Illia Silin	d34b5e54a9	remove unnecessary file (#1732 ) [ROCm/composable_kernel commit: `c773cc25a2`]	2024-12-09 08:50:36 -08:00
Illia Silin	f76071125c	Refactor CI performance tests. (#1726 ) * merge the build and performance tests CI stages together * add gemm performance test on gfx11/gfx12 * add suffices to distinguish gemm performance logs from different archs * use smaller gemm set in CI for gfx10/gfx11/gfx12 * disable performance tests on gfx1030 * fix the shashing logic * fix finding python3 for mha instances [ROCm/composable_kernel commit: `355893cdd8`]	2024-12-06 13:04:25 -08:00
Rostyslav Geyyer	99040fae57	Add copy assignment op test (#1718 ) * Add copy assignment op test * Add a deep copy testing [ROCm/composable_kernel commit: `5e6bd75a72`]	2024-12-06 09:56:27 -06:00
Bartłomiej Kocot	8df7da97a2	Support large batch tensors in grouped conv bwd data (#1711 ) * Support large batch tensors in grouped conv bwd data * Fix multiD * fixes * fixes * fixes [ROCm/composable_kernel commit: `261f1759de`]	2024-12-06 10:55:23 +01:00
Po Yen Chen	44b0100283	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 ) [ROCm/composable_kernel commit: `58e7f37fc8`]	2024-12-06 12:59:58 +08:00

1608 Commits