composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 03:07:02 +00:00

Author	SHA1	Message	Date
coderfeli	db84352941	fix warnings and revert cmake and fix clang format	2024-12-30 08:24:50 +00:00
coderfeli	5765ba51ce	auto calculate hard code params	2024-12-30 07:59:47 +00:00
coderfeli	3f9dbcac63	use new pipeline for b preshuffle, run ok; revert olds to fix ckprofiler	2024-12-30 06:52:10 +00:00
coderfeli	54f44e6232	fix brepeat, kloop and lds two buffer; works ok now	2024-12-30 00:25:46 +00:00
coderfeli	2c056624af	fix tail	2024-12-27 08:30:03 +00:00
coderfeli	174b46b04a	add cpu shuffle	2024-12-27 07:31:14 +00:00
coderfeli	e6f5a78b14	add double buffer scratch	2024-12-26 15:02:04 +00:00
coderfeli	3784329b68	can run	2024-12-26 13:01:07 +00:00
coderfeli	4a1ec81595	add bypass logic and build	2024-12-26 10:05:25 +00:00
coderfeli	19b7c1312c	remove all non gemm in cmake	2024-12-24 07:44:12 +00:00
coderfeli	5e5e1a50f9	add instances	2024-12-24 07:43:09 +00:00
coderfeli	9ba219c875	rm debug used files	2024-12-24 03:59:12 +00:00
coderfeli	3f50b99e7b	port tiles from a8w8	2024-12-23 14:13:16 +00:00
carlushuang	3d15f364b3	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code	2024-12-23 10:59:02 +08:00
Illia Silin	07339c7383	fix typo for CK_USE_OCP_FP8 (#1769 )	2024-12-20 07:52:24 -08:00
carlushuang	1c45ca35dd	hot-fix (#1768 )	2024-12-20 16:40:45 +08:00
Po Yen Chen	37cdbf4f0e	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2024-12-20 14:41:01 +08:00
Illia Silin	2944c50894	fix profiler_grouped_gemm (#1766 )	2024-12-19 17:24:05 -08:00
Mateusz Ozga	e758d006a5	Apply Ck-tile argument parser for vectors [I/O] (#1758 ) * Parser for a vector was added. Additionaly we valid correctnes of numbers * Remove unnecessary comments * Review part 1 * Review part 2 * Add const to variadic lambda * Rename C->K	2024-12-19 17:55:35 +01:00
aledudek	453ca37347	[CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730 ) * Gemm Kernel Refactor part1 * Gemm Kernel Refactor common gemm pipeline part2 * [CK TILE] Refactor batched gemm to reuse GemmKernel * [CK TILE] Refactor GemmKernel - review changes part1 * [CK TILE] Refactor GemmKernel - references fix * [CK TILE] Refactor GemmKernel - naming changes, add problem * [CK_TILE] Refactor GemmKernel - update tests * [CK_TILE] Refactor GemmKernel - review changes * [CK_TILE] Refactor GemmKernel - update test * [CK_TILE] Refactor GemmKernel - constness fixes * [CK_TILE] Refactor GemmKernel - update tests	2024-12-18 17:52:46 +01:00
Xiaodong Wang	1c1b336371	Disambiguate bit_cast (#1749 ) Adding namespace to disambiguate with std::bit_cast Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-12-18 18:32:38 +08:00
aledudek	f6c4d614e3	[CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743 ) * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes * [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix	2024-12-18 09:45:58 +01:00
Harisankar Sadasivan	d9e37c6874	updated fp16 instances to be on parity with universal gemm instances (#1754 ) * updated fp16 instances to be on parity with universal gemm instances * corrected instance name to streamk instance	2024-12-17 10:31:21 -08:00
Illia Silin	689a5ae45b	Pass build flags to config.h (#1760 ) * pass the build flags to config.h * fix clang format	2024-12-17 10:17:29 -08:00
Max Podkorytov	6ef8d3c295	refactor conditional usage; fix build on rocm6.1 where the reference didn't exist	2024-12-17 08:40:18 -08:00
dependabot[bot]	0e54d7ae5a	Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-12-17 06:57:55 -08:00
jakpiase	627a27bda3	Added unit tests for CK Tile compute bound gemm pipeline (#1728 )	2024-12-17 14:25:22 +01:00
Adam Osewski	d46196f291	Enhance printing functionality (#1751 ) * Added object print with all template parameters * fix clang format --------- Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-12-17 09:19:44 +01:00
Max Podkorytov	0fd6978d2a	clarify release notes bullet point	2024-12-16 10:46:19 -08:00
Max Podkorytov	1b75c77da4	add contributing placeholder	2024-12-16 10:46:19 -08:00
Max Podkorytov	30a37cac0e	add pull request template placeholder	2024-12-16 10:46:19 -08:00
Max Podkorytov	a8ad7fcce9	add template placeholders	2024-12-16 10:46:19 -08:00
Illia Silin	fdfe210230	upgrade sqlalchemy version (#1748 ) * upgrade sqlalchemy version * replace the connection with engine in to_sql call * change the hipTes=nsor ctest syntax	2024-12-15 16:25:21 -08:00
Xu, Shengnan	f57d720c67	added moe interleaving pipeline (#1712 ) * added moe interleaving pipeline * remove redundant code * formater --------- Co-authored-by: root <root@hjbog-srdc-14.amd.com>	2024-12-15 20:13:10 +08:00
Illia Silin	d68974a5c6	upgrade pandas package (#1746 )	2024-12-13 16:30:39 -08:00
Illia Silin	41ebf117a5	Add zstd lib for building hipTensor. (#1745 ) * add zstd library to CI docker * fix the libzstd name	2024-12-13 16:30:22 -08:00
Bartłomiej Kocot	4d8fce33dd	Add SplitK support into Batched GEMM V3 (#1729 ) * add bmm api * add bf16 multi_d * add ckProfiler for bf16 * add ckProfiler files * add more instance; fixed 64bit index issue * fixed naming * enabled batched Ds * use long_index for ds offsets * clean * add bmm fp8 ckProfiler * Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/src/profile_gemm_universal_batched.cpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com> * clean * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp * refactor batch offset func * add splitk suppport into bmm_v3 * clean * clean * format * fixed * fix --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2024-12-13 21:08:35 +01:00
chenjun	4e73177684	Ck tile/smoothquant out stride (#1742 ) * add ck_tile/smoothquant out stride parameter * Remove the default stride value --------- Co-authored-by: so <a.com>	2024-12-13 11:53:52 +08:00
carlushuang	77a38e0211	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err	2024-12-12 11:54:03 +08:00
Illia Silin	357a0b1c57	add missing stdexcept header (#1740 )	2024-12-10 15:16:03 -08:00
Illia Silin	90d8410d56	Upgrade to Ubuntu22.04 as default OS. (#1738 ) * upgrade to ubuntu 22.04 * try adding -u roof docker options for ubuntu 22	2024-12-10 08:48:51 -08:00
Jatin Chaudhary	67497a044d	Make sure we call __hneg with half to remove ambigios error (#1736 )	2024-12-10 08:47:36 -08:00
rocking	94ae7113bd	[CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731 ) * Add data type config, Prepare to add mix precision in the future * Fix compile error	2024-12-10 11:36:18 +08:00
Illia Silin	23cf2026b4	build CI for gfx12 by default (#1734 )	2024-12-09 14:11:20 -08:00
Illia Silin	2f088b8707	update CI timeout limits (#1733 )	2024-12-09 09:32:14 -08:00
Illia Silin	c773cc25a2	remove unnecessary file (#1732 )	2024-12-09 08:50:36 -08:00
Illia Silin	355893cdd8	Refactor CI performance tests. (#1726 ) * merge the build and performance tests CI stages together * add gemm performance test on gfx11/gfx12 * add suffices to distinguish gemm performance logs from different archs * use smaller gemm set in CI for gfx10/gfx11/gfx12 * disable performance tests on gfx1030 * fix the shashing logic * fix finding python3 for mha instances	2024-12-06 13:04:25 -08:00
Rostyslav Geyyer	5e6bd75a72	Add copy assignment op test (#1718 ) * Add copy assignment op test * Add a deep copy testing	2024-12-06 09:56:27 -06:00
Bartłomiej Kocot	261f1759de	Support large batch tensors in grouped conv bwd data (#1711 ) * Support large batch tensors in grouped conv bwd data * Fix multiD * fixes * fixes * fixes	2024-12-06 10:55:23 +01:00
Po Yen Chen	58e7f37fc8	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 )	2024-12-06 12:59:58 +08:00


1617 Commits