composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	7c0e29cc0f	Extend 64x64 with 4 waves instances for grouped conv bwd wei (#2187 ) * Extend 64x64 with 4 waves instances for grouped conv bwd wei * Fix * fix * fix	2025-05-15 16:21:34 +02:00
BingYuan.Zhou	41c17d0a95	fix moe sorting build fail (#2190 ) * fix moe sorting build fail * refile code --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-14 09:31:26 +08:00
Illia Silin	58f9e9ffbc	Update the buffer load/store intrinsic names for clang>=20. (#2192 ) * fix the buffer load/store intrinsic names * fix clang format	2025-05-13 10:18:14 -07:00
Bartłomiej Kocot	c53b7bd22e	Switch to v2 pipeline for grouped conv bwd data (#2181 ) * Change to old pipeline for grouped conv bwd data * fix * fix * fix * fix * fix * fix * Fix	2025-05-13 10:14:30 +02:00
Po Yen Chen	2920604786	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163 ) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to silence the compilation error * [CK_TILE] Update attention customization interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-05-13 12:19:25 +08:00
Khushbu Agarwal	f05e45ba59	Disable SMFMA gfx90a (#2184 ) * sparsity fix for gfx90a * reverting tile_engine changes	2025-05-12 09:56:23 -07:00
Thomas Ning	b49f7de81f	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166 ) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed	2025-05-12 09:52:58 -07:00
Thomas Ning	9d1e44e56a	Vectorized Transpose for Batched Transpose CK Tile Operator (#2131 ) * Shared Memory for single data point * CKTile Transpose vectorize CP1 * CKTile Transpose vectorize CP2 * CKTile Transpose vectorize CP2.1 * fixed the compile error of the transpose tile 2d * Have the correct result for the current test sample * Changes to printing tensor * fp8 support added * Debugging for transpose * solving the corner issue * Changed padding flag * Intermediate Debugging * Intermediate Debugging * Intermediate Debugging * Finished debugging of the transpose op * Code Cleanup * Adding edge case smoke tests * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Addressing Review Comment * Addressing Comments * Addressing Comments * Measuring Perf Tests * Code Cleanup * Changelog * Added the running iterations * clang format * Fix the changelog * Fix the compilation error * change the printing factor --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>	2025-05-12 00:41:45 -07:00
Khushbu Agarwal	d8faf1c6a1	Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172 ) * Changes for updating tile distribution for shuffle and transpose * Fixed swizzle and transpose, removed comments * clang formatted * Adding support for bf16 type * Addressing review comments	2025-05-10 22:40:05 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179 ) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
jefyang1	6b1a339b6f	Fix grouped conv bwd data tests on gfx950 (#2173 )	2025-05-09 09:01:06 -07:00
Mingtao Gu	a23390163d	fix moe gemm2 for gfx950 (#2164 ) Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-05-09 08:25:31 -07:00
Khushbu Agarwal	ef72a4b9bc	Disable SMFMA for gfx90a (#2182 )	2025-05-09 00:18:07 -07:00
Illia Silin	3448e12609	Generate ckProfiler package for gfx942 only. (#2180 ) * build CI for gfx942 exclusively * run the last stage in a docker with user jenkins * update the image for the last stage * ignore perf_log if not found * archive and store all packages * use ccache for building packages	2025-05-08 13:29:14 -07:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171 ) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Thomas Ning	c757046d49	Revert "Disable the SMFMA instruction for gfx90a. (#2174 )" (#2175 ) This reverts commit `a32d907771`.	2025-05-08 00:07:03 -07:00
Khushbu Agarwal	a32d907771	Disable the SMFMA instruction for gfx90a. (#2174 ) * remove smfma for gfx90a * clang formatted	2025-05-07 23:09:22 -07:00
BingYuan.Zhou	6a3960c1e1	Flatmm merge (#2168 ) * sync with function interface of cshuffleepiloge,fix flatmm build fail * move code from solin/flatmm which add mfma161632fp8 and optimize flatmm --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-08 12:59:57 +08:00
Khushbu Agarwal	c7b8e86e34	[CK_Tile] Simplified Mem pipeline (#2159 ) * simplify code * compiled the code * Simplified example and codegen for mem pipeline * Reverting config and universal gemm example * clang formatted * remove comments * clang formatted * Add memory operation changes for default pipeline * fix config file --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-05-07 18:37:31 -07:00
jakpiase	cb07ad84d5	fix for default epilogue (#2167 )	2025-05-07 10:46:53 -07:00
Bartłomiej Kocot	397b9080a2	Move 16x16 grouped conv fwd instances from comp header (#2165 ) * Move 16x16 grouped conv fwd instances from comp header * Improvements	2025-05-07 08:04:31 -07:00
kylasa	956fe8f751	Simple copy kernel, which can be a tool to experiment with CK_Tile API with minimal code. (#2156 ) * Test Copy kernel code for testing tile distribution logic * Fix the error * Solved the problem * Updated comments and document formatting * Removed unused tile distribution and code cleanup * Added README.md and formatting for CI/CD. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-05-07 00:02:59 -07:00
Aviral Goel	769336b640	[CK_TILE] Add type traits to detect tile window types at compile time (#2158 ) * added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline * added type traits instead of enum to tile_window() and tile_window_linear() with debug comments * removed comments, added documentation and clang format	2025-05-07 00:00:39 -07:00
Rostyslav Geyyer	8a0d659f92	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-06 09:24:00 -05:00
carlushuang	4e9b76f88c	[CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153 ) * combine 2-3 as single stage * support zeroing * improve long tokens * update specialization * b16 ws * 8bit topk optimize * update 15 example	2025-05-06 17:32:07 +08:00
Muhammed Emin Ozturk	b8fa27bfef	Fix failure in test_batched_gemm_softmax_gemm_permute for lower resource devices (#2117 ) * Problematic test case are analyzed and turned off for lower resource GPUs * update device info * Update test_batched_gemm_softmax_gemm_permute_bf16_xdl.cpp * Update test_batched_gemm_softmax_gemm_permute_fp16_xdl.cpp * Update test/batched_gemm_softmax_gemm_permute/test_batched_gemm_device_utils.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com>	2025-05-05 13:12:22 -07:00
jakpiase	0bcb804ad0	[CK_TILE] Remove scratch usage from universal gemm (#2001 ) * moves kbatch condition outside of kernel * add reviewer comments * fixes * fix tests * fixes after review --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-05 18:46:44 +02:00
Andriy Roshchenko	79beaacdd1	Restrict MX GEMM instantiation to GFX950 arch (#2157 )	2025-05-05 08:18:22 -07:00
dependabot[bot]	c4e4e592c1	Bump rocm-docs-core[api_reference] from 1.18.2 to 1.18.4 in /docs/sphinx (#2161 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.18.4. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.18.4) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.18.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-05-05 07:29:07 -07:00
Khushbu Agarwal	d58f2b8bd0	mfma_32x32x64_fp8/bf8 (#2148 ) * support for mfma_32x32x64_fp8 * clang-formatted * Fixing sparsity in codegen	2025-05-01 13:36:24 -07:00
Illia Silin	619fba3134	re-enable ck4inductor tests by default (#2155 )	2025-05-01 12:37:27 -07:00
Andriy Roshchenko	79b0bfeb41	MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152 ) * Add gemm_mx_fp8_bf8 example with row-major B * Add more overloads of MX MFMA instructions * Add MK_KN (RRR) tests * Add KM_NK (CCR) tests * Add more problem sizes to Large tests * Add test_gemm_mx to the list of regression tests	2025-05-01 11:55:48 -06:00
Illia Silin	b9d17bdb11	add write permissions in workspace (#2154 )	2025-05-01 07:04:57 -07:00
Aviral Goel	1d8ef40760	Add documentation for ck_tile::array<T,N> (#2078 ) * added documentation for ck_tile::array<T,N> * clang format fix * spelling errros Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * spelling errros Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Revert "spelling errros" This reverts commit `4179e7d193`. * Revert "spelling errros" This reverts commit `3f90733dbe`. --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: John Afaganis <john.afaganis@amd.com>	2025-04-30 16:43:36 -07:00
spolifroni-amd	cfae863431	updated Doxyfile and added the class list (#2147 ) * updated Doxyfile and added the class list * Update Doxyfile	2025-04-30 14:58:40 -07:00
Illia Silin	9a9f59ae69	Revert "Add ck tile examples to package (#1880 )" (#2150 )	2025-04-30 10:20:16 -07:00
Bartłomiej Kocot	23de234dbe	Add grouped conv fwd 16x16 mfma instruction instances (#2140 ) * Add grouped conv fwd 16x16 mfma instruction instances * fix * remove oddc * fix * fix	2025-04-30 09:49:37 +02:00
Aviral Goel	1aea51d34e	[Tile Engine] Improved README.md (#2134 ) * improved tile_engine readme * changed ck tile explanation and json * further improved readme * fixed typo	2025-04-29 17:37:07 -07:00
Max Podkorytov	6601931949	try building ck4inductor and testing it inside a virtual environment (#2142 ) use system virtualenv use python-full ubuntu package in docker image --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-04-29 17:22:38 -07:00
Illia Silin	8fcb4dff1a	Run CI jobs as user jenkins (#2141 ) * run CI as jenkins * remove user jenkins from docker image * move inductor installation to a writeable path * add a switch for inductor tests	2025-04-29 07:35:10 -07:00
Aviral Goel	65f182d617	Add Matrix A and Matrix B Swizzle for LDS in Computev4 policy (#2136 ) * fixed computev4 policy bug for lds swizzle * added swizzle for input matrix B * Improved ComputeV4 policy and pipeline by swizzling A and B * consolidated LDS descriptor functions in parent struct	2025-04-28 18:20:47 -07:00
Khushbu Agarwal	d107f3c3a5	Support for MFMA_16x16x128 for fp8/bf8 (#2125 ) * Adding 16x16x128 support for gfx950 * Support for fp8 and bf8 * fix input arguments for MFMA scale instruction * clang-formatted * Fixes for lwpck-3145 (#2138) * Fix lds tile & cmake dep & default epilogue * Fallback BTypeToUse to ADataType in WOQ cases * reverting instance json file * reverting instance json file --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2025-04-28 18:19:50 -07:00
Khushbu Agarwal	768c99eca9	[TileEngine] Support for sparsity in codegen (#2128 ) * Added sparsity flag in codegen * remove comments * clang formatted * added sparsity as runtime argument * updated README * updated stream config variable * fix typo for tail_num in hot loop	2025-04-28 18:19:23 -07:00
Bartłomiej Kocot	4094ad158a	Integrate universal gemm with conv bwd data and add SplitK (#1315 ) * Integrate universal gemm with conv bwd data * Fix multi d kernel * Add splitK support * instances refactor * instances refactor * refactor * fixes * fixes * 16x16 instances * Fixes * Fix * Fix * Fix * Fix * Fix * Fixes * fix * fix	2025-04-28 23:54:49 +02:00
Daniel Su	d9786f3363	Check max-ilp-scheduling compiler option for moe_gemm examples (#2127 )	2025-04-28 13:40:22 -07:00
jakpiase	434d19f696	Add ck tile examples to package (#1880 ) * add ck tile examples to package * Update jenkinsfile * fix for jenkinsfile * fix for building ck tile code on non gfx9 * compile ck tile examples only for gfx94 * include ck tile examples in all target * fix for basic gemm UseStructuredSparsity * Update CMakeLists.txt * Update gemm_pipeline_problem.hpp * add targets to rocm install --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-04-28 09:53:19 -07:00
lalala-sh	83394e40d2	fix moe i4 example bug (#2139 )	2025-04-28 09:49:31 -07:00
Anton Gorenko	edd92fc546	DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096 ) * Prepare files for DeviceGemm_Wmma_CShuffleV3 * Implement main part of CShuffleV3 with block pipeline v3 for WMMA * Remove unused functions and template params for A/B descriptors * Support both gfx11 and gfx12 * Enable SplitK for gfx12 and disable for gfx11 * Added RowColRow layout for DeviceGemmV2 fp16 * Added more instances for Row, Col, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout * Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout * Fix formatting * Add documentation Based on `e5ad48a784` * Enable gemm_universal profiling for gfx11/12 * Add WMMA intrinsics for F8/BF8 * Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances * Add BF16 instances and tests * Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8 --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com>	2025-04-28 10:14:21 +05:00
Yi DING	8add2cf45d	Fix fp8 convert & add option for basic example (#2129 )	2025-04-27 16:26:05 -07:00
Po Yen Chen	3d4d70d2fc	Avoid using store_tile_raw() for fp32 tensors (#2072 )	2025-04-26 23:07:41 -07:00

1896 Commits