composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
ThomasNing	ea5f9718ec	workable async copy under gfx950	2025-09-07 18:21:43 -05:00
joye	bc4c7d26af	avoid compiling device functions when compile host	2025-05-21 16:52:58 +08:00
joye	8e3e59cb1f	fix a commented format issue	2025-05-21 15:28:16 +08:00
joye	2e296ee963	improve codes	2025-05-21 15:22:33 +08:00
joye	7377bc7200	fix some sync issues	2025-05-21 15:05:52 +08:00
joye	619508f89d	change lds layout	2025-05-21 14:35:05 +08:00
joye	ba86551534	fix a typo error	2025-05-21 14:29:49 +08:00
joye	37a4a9aae4	fix a CRTP issue	2025-05-21 14:13:09 +08:00
joye	88601c8a05	update async policy	2025-05-21 13:31:13 +08:00
joye	ad32373ff1	fix clang format	2025-05-21 11:56:35 +08:00
joyeamd	8470702ac0	change async pipeline's tile distribution pattern from thread to warp	2025-05-21 11:53:13 +08:00
joye	5079e3f3a2	update lds descriptor	2025-05-20 17:14:29 +08:00
joye	fee156a37d	fix a descriptor issue	2025-05-20 16:58:18 +08:00
joye	33c643001e	fix some compiling errors	2025-05-20 15:52:33 +08:00
joye	55f3632901	fix a compiling error	2025-05-20 15:16:53 +08:00
joye	c71b3840e8	fix lds descriptor	2025-05-20 15:00:32 +08:00
joye	3500259bdc	update async load apis	2025-05-20 13:26:49 +08:00
joye	91113f6464	comment some not gfx950 codes	2025-05-19 10:13:49 +08:00
joye	4a544c4084	comment some not gfx950 codes	2025-05-19 10:11:49 +08:00
joye	5c936dc8f3	add ignore header file	2025-05-19 09:34:26 +08:00
joye	0f2dfa8d38	fix async inline assembly	2025-05-19 09:33:17 +08:00
joye	2a9f0fff5d	fix async inline assembly	2025-05-19 09:30:53 +08:00
joye	f0e1dbca49	fix some issues in async load	2025-05-19 08:56:03 +08:00
joye	7a03addc12	fix some compiling errors	2025-05-19 08:28:29 +08:00
joye	9c429ad0cc	add a new pipeline for async load	2025-05-16 09:25:16 +08:00
joye	acdef41575	add a pipeline which copies from v4	2025-05-15 18:00:30 +08:00
joye	0c15891dea	fix some compiling errors	2025-05-15 14:54:13 +08:00
joye	8c09b31f72	fix a compiling error	2025-05-15 14:39:21 +08:00
joye	2fdcb55cf1	fix some compiling errors	2025-05-15 14:36:02 +08:00
joye	726ae62113	add async load api	2025-05-15 13:36:46 +08:00
joye	4dd9e8c0c8	add for async load builtin	2025-05-15 13:29:39 +08:00
BingYuan.Zhou	41c17d0a95	fix moe sorting build fail (#2190) * fix moe sorting build fail * refile code --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-14 09:31:26 +08:00
Illia Silin	58f9e9ffbc	Update the buffer load/store intrinsic names for clang>=20. (#2192) * fix the buffer load/store intrinsic names * fix clang format	2025-05-13 10:18:14 -07:00
Bartłomiej Kocot	c53b7bd22e	Switch to v2 pipeline for grouped conv bwd data (#2181) * Change to old pipeline for grouped conv bwd data * fix * fix * fix * fix * fix * fix * Fix	2025-05-13 10:14:30 +02:00
Po Yen Chen	2920604786	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163 ) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to slience the compilation error * [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-05-13 12:19:25 +08:00
Khushbu Agarwal	f05e45ba59	Disable SMFMA gfx90a (#2184) * sparsity fix for gfx90a * reverting tile_engine changes	2025-05-12 09:56:23 -07:00
Thomas Ning	b49f7de81f	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed	2025-05-12 09:52:58 -07:00
Thomas Ning	9d1e44e56a	Vectorized Transpose for Batched Transpose CK Tile Operator (#2131 ) * Shared Memory for single data point * CKTile Transpose vectorize CP1 * CKTile Transpose vectorize CP2 * CKTile Transpose vectorize CP2.1 * fixed the compile error of the transpose tile 2d * Have the correct result for the current test sample * Changes to printing tensor * fp8 support added * Debugging for transpose * solving the corner issue * Changed padding flag * Intermideate Debugging * Intermidiate Debugging * Intermediate Debugging * Finished debugging of the transpose op * Code Cleanup * Adding edge case smoke tests * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Addressing Review Comment * Addressing Comments * Addressing Comments * Measuring Perf Tests * Code Cleanup * Changlog * Added the running iterations * clang format * Fix the changelog * Fix the compilation error * change the printing factor --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>	2025-05-12 00:41:45 -07:00
Khushbu Agarwal	d8faf1c6a1	Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172) * Changes for updating tile distribution for shuffle and transpose * Fixed swizzle and transpose, removed comments * clang formatted * Adding support for bf16 type * Addressing review comments	2025-05-10 22:40:05 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
jefyang1	6b1a339b6f	Fix grouped conv bwd data tests on gfx950 (#2173)	2025-05-09 09:01:06 -07:00
Khushbu Agarwal	ef72a4b9bc	Disable SMFMA for gfx90a (#2182)	2025-05-09 00:18:07 -07:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Thomas Ning	c757046d49	Revert "Disable the SMFMA instruction for gfx90a. (#2174)" (#2175) This reverts commit `a32d907771`.	2025-05-08 00:07:03 -07:00
Khushbu Agarwal	a32d907771	Disable the SMFMA instruction for gfx90a. (#2174) * remove smfma for gfx90a * clang formatted	2025-05-07 23:09:22 -07:00
BingYuan.Zhou	6a3960c1e1	Flatmm merge (#2168) * sync with function interface of cshuffleepiloge,fix flatmm build fail * move code from solin/flatmm which add mfma161632fp8 and optimize flatmm --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-08 12:59:57 +08:00
jakpiase	cb07ad84d5	fix for default epilogue (#2167)	2025-05-07 10:46:53 -07:00
Aviral Goel	769336b640	[CK_TILE] Add type traits to detect tile window types at compile time (#2158) * added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline * added type traits instead of enum to tile_window() and tile_window_linear() with debug comments * removed comments, added documentation and clang format	2025-05-07 00:00:39 -07:00
Rostyslav Geyyer	8a0d659f92	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-06 09:24:00 -05:00
carlushuang	4e9b76f88c	[CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153) * combine 2-3 as single stage * support zeroing * improve long tokens * update specialization * b16 ws * 8bit topk optimize * update 15 example	2025-05-06 17:32:07 +08:00

816 Commits