composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
joye	37a4a9aae4	fix a CRTP issue	2025-05-21 14:13:09 +08:00
joye	88601c8a05	update async policy	2025-05-21 13:31:13 +08:00
joye	ad32373ff1	fix clang format	2025-05-21 11:56:35 +08:00
joyeamd	8470702ac0	change async pipeline's tile distribution pattern from thread to warp	2025-05-21 11:53:13 +08:00
Thomas Ning	1386924749	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212) * Add small instance, add the bug fix, & improve the example CMake * clang format	2025-05-20 15:05:08 -07:00
Sami Remes	d1e6f0982d	[CK_TILE] Grouped GEMM tile loop (#2146) * Add trait to use a persistent kernel and split the entrypoints in grouped gemm * Some helper functions for persistent kernel case * Get max occupancy grid using device properties * Implement tile loop in main entry point to grouped gemm * Enable GridSize() on device * Handle offset tile index using real current block index * Add persistent kernel choice to grouped gemm example * Use a for-loop for iterating over the group * Reduce VGPR spills by early-exit * Enable persistent kernel choice in grouped_gemm example * Add persistent kernel option to grouped_gemm test * Fix formatting with remod.py * Remove GridUpdateBlocks as blocks are now iteratively computed * Add comment about VGPR spilling * Fix formatting * Use CK_TILE_HOST instead of __host__ * Enable all Row/Col combinations in grouped gemm unit test * Add some KBatch=2 cases to grouped gemm tests * Fix SplitK for grouped gemm * Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm * Add type traits * Split examples to regular and tileloop * Formatting * Use hipExtStreamGetCUMask to get current active CUs for the given stream * Align test and example kernel config, and disable validation for splitk repeats * Remove debug options from CMakeLists.txt * Separate the code paths for persistent/non-persistent in test * Fix formatting * Address review comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-20 17:18:57 +03:00
joye	5079e3f3a2	update lds descriptor	2025-05-20 17:14:29 +08:00
joye	fee156a37d	fix a descriptor issue	2025-05-20 16:58:18 +08:00
joye	33c643001e	fix some compiling errors	2025-05-20 15:52:33 +08:00
joye	55f3632901	fix a compiling error	2025-05-20 15:16:53 +08:00
joye	c71b3840e8	fix lds descriptor	2025-05-20 15:00:32 +08:00
joye	3500259bdc	update async load apis	2025-05-20 13:26:49 +08:00
jefyang1	f18170064d	Use new mfma instructions for FP8 on gfx950 (#2202) * Add logic to use new mfma instructions for fp8 bf8 * Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format * Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> * Fix intrin_mfma f8 calls due to merge mistake --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-19 17:29:51 -07:00
Andriy Roshchenko	57e0f5df29	MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199) * Unify test interface for different layouts. * WIP: Introducing FP4/FP6/FP8 abstractions * WIP: Introducing packed storage abstraction * WIP: Introducing packed storage abstraction * WIP: Improved support for FP6 data type * Refactor packed storage for f6_t * WIP: FP6 MFMA test * Test if we correctly represent all FP6/FP4 numbers * Additional output for failed FP4 test. * More failing conversion tests * Even more failing conversion tests * Working FP6 MFMA tests * Expand MX MFMA testing to BF8/6 * Update and verify MX MFMA test for packed types * Fix fp4 and fp6 conversions on host * Working MX MFMA tests for FP8/6/4 * Cleanup * Add missing type * Cleanup * Final cleanup * Restrict FP6/4 values output to CK_LOGGING=1 * Use CHAR_BIT instead of number 8 * Fix typo * Remove FP6 and FP4 from the list of native types --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>	2025-05-19 16:52:51 -05:00
joye	91113f6464	comment some not gfx950 codes	2025-05-19 10:13:49 +08:00
joye	4a544c4084	comment some not gfx950 codes	2025-05-19 10:11:49 +08:00
joye	5c936dc8f3	add ignore header file	2025-05-19 09:34:26 +08:00
joye	0f2dfa8d38	fix async inline assembly	2025-05-19 09:33:17 +08:00
joye	2a9f0fff5d	fix async inline assembly	2025-05-19 09:30:53 +08:00
joye	f0e1dbca49	fix some issues in async load	2025-05-19 08:56:03 +08:00
joye	7a03addc12	fix some compiling errors	2025-05-19 08:28:29 +08:00
arai713	5b3430b868	Narrowing error fix for codegen compilation (#2194) * removed comment with special characters * fix for arg/template change after merge from develop --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-05-16 11:11:54 -07:00
Mateusz Ozga	fa3c6811d8	Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186)	2025-05-16 10:18:47 +02:00
Po Yen Chen	791802b381	[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198) * Write soft-sign in inline asm * Change tile idx computation * Add macro to turn off soft-sign asm opt * Use simple for loop to avoid register spill * Only do block id transform for masking cases	2025-05-16 15:14:46 +08:00
joye	9c429ad0cc	add a new pipeline for async load	2025-05-16 09:25:16 +08:00
Khushbu Agarwal	3d8d6e75e4	Adding validation for tile sizes in Tile Engine (#2189) * Adding validation for tile sizes * Add architecture in config, and shuffle lines of code in warp_gemm.hpp * Enable MFMA for gfx950, and invalid tile handling	2025-05-15 10:28:31 -07:00
joye	acdef41575	add a pipeline which copies from v4	2025-05-15 18:00:30 +08:00
joye	0c15891dea	fix some compiling errors	2025-05-15 14:54:13 +08:00
joye	8c09b31f72	fix a compiling error	2025-05-15 14:39:21 +08:00
joye	2fdcb55cf1	fix some compiling errors	2025-05-15 14:36:02 +08:00
joye	726ae62113	add async load api	2025-05-15 13:36:46 +08:00
joye	4dd9e8c0c8	add for async load builtin	2025-05-15 13:29:39 +08:00
BingYuan.Zhou	41c17d0a95	fix moe sorting build fail (#2190) * fix moe sorting build fail * refile code --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-14 09:31:26 +08:00
Illia Silin	58f9e9ffbc	Update the buffer load/store intrinsic names for clang>=20. (#2192) * fix the buffer load/store intrinsic names * fix clang format	2025-05-13 10:18:14 -07:00
Bartłomiej Kocot	c53b7bd22e	Switch to v2 pipeline for grouped conv bwd data (#2181) * Change to old pipeline for grouped conv bwd data * fix * fix * fix * fix * fix * fix * Fix	2025-05-13 10:14:30 +02:00
Po Yen Chen	2920604786	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163 ) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to slience the compilation error * [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-05-13 12:19:25 +08:00
Khushbu Agarwal	f05e45ba59	Disable SMFMA gfx90a (#2184) * sparsity fix for gfx90a * reverting tile_engine changes	2025-05-12 09:56:23 -07:00
Thomas Ning	b49f7de81f	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed	2025-05-12 09:52:58 -07:00
Thomas Ning	9d1e44e56a	Vectorized Transpose for Batched Transpose CK Tile Operator (#2131 ) * Shared Memory for single data point * CKTile Transpose vectorize CP1 * CKTile Transpose vectorize CP2 * CKTile Transpose vectorize CP2.1 * fixed the compile error of the transpose tile 2d * Have the correct result for the current test sample * Changes to printing tensor * fp8 support added * Debugging for transpose * solving the corner issue * Changed padding flag * Intermideate Debugging * Intermidiate Debugging * Intermediate Debugging * Finished debugging of the transpose op * Code Cleanup * Adding edge case smoke tests * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Adding Transpose test to CI/CD * Addressing Review Comment * Addressing Comments * Addressing Comments * Measuring Perf Tests * Code Cleanup * Changlog * Added the running iterations * clang format * Fix the changelog * Fix the compilation error * change the printing factor --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>	2025-05-12 00:41:45 -07:00
Khushbu Agarwal	d8faf1c6a1	Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172) * Changes for updating tile distribution for shuffle and transpose * Fixed swizzle and transpose, removed comments * clang formatted * Adding support for bf16 type * Addressing review comments	2025-05-10 22:40:05 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
jefyang1	6b1a339b6f	Fix grouped conv bwd data tests on gfx950 (#2173)	2025-05-09 09:01:06 -07:00
Khushbu Agarwal	ef72a4b9bc	Disable SMFMA for gfx90a (#2182)	2025-05-09 00:18:07 -07:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Thomas Ning	c757046d49	Revert "Disable the SMFMA instruction for gfx90a. (#2174)" (#2175) This reverts commit `a32d907771`.	2025-05-08 00:07:03 -07:00
Khushbu Agarwal	a32d907771	Disable the SMFMA instruction for gfx90a. (#2174) * remove smfma for gfx90a * clang formatted	2025-05-07 23:09:22 -07:00
BingYuan.Zhou	6a3960c1e1	Flatmm merge (#2168) * sync with function interface of cshuffleepiloge,fix flatmm build fail * move code from solin/flatmm which add mfma161632fp8 and optimize flatmm --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-08 12:59:57 +08:00
jakpiase	cb07ad84d5	fix for default epilogue (#2167)	2025-05-07 10:46:53 -07:00
Aviral Goel	769336b640	[CK_TILE] Add type traits to detect tile window types at compile time (#2158) * added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline * added type traits instead of enum to tile_window() and tile_window_linear() with debug comments * removed comments, added documentation and clang format	2025-05-07 00:00:39 -07:00
Rostyslav Geyyer	8a0d659f92	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-06 09:24:00 -05:00

917 Commits