composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
valarLip	0fdbf6bcd1	extend buffer load for fp16/bf16x16 (#2270) * extend buffer load for fp16/bf16x16 * format	2025-06-02 10:29:54 +08:00
Illia Silin	4e561af18c	Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260) This reverts commit `fd6a859b44`.	2025-05-29 16:22:16 -07:00
joyeamd	fd6a859b44	add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues	2025-05-29 14:31:14 +02:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplfied calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00
Casey-Shi	128f5a1eab	[Tile Engine] Add benchmark for tile engine gemm. (#2193) * initial commit -m benchmark * only support profile * fix * fix doc * add default config * add ci * fix cmake * tmp save for gen blobs * fix bug * merge * range config * test success * fix * fix * move struct * remove config property * fix config * remove comment * add cmake option & modify * add changelog * fix * format * add pydantic module to the docker image * fix * add benchmark for cold and warmp up * python format * add asm cache control * fix README * remove pydantic module * modify changelog * fix config * recover benchmark_gemm and fix * format python * refactor profiler * fix csv bug * fix codegen bug * add kernel instance object * add benchmark gemm executable * fix jenkins & delete extra header * disable warning output & enable default config * Disable sparsity for invalid warp tile combinations * fix gemm host template func * refactor gemm profiler * filter out some inmstances * default config test & fix codegen bug * add sparse flag to gen more instances --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-05-26 22:32:36 -07:00
Po Yen Chen	c42b957d65	[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask (#2209) * Assign block indices reversely if kHasMask=true * Assign block indices reversely for splitkv kernel	2025-05-27 10:58:58 +08:00
Zzz9990	ece38b9d7a	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit `aa9847e42d`. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com>	2025-05-26 19:17:18 +08:00
Illia Silin	8146e471f1	fix the buffer intrinsic names for clang >=20 (#2228)	2025-05-23 14:58:25 -07:00
Illia Silin	1b846143c6	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227) This reverts commit `58f9e9ffbc`.	2025-05-22 15:41:17 -07:00
Aviral Goel	534d4594d0	Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy (#2214) * window_origin variable now in base class * abstracted more functions * consolidated tile_window_static_distribution and tile_window_static_lengths * clang format * skeleton code for tile_window and tile_window_linear consolidation * more abstraction * moved variables from child to parent * clang format * removed comments * removed debug code * removed debug code * abstracting traits WIP * consolidated traits * removed comments and clang formatted	2025-05-21 23:28:00 -07:00
Aviral Goel	fa39c4e798	Add Doxygen Documentation for HostTesnor, HostTensorDescriptor, DeviceMem, FillUniformDistribution (#2160) * added documentation for HostTensorDescriptor * added documentation for DeviceMem and FillUniformDistribution * fixed merging error * fixed host_tensor_descriptor error * clang format	2025-05-21 10:34:30 -07:00
joye	bc4c7d26af	avoid compiling device functions when compile host	2025-05-21 16:52:58 +08:00
joye	8e3e59cb1f	fix a commented format issue	2025-05-21 15:28:16 +08:00
joye	2e296ee963	improve codes	2025-05-21 15:22:33 +08:00
joye	7377bc7200	fix some sync issues	2025-05-21 15:05:52 +08:00
joye	619508f89d	change lds layout	2025-05-21 14:35:05 +08:00
joye	ba86551534	fix a typo error	2025-05-21 14:29:49 +08:00
joye	37a4a9aae4	fix a CRTP issue	2025-05-21 14:13:09 +08:00
joye	88601c8a05	update async policy	2025-05-21 13:31:13 +08:00
joye	ad32373ff1	fix clang format	2025-05-21 11:56:35 +08:00
joyeamd	8470702ac0	change async pipeline's tile distribution pattern from thread to warp	2025-05-21 11:53:13 +08:00
Sami Remes	d1e6f0982d	[CK_TILE] Grouped GEMM tile loop (#2146) * Add trait to use a persistent kernel and split the entrypoints in grouped gemm * Some helper functions for persistent kernel case * Get max occupancy grid using device properties * Implement tile loop in main entry point to grouped gemm * Enable GridSize() on device * Handle offset tile index using real current block index * Add persistent kernel choice to grouped gemm example * Use a for-loop for iterating over the group * Reduce VGPR spills by early-exit * Enable persistent kernel choice in grouped_gemm example * Add persistent kernel option to grouped_gemm test * Fix formatting with remod.py * Remove GridUpdateBlocks as blocks are now iteratively computed * Add comment about VGPR spilling * Fix formatting * Use CK_TILE_HOST instead of __host__ * Enable all Row/Col combinations in grouped gemm unit test * Add some KBatch=2 cases to grouped gemm tests * Fix SplitK for grouped gemm * Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm * Add type traits * Split examples to regular and tileloop * Formatting * Use hipExtStreamGetCUMask to get current active CUs for the given stream * Align test and example kernel config, and disable validation for splitk repeats * Remove debug options from CMakeLists.txt * Separate the code paths for persistent/non-persistent in test * Fix formatting * Address review comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-20 17:18:57 +03:00
joye	5079e3f3a2	update lds descriptor	2025-05-20 17:14:29 +08:00
joye	fee156a37d	fix a descriptor issue	2025-05-20 16:58:18 +08:00
joye	33c643001e	fix some compiling errors	2025-05-20 15:52:33 +08:00
joye	55f3632901	fix a compiling error	2025-05-20 15:16:53 +08:00
joye	c71b3840e8	fix lds descriptor	2025-05-20 15:00:32 +08:00
joye	3500259bdc	update async load apis	2025-05-20 13:26:49 +08:00
joye	91113f6464	comment some not gfx950 codes	2025-05-19 10:13:49 +08:00
joye	4a544c4084	comment some not gfx950 codes	2025-05-19 10:11:49 +08:00
joye	5c936dc8f3	add ignore header file	2025-05-19 09:34:26 +08:00
joye	0f2dfa8d38	fix async inline assembly	2025-05-19 09:33:17 +08:00
joye	2a9f0fff5d	fix async inline assembly	2025-05-19 09:30:53 +08:00
joye	f0e1dbca49	fix some issues in async load	2025-05-19 08:56:03 +08:00
joye	7a03addc12	fix some compiling errors	2025-05-19 08:28:29 +08:00
Po Yen Chen	791802b381	[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198) * Write soft-sign in inline asm * Change tile idx computation * Add macro to turn off soft-sign asm opt * Use simple for loop to avoid register spill * Only do block id transform for masking cases	2025-05-16 15:14:46 +08:00
joye	9c429ad0cc	add a new pipeline for async load	2025-05-16 09:25:16 +08:00
Khushbu Agarwal	3d8d6e75e4	Adding validation for tile sizes in Tile Engine (#2189) * Adding validation for tile sizes * Add architecture in config, and shuffle lines of code in warp_gemm.hpp * Enable MFMA for gfx950, and invalid tile handling	2025-05-15 10:28:31 -07:00
joye	acdef41575	add a pipeline which copies from v4	2025-05-15 18:00:30 +08:00
joye	0c15891dea	fix some compiling errors	2025-05-15 14:54:13 +08:00
joye	8c09b31f72	fix a compiling error	2025-05-15 14:39:21 +08:00
joye	2fdcb55cf1	fix some compiling errors	2025-05-15 14:36:02 +08:00
joye	726ae62113	add async load api	2025-05-15 13:36:46 +08:00
joye	4dd9e8c0c8	add for async load builtin	2025-05-15 13:29:39 +08:00
BingYuan.Zhou	41c17d0a95	fix moe sorting build fail (#2190) * fix moe sorting build fail * refile code --------- Co-authored-by: solin <bingzhou@amd.com>	2025-05-14 09:31:26 +08:00
Illia Silin	58f9e9ffbc	Update the buffer load/store intrinsic names for clang>=20. (#2192) * fix the buffer load/store intrinsic names * fix clang format	2025-05-13 10:18:14 -07:00
Po Yen Chen	2920604786	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to slience the compilation error * [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-05-13 12:19:25 +08:00
Khushbu Agarwal	f05e45ba59	Disable SMFMA gfx90a (#2184) * sparsity fix for gfx90a * reverting tile_engine changes	2025-05-12 09:56:23 -07:00

304 Commits