composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 18:17:44 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	f6bb48458d	[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant pipeline (#4268 ) ## Proposed changes Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline. ## Checklist - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [X] I have added inline documentation which enables the maintainers with understanding the motivation - [X] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [X] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3687](https://github.com/ROCm/composable_kernel/pull/3687) 🧑‍💻 Originally authored by @ErwinTerpstra --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-10 06:57:55 -07:00
Yi DING	1ac61a54c9	[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation (#4451 ) ## Motivation This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers to compile across multiple GPU architectures by introducing compile-time availability gating and a new attribute tag mechanism for kernel symbol/attribute specialization. ## Technical Details - Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline and propagate availability handling into `QuantGemmKernel`. - Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and unique symbols. - Update the blockscale GEMM quant example to pass kernel attributes and adjust gfx950 gating. ## Test Plan - CI - Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950' -GNinja && ninja tile_example_gemm_quant` - Local test with ROCm/aiter#1954 ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-10 12:41:09 +00:00
Bartłomiej Kocot	23b32f1ff8	[CK] CK Tile grouped convolution direct load (#4406 ) ## Motivation CK Tile grouped convolution forward direct load support. ## Technical Details Basic pipeline for direct load and new instances for forward for v1 and v4 pipelines. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-130	2026-02-09 22:08:57 +01:00
assistant-librarian[bot]	4304c2c38e	[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280 ) ## Proposed changes gemm blockscale eightwarps support ## Checklist - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650) 🧑‍💻 Originally authored by @kensclin --------- Co-authored-by: KenSCLin <lshyhchy@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-09 11:54:54 +08:00
jakpiase	dfa95522d3	[CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357 ) Added support for conv v6 pipeline in ck tile's convolution forward kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline and should be faster than other pipelines for some cases. This PR also adds tests inside profiler that's currently inside experimental directory, so now we should be able to detect regressions easier. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: subhajitdchow <sduttach@amd.com>	2026-02-08 20:57:14 +01:00
Emily Martins	2a765fbbad	[CK_TILE] Fix MMA concepts compiler error (#4381 ) ## Motivation CK Tile is required to support certain older OSs; on these OSs, cpp 20 is not fully supported. For ROCm 7.2, compiler errors occur on one of these older OSs. An example of this error is as follows: ```bash /composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments 34 \| { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>; \| ``` The goal of this PR is to resolve these compiler errors. ## Technical Details The existing guards around the mma concepts only check if the concepts language feature is supported, as follows: ```cpp #if defined(__cpp_concepts) && __cpp_concepts >= 201907L // ... template <typename CtrlFlags> concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) { // Flag members for Gfx9 MFMA instructions { CtrlFlags::Cbsz } -> std::convertible_to<int>; { CtrlFlags::Abid } -> std::convertible_to<int>; { CtrlFlags::Blgp } -> std::convertible_to<int>; }; #endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L ``` That said, in cases where functionality from the `<concepts>` header is used (e.g., `std::convertible_to`), this guard fails to check whether the `<concepts>` header is available. This change adds an additional check to the concepts that make use of functionality from the `<concepts>` header to ensure the header is available. ## Test Plan I tested the changes on the relevant docker for gfx90a, gfx950, and gfx942 and the compiler issue is not present. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-06 16:26:57 -08:00
assistant-librarian[bot]	9c0d4114ae	[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263 ) Implement per-page K/V quantization for paged attention: - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum - Use exp2 shift trick to eliminate explicit P scaling overhead - Prefetch physical pages offset for KV cache, overlaps with computations ## Checklist - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696) 🧑‍💻 Originally authored by @Jeff-Huang --------- Co-authored-by: Jeff Huang <chiachi.huang@amd.com> Co-authored-by: Illia Silin <Illia.Silin@amd.com>	2026-02-04 18:25:31 -05:00
Aviral Goel	b948026e16	feat: add split_k support for block scale gemm bquant mode. (#3653 ) * WIP: add splitk to bquant * feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types * chore: remove temporary test script * fix: incorrect tile window length for splitted bq tensor window * chore: improve comments * test: add unit tests to cover bquant splitk functionality * fix: conflict resolution by renaming variables [ROCm/composable_kernel commit: `3e77721755`]	2026-02-02 14:41:53 -08:00
Jan Patrick Lehr	470f031e58	[Compiler] Addressing new compiler warnings (#3640 ) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `069500464d`]	2026-02-02 09:39:48 -08:00
ZheWang	c006b10452	Mx fp6 flatmm (#3601 ) * add fp6 data-type and support sync/async dwordx3 load/store * clang-format * pre-commit * 1st commit * default mnk pass ut * fix a distrubution * fix * fix bdram distr * update * pass ut * improve perf * update * clean code * resolve copilot comment * reslove comment * clang-format --------- Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `e6bcd192d4`]	2026-02-02 16:04:40 +08:00
Po Yen Chen	59a132c68d	[CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672 ) * Change call to the intrinsics * fix clang format * Undo changes under include/ck/utility * Use named variable as vector size --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `8c1788757a`]	2026-01-30 12:02:49 -08:00
jiangyon.ren	ce51308aaf	[CK_TILE][FMHA] Add sparse attention VSA (#3341 ) * add sparse attention VSA * fix the pre-commit * Add jenga test and pre-commit * add bf16 for vsa * add jenga support bf16 * remove lse arg * split kernel code to block & kernel * fix the pre-commit * fix the pre-commit * fix the copyrights * fix the copyright * fix the copyright & rename block to pipeline * fix the copyright and pipeline * remove lse & dropout & add fmt * fix the jenga&VSA code review * remove the useless code & resolved the comments * remove useless code * remove useless code * Clean up code * Remove more unused code * Re-format .hpp * Refactor codegen scripts --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `4d2f8c111e`]	2026-01-31 00:59:47 +08:00
Erwin Terpstra	09d443a7ad	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603 ) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `6a6177a246`]	2026-01-30 04:40:50 -07:00
MHYangAMD	24cf4cf9a8	Fix redundant cast in model sensitive rmsnorm (#3681 ) * Fix redundant cast * Fix linting [ROCm/composable_kernel commit: `6ff0737843`]	2026-01-30 10:52:19 +08:00
Khushbu Agarwal	68b475ad92	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 ) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition [ROCm/composable_kernel commit: `9b168082b7`]	2026-01-28 19:45:09 -08:00
Jeff Huang	29c56b8aae	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657 ) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `e3556fed04`]	2026-01-29 07:18:41 +08:00
Yi DING	bb0986e59e	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `8e3d84aba3`]	2026-01-27 23:46:49 -08:00
damien-lejeune	373d8dd63d	[CK Tile] multi reduce improvements (#3607 ) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `91e32f305f`]	2026-01-27 12:56:09 -08:00
Illia Silin	71ac48d63a	fix some syntax errors (#3658 ) [ROCm/composable_kernel commit: `b26cb596b0`]	2026-01-27 09:59:39 -08:00
Bartłomiej Kocot	ab6bbbfee1	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624 ) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt [ROCm/composable_kernel commit: `3d67e6c492`]	2026-01-27 11:04:11 +02:00
Aviral Goel	a26adffadf	feat: Add Interwave scheduler for aquant memory pipeline (#3540 ) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite [ROCm/composable_kernel commit: `b8751e505d`]	2026-01-26 11:27:42 -08:00
Thomas Ning	0983dea2be	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634 ) * Finished the work * Fix the pipeline [ROCm/composable_kernel commit: `3900e1e7ce`]	2026-01-26 10:29:28 -08:00
Emily Martins	b6f1e99074	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00
ltqin	90b3476006	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit 723b7ce0be2884da131036301892bf9157f51876. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `67f0b74ec6`]	2026-01-23 09:03:22 -08:00
Po Yen Chen	4ded7e5984	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969. [ROCm/composable_kernel commit: `de5a1d730d`]	2026-01-22 21:21:19 -08:00
kensclin	16e6a2c696	GEMM Blockscale ABQuant Optimization (#3620 ) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `31a35ecab4`]	2026-01-22 09:39:38 -08:00
Bartłomiej Kocot	9c3ab51d9b	[CK TILE] Fix basic gemm pipelines (#3611 ) * [CK TILE] Fix basic pipelines * fixes [ROCm/composable_kernel commit: `44f481a45c`]	2026-01-22 08:11:18 -06:00
Linjun-AMD	f6fac4cea6	[CK_TILE][FMHA]Add new tile size for async (#3623 ) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0b13697a88`]	2026-01-22 16:07:14 +08:00
ltqin	14254656f0	Fp8 block scale quantization for fmha fwd (#3330 ) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowd in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantation P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `dd0b4294af`]	2026-01-21 20:58:26 -08:00
Yi DING	a0935f7669	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615 ) [ROCm/composable_kernel commit: `fcc9372c00`]	2026-01-21 09:54:46 +08:00
Max Podkorytov	8b842250da	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations [ROCm/composable_kernel commit: `91b4102a59`]	2026-01-20 10:37:09 -08:00
Linjun-AMD	e227e837be	Revert "[CK_TILE][FMHA] Add new tile size for async (#3586 )" (#3613 ) This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1. [ROCm/composable_kernel commit: `8f75869408`]	2026-01-20 09:40:54 -08:00
Bartłomiej Kocot	85c5741492	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518 ) * [BULDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes [ROCm/composable_kernel commit: `0727e85e52`]	2026-01-19 22:29:01 -07:00
Cong Ma	c42cd28370	[CK TILE] remove dependency on std chrono (#3599 ) * [CK TILE] remove dependency on std chrono * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0517d43d31`]	2026-01-19 15:31:02 -08:00
Linjun-AMD	ecda0fe2e9	[CK_TILE][FMHA] Add new tile size for async (#3586 ) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f3aafb9555`]	2026-01-19 15:22:33 -08:00
Adam Osewski	a9ff38bc89	[CK_BUILDER] Convolution forward transfer concepts. (#3535 ) * Rename member variable to better reflect its actuall meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * FIx merge artifacts. * Add generic access order limits into block transfer concept. [ROCm/composable_kernel commit: `1a6d1b59ef`]	2026-01-19 10:54:10 +01:00
Cong Ma	487f1beee9	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 ) [ROCm/composable_kernel commit: `f9104ef9b3`]	2026-01-16 15:27:39 -08:00
Estevan Vedovelli	09d084bfb4	Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `e30207985a`]	2026-01-15 09:35:24 -05:00
Jeff Huang	445ec888ba	[FMHA] Enable page size 16 for batch prefill kernel (#3568 ) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout. [ROCm/composable_kernel commit: `993d3e2f0e`]	2026-01-15 22:11:44 +08:00
Khushbu Agarwal	7da4e47a5f	[CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * G128 working for both prefill and decode shapes for preshufflequant * clean up after merging with develop * fixing group 64 for decode shapes * non preshufflequant working for group size 128 * enable preshuffleb and preshufflequant with variour group sizes * reduce build time by splitting example into diff datatype files * Adding tests for preshuffleQuant * address review comment * fix for gfx1201 * compile time fix for gfx1201 * clang formatted --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com> [ROCm/composable_kernel commit: `118afa455c`]	2026-01-14 10:00:19 -08:00
Linjun-AMD	75ea587550	[CK_TILE][FMHA] Enable gpt-oss sink (#3490 ) * Enable gptoss sink Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add gptoss sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix test args error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update test_fmha_fwd.cpp * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Revert "update sink test" This reverts commit `970b4f1686`. * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update valid sink_v in splitkv pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix clangformat error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix aiter scale error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_pipeline_qr_ks_vs.hpp * div scale_s for sink_value Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * update sink_value with bias Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Fix typo in dropout parameter in fmha_batch_prefill_kernel * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * optimized some code Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix splitkv error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update sink reference Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `717ed0b59f`]	2026-01-14 21:32:06 +08:00
Thomas Ning	0c8c232a0a	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `00c46785a8`]	2026-01-13 09:21:29 -08:00
Jeff Huang	0d13ef7329	[CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326 ) This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values: 1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors. 2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero. [ROCm/composable_kernel commit: `141f77aa12`]	2026-01-13 13:52:26 +08:00
Jeff Huang	99b88be5fb	[FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545 ) - Enable page_size=1 support in batch prefill codegen (linear layout only). - Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly. - Relax `kPageBlockSize` alignment assertion for the page_size=1 case. [ROCm/composable_kernel commit: `c9f112b026`]	2026-01-13 12:04:43 +08:00
ZheWang	0a2c5c6262	fix mxfp8-gemm example failure (#3531 ) Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `a575acb245`]	2026-01-13 10:26:45 +08:00
Aviral Goel	d4718f5f31	WIP: extract MakeALdsDescriptor() from child to parent class for code readability (#3392 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `5aaa031350`]	2026-01-12 09:51:58 -08:00
Aviral Goel	3096269434	refactor: remove Default scheduler implementation as it not used anymore (#3542 ) * refactor: remove Default scheduler implementation as it not used anymore * refactor: remove dead code from gemm universal kernel * chore: add descriptive comments about amd intrinsic hardware sync instructions * fix: label existing memory pipeline for aquant as intrawave [ROCm/composable_kernel commit: `e809861d49`]	2026-01-12 09:51:06 -08:00
damien-lejeune	693548d8b2	Dlejeune/ck tile 2d multiple reductions (#3147 ) * WIP * Add Unit tests for the Multi Reduction Kernel * clang format * Rename multiblock to threadwise * Multiblock WIP * Fix multi reduce multi block unit tests * Multi Reduce Tile Engine: WIP * refactoring + try addressing precision error * Fix multiops examples * Cleanup * Clean up tile engine's reduce op * Update changelog * Fix remod/clang * Fix dates * Fix documentation & missing file * Fix comments * Use the update_tile api in the multi-block kernel * Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests * Add TileParitioner * Cleanup * Add warning when no data to process, in the example * Refactoring Reduce kernel Tile Partioner + cleanup * Move the tile partioner to its own file * Add missing includes * Fix copyright header with update_amd_copyright_headers.py * Fix change of interface in Reduce2dProblem --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `4216d43da8`]	2026-01-09 11:16:37 +01:00
Bartłomiej Kocot	5b70f71374	[CK TILE] Fix grouped conv kernels splitk and double lds (#3527 ) [ROCm/composable_kernel commit: `bc497beffb`]	2026-01-08 07:59:38 +01:00
Cong Ma	026c9200ee	[CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512 ) Refactor function amd_buffer_load_invalid_element_return_zero to avoid the inefficient ASM code generated by compiler. Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage Tested compilers: - Rocm 7.0.1 - Rocm 7.1.1 Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `d7497d2694`]	2026-01-07 00:05:56 -08:00
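
The chunk-index formula quoted in the message of commit 8b842250da (#3520) can be sketched in isolation. This is a minimal host-side sketch, not the kernel code from that PR: the function name `chunk_index` and the example values are assumptions chosen for illustration. Only the formula and the parameter names `tile_idx`, `tile_idx_pivot`, `tiles_per_chunk`, and `num_chunks` come from the commit message.

```cpp
#include <cstdint>

// Hypothetical helper (not the CK Tile API): maps a tile index to the chunk
// whose signal a persistent kernel would wait on, using the formula from the
// commit message:
//   chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
// Assumes tiles_per_chunk > 0 and num_chunks > 0.
constexpr std::uint32_t chunk_index(std::uint32_t tile_idx,
                                    std::uint32_t tile_idx_pivot,
                                    std::uint32_t tiles_per_chunk,
                                    std::uint32_t num_chunks)
{
    // Integer division groups consecutive tiles into one chunk; the modulo
    // wraps the chunk index around once it passes the last chunk.
    return ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks;
}

// Compile-time checks with illustrative values (C++17).
static_assert(chunk_index(0, 0, 4, 3) == 0);  // first tile, first chunk
static_assert(chunk_index(5, 0, 4, 3) == 1);  // tiles 4..7 share chunk 1
static_assert(chunk_index(12, 0, 4, 3) == 0); // wraps back to chunk 0
static_assert(chunk_index(10, 2, 4, 3) == 0); // the pivot shifts the start
```

With `tiles_per_chunk = 4` and `num_chunks = 3`, tiles 0 to 3 map to chunk 0, tiles 4 to 7 map to chunk 1, and tile 12 wraps back to chunk 0.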


668 Commits