composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
Ding, Yi	f122fc731f	fix	2026-04-23 03:46:52 -05:00
Ding, Yi	5c9134f72b	Merge remote-tracking branch 'origin/develop' into users/yiding12/fmha-bwd-workspace	2026-04-23 01:50:54 -05:00
Illia Silin	cfb09d76a5	[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550 ) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Sami Remes	1e4eebfba8	[CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611 ) `MakeAsyncLoadADramWindow` in `GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes strides from lengths and assumes a dense layout. When the input view's leading-dim stride is larger than its inner length (non-packed memory layout), the resulting tile window stepped through memory at the wrong stride. Compose the unmerge transforms on top of the input view's existing descriptor instead, so the actual runtime strides are preserved and the correct `element_space_size` is inherited for bounds checking. ## Test Plan Added an unit test showing the problem. ## Test Result The new test fails before fixes and passes after. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 12:52:02 +02:00
Yi DING	30f2874063	Merge branch 'develop' into users/yiding12/fmha-bwd-workspace	2026-04-22 15:17:57 +08:00
Ding, Yi	3607588ca4	FMHA BWD workspace: 4K-align dq_acc base Mark the FmhaBwdWorkspaceManager size/offset accessors as CK_TILE_HOST (they are only invoked from host-side workspace setup), and pad GetWorkspaceHostSize up to a 4K boundary so the GPU dq_acc buffer always starts on a page-aligned offset.	2026-04-22 02:08:34 -05:00
Linjun-AMD	803874c73b	CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479 ) ## Motivation The existing paged-KV attention pipelines (pagedkv, splitkv) support StreamLLM-style sink tokens — a fixed set of initial tokens kept in attention alongside the sliding window. The `batch_prefill` pipeline (chunked-prefill with VLLM-style block tables) previously hardcoded `kHasSink = false`, making it incompatible with sink-based attention patterns in LLM serving scenarios. This PR extends `batch_prefill` to support `kHasSink` and wires it into `fmha_fwd_runner` for validation against the existing CPU reference. ## Technical Details Pipeline (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`): - When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv. - K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap. - V scatter-gather offsets are re-initialized at the transition to fix a window mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position. - Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated for sink-awareness. Traits / codegen (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`, `fmha_batch_prefill.py`): - `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded `false`). - Codegen adds `F_sink` field; skips batch-mode kernels (group mode required). - CMake test filter broadened from 9 → 33 instances covering fp16/bf16 × mask/nmask × lse/nlse × sink/nsink. Runner (`fmha_fwd_runner.hpp`, `CMakeLists.txt`): - `fmha_batch_prefill()` dispatched from `run_fwd` when: group mode + paged KV + num_splits == 1. - K/V strides corrected for runner's [num_pages, nhead_k, page_block_size, hdim] layout. - `page_block_size % 128` check relaxed: batch_prefill supports ps=16. - CPU reference paged-KV reordering guards extended with `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`. ## Test Plan Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run `tile_example_fmha_fwd` in group mode with page_block_size=16. Test matrix: - Mask: no-mask, causal, sliding window - Sink: nsink, sink=1..128 - dtype: fp16, bf16 - LSE output: on/off - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024} - GQA, chunked prefill, large batch×seqlen - page_block_size: 16, 32 ## Test Result 171 test cases, all valid:y: - nmask + nsink: ✓ - causal + nsink: ✓ - causal + sink=8: ✓ - sliding window + sink=8 (d=128, d=256): ✓ - bf16, LSE output, GQA: ✓ ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-21 11:03:55 +00:00
金黄色葡萄球君君	b5b3ba728d	projects/composablekernel: add SwigluStep support for MoE blockscale (#6118 ) ## Summary - add `swiglustep_and_mul` to the composablekernel MoE blockscale activation enum - implement the corresponding blockscale epilogue path for `SwigluStep` - keep existing `silu` and `gelu` paths unchanged ## Scope This PR covers the classic composablekernel blockscale MoE path under `projects/composablekernel`. This is separate from the `ck_tile` / FlatMM path being discussed in ROCm/rocm-libraries#5992. ## Motivation `Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The dependent AITER change needs native support for this activation in the classic composablekernel MoE blockscale path. ## Validation - patch is limited to two composablekernel files under `projects/composablekernel` - existing `silu` / `gelu` paths are unchanged - dependent AITER runtime validation hit the classic CK 2-stage path with AITER MoE enabled	2026-04-21 07:24:48 +00:00
Hosang Yoon	2574f37483	[CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253 ) ## Motivation - On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA forward adds noticeable overhead and causes a meaningful performance gap versus fp16. The asm-based path (mode 3) does not improve this on RDNA and can perform even worse. - In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower than the corresponding fp16 path. - This PR reduces that gap by switching FMHA forward to a different BF16 conversion strategy based on Triton’s canonical-NaN round-to-nearest-even behavior. ## Technical Details - Add a new `standard_cnan` BF16 conversion mode to CK Tile. - Implement a canonical-NaN RTN `float -> bf16` conversion path based on the Triton implementation. - Enable this conversion mode by default for FMHA forward builds targeting gfx11/gfx12. - Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some `hdim=128` cases to keep kernel selection aligned with the updated conversion behavior. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - all tests passed when running `test_ck_tile_fmha` - BF16 FMHA forward performance improves by up to ~5% on gfx11. - BF16 FMHA forward performance improves by up to ~10% on gfx12. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:52:24 -04:00
Bartłomiej Kocot	8fd401803f	[CK][CK Tile] Clamp element space size to max int32 value (#6168 ) ## Motivation Fix oob check by clamping element space size to avoid overflow when tensor is larger than 2GB. ## Technical Details - It is possible that tensor could be larger than 2GB but offsets no, so element space size must be clamped to 2GB if value is larger. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. https://github.com/ROCm/composable_kernel/issues/3722 Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-04-20 15:32:24 +00:00
Yung-sheng Tu	91b7dae95a	[CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857 ) ## Motivation The existing unification work supports DENSE and SPARSE intrinsics. In this PR, we enable support for SCALE intrinsics and add example SCALE implementations. ## Technical Details Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE intrinsics, and adding WMMA SCALE policy trait. Note: fp6 SCALE intrinsics support is not included in this PR, as its handling in ck_tile is currently more specialized and does not follow the same pattern as other datatypes. ## Test Plan Added new tests for the relevant SCALE specialisations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:28:23 +00:00
Zoltán Lakatos	f73bfe1b7e	[CK] Remove code duplications in grouped gemm fixed nk implementations (#4961 ) ## Motivation Different flavours of grouped gemm fixed nk implemenations share the same block to tile mapping logic. Despite that the code responsible for it is duplicated in each device struct implementation. - Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and `OffsettedBlockToCTileMapMLoops` from the device struct implementations to a common header file. - Use the generic Kernel Argument structures in xdl versions of the fixed nk. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan CI in general. Relevant test and examples are all fixed_nk versions of grouped gemm multiple D and ABD. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-04-20 12:24:59 +00:00
Hosang Yoon	f5e00ec904	[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450 ) ## Motivation `qr_hpad` currently executes work for padded head-dim fragments even when only a subset of the values are valid. This adds unnecessary computation for head dimensions that require padding, such as `hdim=72` and `hdim=80`, and hurts FMHA forward performance. The goal of this PR is to make the padded-head-dim path skip invalid work based on the actual valid fragment count, while preserving the existing behavior for the non-padded path. ## Technical Details This PR improves the `qr_hpad` FMHA forward path in three parts: - Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part of the fragment is valid. - Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the valid fragment range instead of always computing over the padded extent. - Retune the gfx11 `qr_hpad` kernel configuration after enabling the partial-fragment path. To keep the existing path stable, the implementation adds overloads for the updated GEMM/pipeline interfaces. This allows existing full-tile callers to keep using the previous form, while the `qr_hpad` path can pass valid fragment counts when needed. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - On gfx11 and gfx12, for head dimensions that require padding, `tile_example_fmha_fwd` shows about 20-30% performance improvement at `hdim=72/80`. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-18 06:44:46 +00:00
Ville Pietilä	7d6ef2396f	[MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421 ) ## Motivation Due to compiler version update, there are test failures in the test target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current develop branch, the test failures are blocking any PR merges into develop. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as the tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. First attempt to fix this problem had to be reverted in #6400 because it broke MIOpen internal DB sync tests. ## Technical Details The root cause for the test failures are the block-GEMM V5 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have large tile size. The V5 pipeline uses double register buffer that in combination with large tile size causes high register pressure. The latest version of compiler handles the register spillage incorrectly for `gfx90a`, which cause the kernel to output incorrect results. The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that do not use direct load for are divided into two groups - Base instances - Instances that result into high register usage (currently only one instance - one that causes the test failures). This division allows to disable only the V5 block-GEMM flavor of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The selective disabling leaves the set of instances for 1D and 2D convolutions unaffected, and removes at runtime two V5 block-GEMM instances (`ConvBwdWeightDefault` and `ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the device is `gfx90a`. Because MIOpen uses CK's type string (provided by method `GetTypeString`) to identify the instances, the DB sync tests are expected to unaffected since there are still the V2 block-GEMM instances that result in the same type string (`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by running the MIOpen DB sync tests that are not part of the normal CK PR build. ## Test Plan Running all CI tests + the MIOpen internal DB sync tests is sufficient to verify the correctness of the code changes. ## Test Result Verified locally that the previously failing tests `TestGroupedConvndBwdWeight3d/4.Test3D` and `TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts - 231 on `gfx90a` - 233 on `gfx942` and are currently passing. This confirms the expectation that two instances per data type should be disabled on `gfx90a`. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Ville Pietilä <>	2026-04-17 09:16:32 +03:00
Max Podkorytov	027b95a21c	[CK_TILE] Add CShuffleLds microbenchmark suite (#5383 ) ## Summary Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue for bank conflict analysis. ## Motivation CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS → registers for coalesced global writes). This suite isolates each operation to: - Identify which operation causes bank conflicts - Measure pure LDS bandwidth per access pattern - Validate access patterns across MFMA tile sizes and wave layouts ## Components - Microkernels (`tile_load_store_microkernels.hpp`): `StoreTile<Setup>`, `LoadTile<Setup>` - Setup Adapters (`benchmark_cshuffle_lds.hpp`): Wire CShuffleEpilogue to microkernels - Template (`benchmark_template.cpp.in`): Generated benchmarks with timing ## Build ```bash cmake -G Ninja -B build -S . \ -DGPU_TARGETS=gfx950 \ -DBUILD_CK_EXAMPLES=ON \ -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON ninja -C build bench_lds_fp8_16x16x128_2x2_fp8 ``` ## New CMake Options \| Option \| Default \| Description \| \|--------\|---------\|-------------\| \| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` \| OFF \| LDS microbenchmarks \| \| `BUILD_CK_TILE_FMHA_TESTS` \| ON \| FMHA tests \| \| `BUILD_CK_TILE_ENGINE` \| ON \| Tile engine \| \| `BUILD_CK_TILE_ENGINE_TESTS` \| ON \| Tile engine tests \| \| `BUILD_CK_EXAMPLES` \| ON \| Examples \| \| `BUILD_CK_TUTORIALS` \| ON \| Tutorials \| \| `BUILD_CK_DEVICE_INSTANCES` \| ON \| Device instances \| \| `BUILD_CK_PROFILER` \| ON \| Profiler \| Setting guards to OFF reduces cmake configure from ~150s to ~5s. --------- Made-with: Claude Code, Opus 4.5	2026-04-14 20:43:23 -07:00
msaffari-amd	1ae4925302	[CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863 ) ## Motivation The PermuteN epilogue was previously embedded within cshuffle_epilogue.hpp, despite having fundamentally different behaviour. Coupling these two independent strategies in one file introduced unnecessary complexity, SFINAE guards, and a dual operator() overload selected at compile time via TiledMMAPermuteN_ template parameter. This PR separates PermuteN into its own standalone file(pertmuten_epilogue.hpp), simplifying both implementations and making the codebase easier to maintain and extend independently. ## Technical Details New file: permuten_epilogue.hpp: contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from the permuteN code path in cshuffle_epilogue.hpp. Cleanup of cshuffle_epilogue.hpp: - Removed the TiledMMAPermuteN_ template parameter from [CShuffleEpilogueProblem] - Removed the SFINAE-guarded permuteN operator() overload - Removed the EnablePermuateN_ SFINAE alias - CShuffle now only contains CShuffle logic; EightWave support (independent feature) is retained Consumer migration : All consumer files now use compile-time epilogue selection via [std::conditional_t] `using GemmEpilogue = std::conditional_t< TiledMMAPermuteN, PermuteNEpilogue<PermuteNEpilogueProblem<...>>, CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;` Files modified: - flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp, mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples - run_gemm_quant_example.inc — block-scale GEMM example - gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker - test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp, test_gemm_pipeline_util.hpp — test utilities - universal_gemm_invoker.hpp — universal GEMM invoker - epilogue.hpp — add header updated to include permuten_epilogue.hpp ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-04-14 20:22:18 +00:00
arai713	c810a01ec6	[CK_TILE] Restructure Tile Engine's benchmarking and profiling (#4769 ) ## Motivation This PR introduces a restructure for the benchmarking and profiling aspects of CK Tile's Tile Engine, expanding on the groundwork from this previous https://github.com/ROCm/composable_kernel/pull/3434 and outlined in this [design document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX). In PR 3434, to reduce repeated code we implemented: - Base class that centralizes common functionality and provides a default implementation (Universal GEMM) - Child classes for GEMM variants override virtual functions to handle variant-specific behavior This refactoring in this PR follows the same process and should greatly reduce the duplicated code present in Tile Engine and make it simpler to add in new operations, increasing scalability. ## Technical Details The files have been refactored around new base structs for benchmarks, profiling and problem descriptions. The new base structs are: - GemmProblem - GemmBenchmark - GemmProfiler Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes that will inherit from these base structs overriding only what differs per variant. All common functions across the benchmarking and profiling files have been moved into newly added common utility files under the commons/ directory. The new utility files are: - utils.hpp: common functions for the benchmarking and profiling process - benchmark_utils.py: common utility functions for the benchmark generation ## Test Plan I tested using the existing tests for Tile Engine. ## Test Result All tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 10:50:24 -07:00
Estevan Vedovelli	43b33b9034	[ck] Clamp negative kernel execution elapsed time to zero (#6379 ) ## Motivation hipEventElapsedTime can return a small negative value on Windows when timing a very fast kernel launch on the null stream. This caused consumers of launch_and_time_kernel to receive a negative elapsed time, which they reasonably treat as an error, breaking otherwise-correct kernel executions. ## Technical Details After calling hipEventElapsedTime, a clamp is applied in launch_and_time_kernel before the result is returned, avoiding the return of a physically impossible elapsed time. The negative value from hipEventElapsedTime has been observed on Windows. For kernels that complete in well under a millisecond, the HIP event timestamps can alias such that the computed difference is a small negative number (observed: ~-1.78 ms). No HIP error is reported by any surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError), confirming the kernel itself executed successfully. ## Test Plan - Recompile CK and validate no kernel execution reports a negative elapsed time during hipTensor tests. - Pass the CI/CD pre-checking tests for CK. ## Test Result - All tests passing ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 09:14:26 -07:00
Po Yen Chen	470a48530b	[CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342 ) ## Summary - Skip fp16 FMHA forward dropout tests that use the d256 tile in batch mode, gated on compiler version - The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of Philox RNG parameters under high register pressure (383 VGPRs) - bf16 dropout tests are unaffected and cover the same code paths ## Root Cause The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state stored in VGPRs) with other live data during the softmax main loop. This causes corrupted `buffer_store_byte` writes for dropout randval on wave lanes 32-63, producing NaN in output and LSE tensors. Conditions: fp16 + d256 tile + dropout + batch mode + `qr` pipeline + gfx90a ## Changes - `include/ck_tile/core/config.hpp`: Add `CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro - `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in `TEST_P(Dropout, ...)` ## Test plan - [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip active) - [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED - [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)	2026-04-14 14:07:20 +00:00
Yaswanth Raparti	d988d55275	[CK][CK TILE] Modify elementwise kernel template signature to accept independent type arguments (#6399 ) ## Motivation modify elementwise kernel template signature to fix cshuffle epilogue build error ## Technical Details Encountered a build error while building conv fallback kernel with dispatcher. Error: Type mismatch in `ElementWiseKernel::operator()` where the template required all three parameters (lens, input_strides, output_strides) to be the same type, but the CShuffle epilogue was passing them with different tuple element types. Solution: Modified the template signature in elementwise_kernel.hpp to accept three independent type parameters: Changed from single typename `Dims` to typename `DimsLens`, typename `DimsInStrides`, typename `DimsOutStrides` Updated references to `Dims::size()` to use the appropriate specific type ## Test Plan - Test with dispatcher conv unit tests - Relying on CI tests ## Test Result - Dispatcher unit tests passed - Relying on CI tests ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 01:44:27 -06:00
chris-tsiaousis-hpc	89c5e67028	[CK Tile] Unification work - mma transformations pipeline (#5508 ) ## Motivation In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing. For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform. A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR. The pipeline base class is basically at the RFC stage. We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline. ## Technical Details The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed. ## Test Plan New tests are added. ## Test Result Tests should pass. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-04-14 09:25:01 +02:00
Brock Hargreaves	5eee93e67c	[MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400 ) Reverts ROCm/rocm-libraries#6343 This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.	2026-04-13 20:46:07 -06:00
Ville Pietilä	6e0454216d	[CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343 ) ## Motivation Due to compiler version update, there are test failures in the test suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current `develop` branch, the test failures are blocking any PR merges into `develop`. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. ## Technical Details This PR disables the compilation of the problematic bwd weight conv instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that allows us to detect when we are compiling for `gfx90a`. Using the new CMake flag, compilation of instance `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`. Co-authored-by: Ville Pietilä <>	2026-04-13 13:40:27 +02:00
Kiefer van Teutem	6cd016dde4	[CK Tile] Add Tile Distribution Encoding Calculator (#5515 ) ## Motivation We want to be able to calculate TileDistributionEncodings describing register mappings for any MmaOp. This is necessary for further integration with CK Tile. This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma type (MmaOp) and provides ABC warp distribution encodings for mapping matrix fragment coordinates to register coordinates (lane, vector item) and vice versa. It is able to take CTranpose, Swizzle, and NumAccessA / NumAccessB template parameters for tweaking the tile distributions. Swizzle modification will be implemented later. The current implementation can deal with all intrinsic types and block-hiding. This MR also adds some additional static asserts and derived params within amdgcn_mma_base, to enforce consistency and help calculate Tile Distributions for block-hiding intrinsics. An Example was added that uses the Tile Distr Enc Calc to calc and print register layouts for Tile Distributions for some of our amdgcn_mma structs. It also makes sure that the CTranspose modifier works as intended. Some additional gfx9 intrinsics were added to test block-hiding layouts for the different types of C-block-hiding layouts. The sparse intrinsic wrappers were updated according to Chris's recent changes in another branch (https://github.com/ROCm/rocm-libraries/pull/5508), which moved the compression step outside of the intrinsic itself. This is necessary to make sure that the Calculator can deal with this new interpretation of the sparse intrinsics. I directly copied the new amdgcn structs from Chris's branch and changed nothing else to avoid more complex merges in the future. Note that this means I did not update a bunch of related sparse code since that would be a lot, and therefore I disabled test_amdgcn_sparse_mma for now. The amdgcn_mma_layout test was refactored a bit: - The old register mapping utility was removed and its use was replaced by the new TileDistrEncCalc - More tests were added to test layouts for different types of block-hiding and sparse intrinsics - The Selector method was removed and the tests were split up over target architectures, with each target arch having a direct list of amdgcn structs to be tested. This ensures that we force specific tests on specific architectures and makes sure that the selector doesn't quietly do some workarounds like creating compound intrinsics. ## Test Results Layout tests based on calculated tile distribution encodings pass on all architectures. Calculator works for all currently added amdgcn structs, which includes different types of block-hiding and sparse intrinsics. Printed layouts from new example verified by eye. CTranspose modifier tested for large set of intrinsics.	2026-04-13 08:00:31 +00:00
Aviral Goel	4ccbcbe0a4	CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302 ) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
Hosang Yoon	65ad35becd	[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156 ) ## Motivation On gfx11/gfx12, FMHA forward kernels that require head-dim padding show a large performance drop compared to the exact-head-dim path. In practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too far off the fast path. This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while keeping the behavior for other GPUs unchanged. ## Technical Details - Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path for gfx11/gfx12. - For `receipt=0`, keep support conservative and only enable the padded fast path for vector-safe cases (`head_dim % 8 == 0`), matching the existing assumption used on other GPUs. - Move `v_prefetch` later only for the head-dim-padded path on gfx11/gfx12. This reduces live ranges and removes the register-spill behavior seen in the earlier scheduling. - Enable the buffer-load OOB check offset trick for the padded path on gfx11/gfx12. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result Observed padded-head-dim performance improvements for HDIM=72/80: - gfx11: about ~3.5x - gfx1151: about ~2.0x - gfx12: about ~1.3x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-08 10:51:53 -04:00
Ding, Yi	bedd60a568	Merge remote-tracking branch 'origin/develop' into users/yiding12/fmha-bwd-workspace	2026-04-07 23:13:09 -05:00
Ding, Yi	5a63b343d0	Fix	2026-04-07 23:12:39 -05:00
Christopher Millette	870112b861	[CK_TILE] Flatten nested static_for loops into static_ford (#5939 ) ## Summary Mechanical conversion of 129 nested `static_for`/`static_ford` patterns to flat `static_ford` across 29 ck_tile header files. Each conversion eliminates intermediate lambda closure instantiations by replacing nested compile-time loops with a single flat iteration using index decomposition. ### What `static_ford` eliminates When `static_for` loops are nested, each level creates unique closure types: ```cpp // BEFORE: M + M×N = 20 IR functions (for M=4, N=4) static_for<0, 4, 1>{}([&](auto m) { // 4 closure instantiations static_for<0, 4, 1>{}([&](auto n) { // 4×4 = 16 closure instantiations body(m, n); }); }); // AFTER: M×N = 16 IR functions (with ford_applier, no intermediates) static_ford<sequence<4, 4>>{}([&](auto mn) { constexpr auto m = number<mn[number<0>{}]>{}; constexpr auto n = number<mn[number<1>{}]>{}; body(m, n); }); ``` ### Pattern categories converted \| Category \| Count \| Description \| \|----------\|-------\|-------------\| \| C (2-level `static_for` chains) \| 112 \| Nested `static_for` → `static_ford` \| \| C3 (3-level `static_for` chains) \| 9 \| Three consecutive nests → `static_ford` \| \| Partial rescue \| 3 \| Outer 2 levels of blocked 4-level nests \| \| B (nested `static_ford` merge) \| 5 \| Two nested `static_ford` → single higher-dim `static_ford` \| \| Total \| 129 \| Across 29 files \| 6 false positives were detected and reverted (in `tensor_adaptor.hpp`, `tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the inner loop bound depended on the outer variable. ### Files changed by family \| Family \| Files \| Sites \| \|--------\|-------\|-------\| \| Block GEMM \| 12 \| ~20 \| \| FlatMM pipelines \| 4 \| ~69 (including 5 ford-ford merges) \| \| GEMM quant \| 7 \| ~22 \| \| FlatMM kernel \| 1 \| 2 \| \| FMHA \| 1 \| 2 \| \| Reduce/norm \| 2 \| 2 \| \| Epilogue \| 1 \| 1 \| ### Blocked locations from review comments - block_gemm_areg_breg_creg_v1.hpp:356 — BLOCKED: runtime scale loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between every nesting level - block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228 — BLOCKED: `zero_accumulators()` before inner loop; `sched_barrier` + conditional `block_sync_lds()` after inner loop - block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298 — BLOCKED: runtime `CWarpTensor` construction before inner loop; quantization scale application code after inner loop - block_universal_gemm_as_aquant_bs_cr.hpp:277 — BLOCKED: same pattern as above - block_universal_gemm_as_bs_bquant_cr.hpp:367 — BLOCKED: same pattern as above ## Depends on - #5938 ([CK_TILE] Optimize static_ford and sequence compile-time infrastructure) — provides the `ford_applier` that makes these conversions beneficial. Without it, `static_ford` uses a recursive implementation that provides no IR function savings. ## Results (combined with #5938) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Significant? \| \|--------\|----------\|-----------\|-------\|---\|-------------\| \| flatmm \| 161.1 \| 149.0 \| -12.1s \| -7.5% \| YES (p<0.01, 7/7 wins) \| \| universal_gemm \| 225.4 \| 220.3 \| -5.1s \| -2.3% \| YES (p<0.01, 7/7 wins) \| ### IR Function Counts (device trace, gfx942) \| Target \| InstFunc \| CodeGen \| \|--------\|----------\|---------\| \| universal_gemm \| -8.5% \| -9.2% \| \| flatmm \| -7.6% \| -10.5% \| ### ASM Equivalence 5/5 PASS — 650,151 lines verified identical (gfx942). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] ASM equivalence verified (650K lines, gfx942) - [x] Wilcoxon timing verified (7 trials, p<0.01) - [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-04-07 08:36:45 -06:00
Po Yen Chen	6dc44114ba	[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051 ) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details Warp GEMM: - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries V3 Kernel (`fmha_fwd_v3_kernel.hpp`): - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`): - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`): - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) Codegen (`fmha_fwd.py`): - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor Misc: - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. ## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) \| Problem \| Regular (TFlops) \| Varlen (TFlops) \| \|---\|---:\|---:\| \| batch=1 heads=6/1 seqlen=1024 causal \| 48.9 \| 47.6 \| \| batch=1 heads=6/1 seqlen=2048 causal \| 119.8 \| 117.4 \| \| batch=1 heads=6/1 seqlen=4096 causal \| 263.7 \| 259.2 \| \| batch=1 heads=6/1 seqlen=8192 causal \| 548.9 \| 543.6 \| \| batch=1 heads=6/1 seqlen=16384 causal \| 1043.0 \| 1063.7 \| \| batch=1 heads=6/1 seqlen=32768 causal \| 1237.2 \| 1279.6 \| \| batch=1 heads=6/1 seqlen=65536 causal \| 1315.4 \| 1382.7 \| \| batch=1 heads=6/1 seqlen=131072 causal \| 1326.3 \| 1402.2 \| \| batch=1 heads=16/1 seqlen=65536 causal \| 1298.7 \| 1388.4 \| \| batch=1 heads=40/40 seqlen=37200 non-causal \| 1248.9 \| 1326.1 \| ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 22:19:28 +08:00
Jeff Huang	449844e3d3	[CK_TILLE] Temporarily remove batch prefill KV cache overflow asserts (#6201 ) ## Summary - Temporarily remove the KV cache offset overflow assert checks in `FmhaBatchPrefillWithPagedKVCacheKernel` - The asserts are correct, but they block project progress in certain configurations - This is a temporary workaround to unblock progress; a proper fix will follow ## Note This is NOT a permanent solution. A follow-up PR will add proper overflow handling that addresses the underlying issue without blocking progress.	2026-04-07 20:41:24 +08:00
Ding, Yi	3848d2411a	Merge origin/develop into users/yiding12/fmha-bwd-workspace	2026-04-07 05:28:49 -05:00
Bartłomiej Kocot	4112e08d0c	[CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090 ) ## Motivation Improve accuracy of conv bwd data perf measurements ## Technical Details - enable flush cache - for grouped conv we zero conv input(gemm output) inside device op, so we also include this in time measurement - for non-grouped conv we zero conv input(gemm output) outside device op (in profile_conv_bwd_data_impl.hpp) so it is not included. - In this pr I changed it to include zeroing if time_kernel/flush cache is enabled so at now you should have more fair comparison. I changed it only for time_kernel/flush_cache because MIOpen run own zeroing for non-grouped solvers. ## Test Plan test_grouped_conv_bwd_data_* ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-04 00:22:22 +00:00
harkgill-amd	12fe2c3de4	Add missing gfx1033 to gfx103 group definition in ck (#5141 ) ## Motivation Resolving PyTorch build failures when enabling builds for gfx103X-all family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033` is the only failing architecture in the family and the failures point to missing support in CK. ## Technical Details PyTorch build fails with repeated error message ``` /__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD' 33 \| wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` `gfx1033` is missing from the `__gfx103__` group which results in `CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding in `gfx1033` to the missing files which should be the minimum fix to allow torch builds to pass. ## Test Plan Compile sample test file and target gfx1033 ``` ... #ifdef __HIP_DEVICE_COMPILE__ static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value"); #else static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value"); #endif ``` ## Test Result Prior to the applying patch, compilation fails with `error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'` After applying patch, test file compiles successfully. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-04-03 13:44:38 -06:00
Ding, Yi	28afc8fee3	[CK_TILE] Use Unified Workspace for FMHA BWD	2026-04-03 03:54:38 -05:00
Thrupti Raj Lakshmana Gowda	8c42eaf3f8	[CK Tile] Fix architecture-dependent EightWave assignment in cshuffle_epilogue (#6102 ) ## Motivation Tile engine CI build on the develop branch started failing after a recent change(https://github.com/ROCm/rocm-libraries/pull/5218) in `cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally computed as `(MWave * NWave == 8)` for all architectures, but this logic is only valid for gfx9. On other architectures (e.g., gfx1201), `EightWave` must always be `false`, otherwise it leads to incorrect `BlockedXDLN_PerWarp` computation and build failures. ## Technical Details In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as: ```cpp static constexpr bool EightWave = (MWave NWave == 8); ``` This was applied regardless of the target GPU architecture. The fix uses a preprocessor guard to make this architecture-aware: - *gfx9 (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave == 8)` — true or false depending on the wave configuration - All other architectures: `EightWave` defaults to `false` ## Test Plan - Tile engine CI build on develop branch ## Test Result - Pending CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --- > Note: This PR supersedes ROCm/rocm-libraries#5436, which is blocked pending a review approval from a reviewer currently on PTO. The same changes have been applied to this branch (`users/tlakshma/ck/develop-clone`) to allow merging. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-04-02 20:55:03 -06:00
Christopher Millette	522902f29b	[CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938 ) ## Problem Each `static_for<0, N, 1>` instantiates its lambda N times (one per `number<I>` type). When nested, intermediate lambdas capture the outer loop variable (a different type per iteration), creating unique closure types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32 = 52 IR functions, of which 20 are intermediate closures that get inlined away but still cost frontend compile time. ck_tile's `static_ford` was supposed to eliminate these intermediates (as old CK's PR #5031 did successfully), but it used a recursive `static_ford_impl` that recreated the same closure pattern plus added `reorder_old_to_new`/`reorder_new_to_old` overhead. Additionally, the sequence utility layer (`sequence_sort`, `is_valid_sequence_map`) used recursive template metaprogramming that generated O(N log N) intermediate types for every permutation validation — called on every `reorder_new_to_old`/`reorder_old_to_new` invocation. ## Changes ### 1. Replace `sequence_sort` with constexpr insertion sort Replace recursive merge sort (`sequence_sort_impl` + `sorted_sequence_merge_impl`, O(N log N) intermediate type instantiations) with constexpr insertion sort using `static_array`. O(1) template depth, same `::type` and `::sorted2unsorted_map` API. ### 2. Replace `is_valid_sequence_map` with constexpr check Replace sort-based permutation validation (which instantiated the full `sequence_sort` chain) with a constexpr "seen array" loop. O(N) constexpr steps instead of O(N log N) template instantiations. ### 3. Replace recursive `static_ford` with flat-loop `index_decomposer` Replace `static_ford_impl` (recursive `static_for` nesting + `pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat `index_decomposer` using pre-computed strides. Add `decompose_reordered` alias that folds reordering into decomposition, and `inverse_perm` helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map` → `sequence_sort` chain. ### 4. Eliminate internal lambda via `ford_applier` The flat-loop approach still used `static_for` with a lambda, creating M×N internal lambda instantiations per call site. Replace with `ford_applier` struct that calls `f(decompose<I>{})` directly via fold expression — zero intermediate closures: ```cpp // Before: 2×M×N function instantiations static_for<0, MN, 1>{}([&](auto i) { f(decompose<i>{}); }); // After: M×N function instantiations (50% reduction) ford_applier<Decomposer, make_index_sequence<MN>>{}(f); ``` Also unified identity and non-identity order paths into a single template with `constexpr if`. ### 5. Fix const-qualified sequence handling Fix `is_valid_sequence_map` to handle const-qualified sequence types via `remove_cvref_t` in callers (`tensor_adaptor.hpp`, `tile_distribution_encoding.hpp`). ## Results (this PR only, without flattening) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Wins \| Significant? \| \|--------\|----------\|-----------\|-------\|---\|------\|-------------\| \| flatmm \| 160.1 \| 152.7 \| -7.4s \| -4.6% \| 6/7 \| YES (W+=1, p<0.05) \| \| universal_gemm \| 228.4 \| 224.7 \| -3.7s \| -1.6% \| 6/7 \| Trending (W+=4) \| Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5] Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6] ### IR Function Counts (device trace, gfx942) \| Target \| Metric \| Before \| After \| Delta \| % \| \|--------\|--------\|--------\|-------\|-------\|---\| \| universal_gemm \| InstantiateFunction \| 117,715 \| 109,165 \| -8,550 \| -7.3% \| \| universal_gemm \| CodeGen Function \| 47,912 \| 45,044 \| -2,868 \| -6.0% \| \| flatmm \| InstantiateFunction \| 100,939 \| 95,127 \| -5,812 \| -5.8% \| \| flatmm \| CodeGen Function \| 42,651 \| 40,367 \| -2,284 \| -5.4% \| Note: The `ford_applier` (commit 3) has minimal additional effect in this PR since ck_tile code does not yet use `static_ford` extensively. Its impact compounds when the follow-up flattening PR #5939 converts 124 `static_for` nests to `static_ford`. Combined results with #5939: flatmm -7.5% wall time (p<0.01), CodeGen -10.5%. ### ASM Equivalence 7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 + gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] `test_ck_tile_static_ford`: 13 behavioral tests (identity/non-identity orders, 1D-4D, unit dimensions, edge cases) - [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map, is_valid_sequence_map edge cases, sequence_unique_sort map round-trip) - [x] ASM equivalence verified (980K lines) - [x] Wilcoxon timing verified (7 trials, flatmm p<0.05) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 15:25:14 -06:00
Linjun-AMD	ba0efe01af	[CK Tile] Add sink token gradient support in FMHA backward pass (#5504 ) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 11:17:01 +08:00
Chinmay Dattanand Kuchinad	820ed2dbb3	[CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776 ) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-04-01 09:21:20 -07:00
aledudek	357a140e7b	[CK_TILE] Add pooling in tile_engine (#4469 ) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Add pooling in ck tile engine ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-04-01 07:31:46 +00:00
Yi DING	9b8b2456b4	[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991 ) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 13:44:04 +08:00
Estevan Vedovelli	2510e7b238	[CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022 ) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:18:11 -07:00
Bartłomiej Kocot	f14ee90152	[CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842 ) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable them - add padding for atomic add bf16 due to the lack of oob - remove padding for not continous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 10:02:24 +02:00
jakpiase	0b98317983	[CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729 ) ## Motivation This PR changes gemm/convolution cshuffle layout into plain one. to improve cshuffle operation performance. ## Technical Details The purpose is that before this change the cshuffle layout was having some descriptor transformations that were probably aimed at reducing LDS bank conflicts, but the transformations itself were terribly slow, which negatively impacted the performance. ## Test Plan There is no need for additional tests, since current tests cover this functionality.	2026-03-31 03:39:03 +00:00
Illia Silin	3873bf3b91	[CK] fix clang lifetimebound errors with staging compiler (#5921 ) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 07:19:32 -07:00
Yi DING	4a1abd0e31	[CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915 ) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 09:44:35 +08:00
Jan Patrick Lehr	d1327bedb7	[CK] More lifetime-warning suppression (#5639 ) ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:19:46 +00:00
Linjun-AMD	e40a675f74	[CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095 )" (#5849 ) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:36:39 +00:00
Bartłomiej Kocot	dad85b964c	[CK] Fix min k_batch calculation in conv kernels (#5785 ) ## Motivation Avoid division by 0 and remove not needed "-1". ## Technical Details Our div up implementation return lower value if input is divisible. There is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:37:37 +00:00
Johannes Graner	c60514f371	[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393 ) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 10:17:10 +01:00

1 2 3 4 5 ...

1575 Commits