mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-13 09:45:56 +00:00
7dcc606adcbca072fbd6eefbfd74ae0d0d923b3e
1551 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2169367735 |
[rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5)
[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114) ## Motivation Improve profiling for grouped convolution forward for better comparison between CK and CK Tile ## Technical Details - Include preprocessing time for ck tile - Add flush cache for conv fwd profiler - Switch configs to builder reflect - Add KPerXdl deduce - Add non-grouped ported instances ## Test Plan test_grouped_convnd_fwd_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-786 |
||
|
|
56e1d5da08 |
[rocm-libraries] ROCm/rocm-libraries#5028 (commit 5131491)
[CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary ### Rationale `ck_tile::sequence` is the most fundamental metaprogramming type in ck_tile — it underpins tensor dimensions, strides, loop bounds, and index calculations. Six of its metafunctions use recursive template instantiation, producing O(N) to O(N²) intermediate types that the compiler must process. When these are used inside deeply nested GEMM pipelines with large dimension counts, the cumulative instantiation overhead becomes a significant contributor to frontend compile time. Measurements on `test_gemm_pipeline_compv6` show 84,288 `InstantiateFunction` calls in the frontend alone. Reducing template instantiation depth in these core utilities has a multiplicative effect because they are called from hundreds of sites. ### What changed | Metafunction | Before | After | |---|---|---| | `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion | | `sequence_gen` | O(log N) recursive binary split | O(1) via `__make_integer_seq` | | `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via `__make_integer_seq` | | `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr for-loop + pack expansion | | `sequence_inclusive_scan` | Computed via reverse + flip | O(1) constexpr for-loop (unified impl) | | `sequence_exclusive_scan` | O(N) recursive merge chain | O(1) constexpr for-loop + pack expansion | | `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr for-loop + pack expansion | Supporting changes: - Portable `__type_pack_element` fallback with `__has_builtin` guard (hipRTC-safe, no `<tuple>` dependency) - Renamed reserved `__integer_sequence` to `integer_sequence_wrapper` - Adopted `static_array` from develop (PR #4355) for constexpr computation - Unified forward and reverse inclusive scan into a single `sequence_inclusive_scan_impl` with `bool Reverse` template parameter - Added `sequence_inclusive_scan` struct (new public API for forward scan direction) - Replaced recursive `sequence_exclusive_scan` (3 template specializations) with `sequence_exclusive_scan_impl` using the same constexpr for-loop pattern as inclusive scan - Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new impl - Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and `prefix_sum_sequence` to match sibling scan function annotations ### Technical debt and housekeeping - Unified all `namespace impl` to `namespace detail` across sequence.hpp for consistency - Removed dead comment block (orphaned `integer_sequence` alternative) - Added defensive `static_assert(sizeof...(Is) > 0)` in `sequence_map_inverse::build_inverse` - Converted all multi-line Doxygen blocks from `///` to `/** */` per style guide - Corrected `constexpr static` to `static constexpr` keyword ordering in `static_array` - Added blank line between `#pragma once` and first `#include` in `static_array.hpp` - Trimmed redundant 4-line comment on `sequence_gen_helper` to a one-liner - Moved `sequence_gen` Doxygen comment below `namespace detail` block so it directly precedes the struct it documents - Added Doxygen `@brief`/`@tparam`/`@pre` documentation for `sequence_gen` and `sequence_map_inverse` public APIs - Added `@brief` documentation to `static_array` explaining relationship to `ck_tile::array` - Added scope comment at `namespace detail` openings **Note:** `private:`/`public:` access modifier indentation is enforced at 4 spaces by `.clang-format`. The style guide calls for left-alignment, but the formatter overrides this. Requires a `.clang-format` config change to resolve — not addressable in code. ### `static_array` hardening (from develop's PR #4355) - Added zero-length array guard (`T elems[N > 0 ? N : 1]`) - Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()` - Added `#include "ck_tile/core/config.hpp"` (IWYU for `CK_TILE_HOST_DEVICE`) ### Value Combined with the `static_ford` changes, measured impact on `test_gemm_pipeline_compv6`: - **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439) - **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203) - **Wall-clock: -16.3%** (611.6s → 512.2s) ### Files changed (4) - `sequence.hpp`: Metafunction optimizations, namespace unification, documentation, style fixes - `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`, documentation, style fixes - `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions (new file) - `CMakeLists.txt`: Register new test target ## Test plan - [x] 50 runtime unit tests covering all optimized and pre-existing sequence APIs - [x] Edge cases: empty sequences, single-element, larger sizes (N=8), negative values, non-trivial init values - [x] Both functor signatures tested (`operator()(index_t)` and `operator()(number<I>)`) - [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse, inclusive, and exclusive directions - [x] Exclusive scan: sum, product, single, empty, non-zero init - [x] Prefix sum: N+1 output verification, single, empty - [x] Permutation round-trip verification for `sequence_map_inverse` - [x] Full sequence public API coverage: modify, gen, uniform_gen, scans (inclusive, exclusive, prefix sum), map_inverse, make_index_sequence, size/sum/product, push/pop, reverse, extract, merge, arithmetic operators, equality, transform - [x] Portable `__type_pack_element` fallback tested implicitly (same `at_index_t` interface) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
d8ee107a47 |
[rocm-libraries] ROCm/rocm-libraries#4421 (commit 5bb5769)
[CK] Unify the grouped convolution gridwise Run() functions (#4421) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation There are currently three different grouped convolution related Run() function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`. These are used for the different types of grouped convolution: Forward, Backward weights, and Backward data. The functions are very similar and should be unified to a single `Run()` function for all types of grouped convolution. ## Technical Details The three old `Run<>()` functions were replaced with a single unified function. The new `Run<>()` function is run from device implementations: - DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 - DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor implementation uses a different `Run<>()` overload and was therefore not modified. ## Test Plan Run the following grouped convolution tests on `gfx1201`, as this architecture is WMMA-capable: - `test_grouped_convnd_fwd` - `test_grouped_convnd_bwd_weight` - `test_grouped_convnd_bwd_data` Compilation and testing were also executed on `gfx1100` to avoid CI problems. ## Test Result First part (unification of `Run<>()` function): All tests successful. Second part (integration of single `Run<>()` function as a direct call): All tests successful. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
2312eef6c3 |
[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
9f47b8a63d |
[rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15)
[CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284) Add factory, InstanceTraits, and conv traits support for the WMMA V3 forward convolution kernel, enabling the CK Builder to generate and dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs. ## Motivation As reported in issue #4944, MIOpen includes WMMA V3 forward convolution kernels, so this PR adds support for those kernels similarly to other supported kernels. ## Technical Details This follows the same implementation as the other kernels. I added some support for reflection, but I left a few todos since we need to generalize our convolution traits to generalize across WMMA/MFMA and CK/CKTile. ## Test Plan Added faster tests to `ninja smoke-builder` that check the instance-traits logic, and I added longer tests that instantiate kernels, following the existing pattern in other kernals. ## Test Result I tested all code with `ninja check-builder` on a gfx1101 build and ran on gfx1101. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
8f27f65d44 |
[rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb)
[CK_TILE] MX GEMM non-preshuffled RCR layout ## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b8def2c724 |
[rocm-libraries] ROCm/rocm-libraries#5041 (commit 481aecc)
[CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Optimize `SpaceFillingCurve` in CK to reduce compile time by precomputing all index values into a static constexpr lookup table. ### Problem - `GetIndex<N>` was instantiated separately for every index value (0 to NumAccesses-1) - Each instantiation triggered nested `static_for` loops with O(N²) template depth - This caused **34,000+ template instantiations** taking **69 seconds** in frontend ### Solution - Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed indices - Add `compute_single_index()` helper using O(N) `static_for` loops - Add `compute_all_indices()` to build entire table in one constexpr evaluation - `GetIndex<N>` becomes simple array lookup: `return index_table[N]` ### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp) | Metric | Before | After | Improvement | |--------|--------|-------|-------------| | Total compile time | 120.4s | 83.6s | **-31%** | | Frontend time | 88.7s | 52.6s | **-41%** | | GetIndex instantiations | 34,176 | 384 | **-99%** | | GetIndex time | 69.0s | 0.11s | **-99.8%** | | SpaceFillingCurve time | 75.7s | 4.3s | **-94%** | ## Test plan - [x] Builds successfully with `-Werror -Weverything` - [ ] Run existing unit tests - [ ] Verify numerical correctness on sample kernels 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
c800f88911 |
[rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523)
[CK_TILE] Update gfx11 FMHA forward kernel configs ## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. ## Test Plan - test_ck_tile_fmha - tile_example_fmha_fwd: tested this on gfx1100 and gfx1151 ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - TFLOPs by sequence length target: `gfx1100` layout: `bhsd` - mode: batch / VGPR usage: 225 vs 214 SeqLen | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 74.10 | 71.97 | 0.97x 4096 | 66.26 | 77.79 | 1.17x 8192 | 68.18 | 75.88 | 1.11x 12288 | 68.47 | 80.44 | 1.17x 16384 | 59.54 | 79.66 | 1.34x 20480 | 55.78 | 77.91 | 1.40x 24576 | 55.08 | 77.47 | 1.41x 27280 | 47.45 | 77.16 | 1.63x - mode: group / VGPR usage: 256 vs 214 SeqLen | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 71.47 | 70.6 | 0.99x 4096 | 64.74 | 77.06 | 1.19x 8192 | 64.68 | 75.47 | 1.17x 12288 | 66.43 | 79.95 | 1.20x 16384 | 56.02 | 79.73 | 1.42x 20480 | 50.21 | 78.15 | 1.56x 24576 | 47.29 | 77.53 | 1.64x 27280 | 46.13 | 77.04 | 1.67x - TFLOPs by sequence length target: `gfx1151` layout: `bshd` - mode: batch / VGPR usage: 225 vs 223 Batch | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 26.85 | 29.17 | 1.09x 4096 | 24.75 | 26.01 | 1.05x 8192 | 25.24 | 25.50 | 1.01x 12288 | 25.18 | 25.00 | 0.99x 16384 | 24.79 | 25.91 | 1.05x 20480 | 25.56 | 25.24 | 0.99x 24576 | 25.13 | 26.20 | 1.04x 27280 | 10.78 | 26.35 | 2.44x - mode: group / VGPR usage: 256 vs 229 Batch | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 27.44 | 26.71 | 0.97x 4096 | 21.89 | 23.09 | 1.05x 8192 | 22.85 | 24.49 | 1.07x 12288 | 24.33 | 24.42 | 1.00x 16384 | 20.05 | 24.98 | 1.24x 20480 | 14.70 | 25.15 | 1.71x 24576 | 11.30 | 26.31 | 2.33x 27280 | 10.10 | 26.32 | 2.61x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
8c216604d4 |
[rocm-libraries] ROCm/rocm-libraries#5218 (commit 60156cf)
[CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218) ## Motivation Fix the failure of the aiter to call eightwarp. Changed Async to the name eightwarps. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan Pass ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
fe8b7d0c27 |
[rocm-libraries] ROCm/rocm-libraries#4742 (commit d340a14)
[CK_TILE] Fix FMHA async pipeline LDS sync issue
## Motivation
Fix FMHA forward async pipeline
(`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue.
Some attention test cases intermittently fail due to a race condition
where the V tile store to LDS overwrites K tile data that is still being
read by other threads during the tail `gemm_0` operation.
## Technical Details
In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the
same LDS memory through a rotation schedule (`LdsSeq`).
After the tail `gemm_0` (line 458), some fast threads may proceed to
store V to LDS (line 617) before slow threads finish reading K data from
the same LDS buffer.
The fix adds an `s_barrier` synchronization after the tail `gemm_0` when
K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e.,
`LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):
`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) ==
LdsSeq.at(number<k0_loops>{}))
__builtin_amdgcn_s_barrier();`
Why `s_barrier` alone is sufficient (no s_waitcnt lgkmcnt(0) needed):
The `gemm_0` MFMA instruction internally waits for its LDS operands
(ds_read) to complete before execution
Therefore, each thread's ds_read of K data is already complete by the
time gemm_0 finishes
Only cross-thread synchronization (`s_barrier`) is needed to ensure all
threads have finished reading before any thread starts writing V
|
||
|
|
683865895e |
[rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138)
Proof of concept for removing forward declarations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:
|
||
|
|
e2ce0cad54 |
[rocm-libraries] ROCm/rocm-libraries#4673 (commit ec385da)
Compile-time optimize threadwise slice transfer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Profiling with `-ftime-trace` on representative translation units (e.g., `device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`) revealed that **92% of frontend time was spent in template instantiation**. The primary bottleneck was redundant instantiation of identical helper logic across multiple threadwise transfer class variants. Each `ThreadwiseTensorSliceTransfer_v*` class independently contained its own copy of the same helper computations — serpentine traversal, coordinate stepping, thread scratch descriptors, lambda-like functors, and compile-time constants — duplicated across 13 header files. When a typical GEMM or convolution kernel TU includes blockwise operations (e.g., `blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants simultaneously, causing the compiler to instantiate the same helper logic multiple times with the same template arguments. This was compounded by the helpers being defined as members of the outer `ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template parameters. Functions like `ComputeForwardSweep` depend only on their two argument types, but as inline members of the outer class, the compiler was forced to create separate instantiations for every unique combination of all outer parameters (data types, descriptors, vector widths, etc.) — even when most of those parameters had no effect on the helper's output. ## Technical Details ### The Fix: Shared Helper Struct Hierarchy Duplicated logic was extracted into a standalone helper hierarchy in `threadwise_tensor_slice_transfer_util.hpp`: ``` ThreadwiseTransferHelper_Base (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor, | ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple) +-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex, | ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable) +-- ThreadwiseTransferHelper_SFC (ComputeSFCCoordinateResetStep) ``` Each helper method is now parameterized **only by what it actually uses**: - `ComputeForwardSweep(idx, lengths)` — parameterized only by the two argument types, not by `SrcData`, `DstData`, `SrcDesc`, etc. - `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by the descriptor and access sequence types. - `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector, DimAccessOrder>()` — parameterized only by the four values it actually needs. This reduces template instantiation work in two ways: 1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather): the compiler reuses a single instantiation instead of creating one per variant. 2. **Across different outer class instantiations** (fp16 vs bf16 vs int8): the compiler reuses the helper instantiation because the helper doesn't depend on the data type at all. ### Refactored Headers **13 headers** now delegate to the shared helpers instead of duplicating logic: - Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant - SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter - Dead code removed: v4r1, v5r1 ### Additional Fixes Found During Refactoring - Two latent bugs in v3r2 (`forward_sweep` indexing, `GetDstCoordinateResetStep` extraction) - Dead `SrcCoordStep` variables in v4r1 and v5r1 - Unused `scale_element_op_` member in v3r1_dequant (restored with note) ### Net Code Change +1,428 / -2,297 lines (~870 lines removed). ## Test Plan ### Unit Tests 28 host-side gtests in `test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp` covering the full helper hierarchy: | Suite | Tests | What is verified | |-------|-------|------------------| | ThreadwiseTransferHelperBase | 6 | Compile-time constants, inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false in 2D and 3D | | ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep` (even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete), `ComputeDataIndex`, `ComputeCoordinateResetStep`, `VectorSizeLookupTable`, `VectorOffsetsLookupTable` | | ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` — single access, 2D row-major, 2D column-major, 3D batch, even/odd inner access counts | | ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive from Base, are not related to each other | | DetailFunctors | 4 | `lambda_scalar_per_access`, `lambda_scalar_step_in_vector`, `lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) | ### Semantic Equivalence GPU ISA comparison using `--cuda-device-only -S` confirmed identical assembly output (modulo `__hip_cuid_*` metadata) between baseline and refactored code. ## Test Results All measurements on a 384-core machine, `-j64`, freshly rebooted, near-idle. ### Targeted Builds (affected targets only) | Target | Baseline | Refactored | Wall-clock Delta | CPU Delta | |--------|----------|------------|-----------------|-----------| | `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU | 6m 53s / 161m CPU | **-9.7%** | **-14.9%** | | `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU | 6m 42s / 182m CPU | **-31.8%** | **-10.0%** | | **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** | **-22.2%** | **-12.4%** | ### Full Project Build (8,243 targets) | Metric | Baseline | Refactored | Delta | |--------|----------|------------|-------| | Wall-clock | 103m 38s | 111m 56s | +8.0%* | | CPU time | 4705m 7s | 4648m 17s | **-1.2%** | \*Wall-clock inflated by external load spike during refactored build (load 90 vs 66). CPU time is the reliable metric. ### Context ~15% of all build targets (1,262 / 8,243) transitively include the modified headers. These are primarily GEMM and convolution kernel instantiations — the core compute workloads. The 12-15% CPU savings on affected targets is diluted to 1.2% across the full project because 85% of targets are unaffected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b0c13f3124 |
[rocm-libraries] ROCm/rocm-libraries#5094 (commit d4548e6)
[CK] use int64 for ptr offset MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3), the `expert_id * expert_stride` calculation in MOE GEMM kernels overflows `int32` (`index_t`), causing the weight matrix (B) pointer to wrap to an invalid address and triggering a GPU memory access fault. For example, with `N=1024, K=7168, IsInputGemm=true`: - `expert_stride = N * K * 2 = 14,680,064` - `INT32_MAX / expert_stride ≈ 146` - Any `expert_id >= 147` causes overflow → negative offset → illegal memory access → GPU crash ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: amd-shiraz <shiraz.ali@amd.com> |
||
|
|
03ce21ddcb |
[rocm-libraries] ROCm/rocm-libraries#4837 (commit 6316035)
[CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837) ## Motivation The existing unification work supports DENSE intrinsics. In this PR we enable support for SPARSE as well as SCALE intrinsics and add an example SPARSE implementation. ## Technical Details Mostly trivial changes. One framework change is that the desired `MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant commit explains, we do not support a fallback family at the moment, but it is something we can consider. ## Test Plan Added a new test for the relevant sparse specializations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
6e558658ea |
[rocm-libraries] ROCm/rocm-libraries#4999 (commit 45f6624)
[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV cache (#4999) Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to page start via 48-bit arithmetic, encode only within-page offset in voffset. Original code path preserved for ps1/ps16 via constexpr-if. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
ae4e632c7d |
[rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)
[CK_TILE] Add CK Tile bwd weight profiler MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation To compare old CK and CK Tile, we need to extend the current CK profiler to support running also CK Tile instance with the same API. In order to have the same instance coverage in CK Tile compared to the old CK, I've added code generation from old CK configurations to CK Tile instances using the CK Builder. ## Technical Details - The codegen python script for CK Tile fwd convs is extended to support also bwd weight and bwd data. - The generated instances are added to the CMake build (target `device_grouped_conv_bwd_weight_tile_instance`s). - A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to the CK Profiler. |
||
|
|
30702c9cbc |
[rocm-libraries] ROCm/rocm-libraries#4834 (commit e75e6cb)
[CK_TILE][GEMM] Fix eightwarp error & Add eightwarp unit test (#4834) ## Motivation The primary goal of this PR is to fix a critical issue in the EightWarps implementation within ck_tile. Additionally, unit tests were added to ensure that CI can detect errors. ## Test Plan ninja test_tile_gemm_quant_abquant_eightwarps ./bin/test_tile_gemm_quant_abquant_eightwarps ## Test Result All EightWarps related test cases in TestCkTileGemmABQuant completed successfully without linker errors or validation mismatches. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b09112bbad |
[rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface ## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. **Changes:** - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
2a16d53cce |
[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)
[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 | ((oss << sep << rest), ...); [composable_kernel] | ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 | return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] | ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 | const char* vp = std::getenv(name); [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 | _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 | #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] | ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
f1746955fd |
[rocm-libraries] ROCm/rocm-libraries#4984 (commit 962b047)
[CK_TILE] Reduce Register Spills in Stream-K Reductions (#4984) ## Motivation In CK Tile Stream-K, kernels using one of two non-atomic reduction strategies (i.e., linear, tree) have high register spill count, with the tree reduction generally being worse. These changes act a first step to help decrease the register spill count. ## Technical Details ### Problem 1: Unvectorized access to partials In both the linear and tree reductions, workgroups write partials results to a global buffer; another workgroup will later read this data. When the initial logic to support reading and writing to the partials buffer was added (see https://github.com/ROCm/composable_kernel/pull/3107), the tile distribution encoding used to read from and write to partials matches the register layout for the accumulator of the mfma instruction used for the kernel. Since we do not currently use the transposed register layout for the accumulator, we end with an encoding that is not optimized for writing to HBM. For example: Consider the register layout of the `v_mfma_f32_16x16x32_fp8_fp8` instruction. ```bash ./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix ``` <img width="1113" height="537" alt="image" src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02" /> The above shows that threads are responsible for consecutive elements down a column of the C tile. If we use this distribution to read and write to partials with C in row major, then threads are unable to perform vectorized reads and writes. Note: thread 0 is shown in red and thread 1 is shown in green. Since the C-shuffle Epilogue only supports C in row major, reading and writing to partials is highly unoptimized. ### Problem 2: Missed opportunity for SPGR use in tree reduction loop Since the reduction occurs between workgroups, all threads in the workgroup follow the same execution paths in the tree reduction logic, hence various variables should be using SGPRs, but they are not. ### Implemented Solutions 1. Add a new tile distribution encoding that is optimized for accessing partials in HBM. This encoding does not change the data assignment to threads, it merely changes the addresses to which they write/read in the partials buffer. For example, continuing with the `v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result in threads writing in the following layout: <img width="517" height="342" alt="image" src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202" /> This layout ensures that each thread writes along a row, enabling `buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses). This helps reduce register usage due to requiring fewer offset calculations. 2. To force SGPR usage in the tree reduction loop, I make use of CK Tile's `amd_wave_read_first_lane` which is a wrapper around `__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the tree reduction. _These changes do not fully eliminate register spills. Future work will aim to further reduce spills. But these changes make good progress._ ## Test Plan Added tests for different warp tile sizes to validate that the new encoding works with different `WarpGemm` variants. ## Test Result All tests pass locally on all gfx9 architectures. Some results for decreases in register spills on gfx942: (BL = baseline) | Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % | VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % | |--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:| | fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 | -100.0% | 21 | 20 | -1 | -4.8% | | fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3% | 443 | 23 | -420 | -94.8% | | fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6% | 12 | 6 | -6 | -50.0% | | fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9% | 396 | 12 | -384 | -97.0% | ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b042e1805a |
[rocm-libraries] ROCm/rocm-libraries#4804 (commit 832dd0e)
Add Tile Distribution Encoding Register Mapping debug utility for MFMA / WMMA unification work. (#4804) ## Motivation This PR adds a small utility that allows you to use Tile Distribution Encodings to directly map matrix elements to register locations and vice versa. It can also print forward and backward layout mappings similar to the Matrix Calculator utility. The utility is not meant for index calculations in actual kernels, but rather as a debugging tool and probably for automated verification of the policy structs in the new WMMA / MFMA unification design. ## Technical Details Tile Distribution Encodings are a core part of CK Tile which can define the relationship between register and intrinsic matrix fragment elements. They allow for any mapping based on unmerge and merge transformations. Also, they allow for a special "Repeat" dimensions which acts like an additional matrix dimension and allows for replication of certain matrix elements. The new mapping utility can deal with all aspects. ## Test Plan Since this is a debug utility there is nothing to directly test, but there is an example file that defines four different Tile Distribution Encodings and prints their forward and backward mappings, along with some extra parameters. ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
95dc496d30 |
[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)
Cleanup and refactoring related to tile loading MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Cleanup and refactoring done while implementing mixed precision for fp16/bf16 x fp8 Key changes: - Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and refactored the API to use consistent naming conventions - Updated load_tile_transpose functions to use output parameters instead of return values for consistency - Removed unused variable declarations and simplified type deduction logic - Define load_tile_with_elementwise to use tuple types explicitly for clarity ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered |
||
|
|
78ae3835a6 |
[rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66)
[CK] Fix gptoss sink ## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
d32d515f64 |
[rocm-libraries] ROCm/rocm-libraries#4873 (commit 580ad4f)
[CK] CK Tile improvements and fixes for depthwise merged convolutions forward (#4873) ## Motivation Performance benchmarks showed that old CK's depthwise merged convolutions are much faster than CK Tile's ones. ## Technical Details After investigation it showed up that the requirement that A/CVectorload is a multiple of gemm's rightmost dimension is too strict in case of processing multiple groups, because if tensor is in NHWGC/NHWGK format, then if C/K is equal to 1, we can use vectorloads on the G dimension, which is added by this PR. Filter5x5 specialization was also added, because some models are using it, it's similar to 3x3, the only difference is the window size. This addition was needed, because of the differences of tensor descriptor transformations betweeen CK and CK Tile. In old CK the case of grouped depthwise 5x5 convs was supported via Default specialization, but in CK Tile that case was not working properly. ## Test Plan Performance was tested by our internal test suite, which contains several DL models. ## Test Result Tests results showed significant performance uplift for depthwise(3x3, 5x5) cases |
||
|
|
1dd47118e2 |
[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas with named functors (#4828) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add `generate_identity_sequences<N>()` helper that returns `Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>` - Replace lambdas with named functors in `transform_tensor_descriptor` - Add `unpack_and_merge_sequences` helper functor - Reduces `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction) ## Motivation Multiple call sites use `generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda instantiations. Additionally, each lambda in `transform_tensor_descriptor` creates a unique closure type, causing the function to be instantiated separately for every call site. Named functors share a single type, so the compiler reuses the same instantiation. ## Changes ### Part 1: generate_identity_sequences helper - Replaces common lambda pattern for generating identity sequences - Each lambda expression creates a unique closure type, causing separate template instantiations at every call site - Named helper shares a single type across all uses ### Part 2: Named functors in transform_tensor_descriptor - Add `unpack_and_merge_sequences` helper to replace lambda in `GetNumOfHiddenDimension` - Use `generate_identity_sequences` in `matrix_padder.hpp` ## Test Plan - [x] Added 7 unit tests: - 4 tests for `generate_identity_sequences` - 3 tests for `unpack_and_merge_sequences` - [ ] Waiting for full CI ## Related PRs This PR merges the functionality from: - ROCm/composable_kernel#3588 (generate_identity_sequences helper) - ROCm/composable_kernel#3589 (Named functors in transform_tensor_descriptor) Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times) **Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and ROCm/composable_kernel#3589, which can be closed once this is merged. |
||
|
|
ef82340e05 |
[rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped kernels (#4875) ## Motivation Port non-grouped convolution instances to the grouped kernels to deprecated older non-grouped implementations. ## Technical Details Add the same instances as non-grouped but using grouped kernel. ## Test Plan test_grouped_convnd_fwd ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-724 |
||
|
|
c1ec24a2de |
[rocm-libraries] ROCm/rocm-libraries#4963 (commit cb6bbf6)
[CK][CK Tile] Fix batched gemm kernel 2 lds ## Motivation Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Technical Details Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Test Plan CI overall ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
5e06874aae |
[rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory (#4582) This PR adds bwd data wmma and xdl kernels to the ck builder, their instance and conv traits as well as tests for the above. |
||
|
|
c8a8449eec |
[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)
[CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Changes ### Split-K support in `gemm_quant_kernel.hpp` - **`SplitKBatchOffset`**: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided by `AQuantGroupSize::kK`. - **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor. - **`RunGemm`**: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode. ### Constraints in `IsSupportedArgument()` Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped: 1. **Mode check** — split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`. 2. **B quant group alignment** — `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches. 3. **A quant group alignment** (new, ABQuantGrouped only) — `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor. 4. **Minimum 2 K-tile iterations per batch** (new) — The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected. ### Example update (`run_gemm_quant_example.inc`) Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`). ## Unit Tests Two new test files covering decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B): - `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile shape (M=16, N=64, K_tile=256) - `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile shape (M=128, N=128, K_tile=128) Each test calls `run_test_with_validation` which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement). ## Prerequisites This PR depends on #4429, which must be merged before this can be merged. |
||
|
|
9059730caf |
[rocm-libraries] ROCm/rocm-libraries#4592 (commit 45f76cb)
Tile Engine support for gfx950 ## Motivation This PR adds support for the gfx950 GPU architecture to the Tile Engine in Composable Kernel library, focusing on GEMM operations with FP8 and BF8 data types. ## Technical Details Added gfx950-specific MFMA warp GEMM implementations with conditional compilation. Updated default GEMM configuration parameters for tile sizes and warp configurations. Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950 hardware. ## Test Plan Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Test Result Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
75aea70c2c |
[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4 ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests **This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR.** ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
abf13bdec1 |
[rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)
[CK] Fix windows build issues ## Motivation Full build on Windows is currently broken due to compiler errors, this PR should help fix that. This is also holding up the following PR in the TheRock: https://github.com/ROCm/TheRock/pull/3382 ## Technical Details 1. I don't see a good reason to be nesting a windows include inside the ck_tile namespace. It was causing compiler errors too: Windows.h comes with min and max, which was conflicting with ck_tile::min and ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this inclusion in the future. 2. The TRUE/FALSE macros are already used by Windows.h, which causes an error. So I've opted for True/False. You can see this pattern in other rocm-libraries. 3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN context, from \<cmath\> on Windows. We'll be able to use std::numbers::v_pi\<float\> when we have C++20 support. 4. There was a missing \<chrono\> include. ## Test Plan Test locally and make sure this doesn't impact existing CI. ## Test Result Compiles locally and passes existing ci. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a32d704d89 |
[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned apparently for xdl this feature was missing tests so our assumption is either there is an implemenetation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
1a2c0d835a |
[rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check ## Motivation Improve OOB checks. Remove permutes which have been generated by thread buffer zero clear. at now in assembly there is only condmask instead of permute + condmask. Change number of KPack for generated instances ## Technical Details Remove permute instructions from assembly ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
cd12e8e31f |
[rocm-libraries] ROCm/rocm-libraries#4295 (commit fa2cfc8)
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass desciptors directly (#4295) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create tensor views from strides and sizes. This refactors the descriptor creation out and add overloaded definitions, allowing descriptors to be created separately by the caller instead of passing explicit strides, with no functional changes. This will enable further refactoring of `RunGemm` to do likewise, enabling derived kernels like BatchedContractionKernel to avoid creating separate versions (PR [#3457](https://github.com/ROCm/composable_kernel/pull/3457)). ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Since the logic within the MakeXBlockviews chains together operations on tuples, and thus the descriptors are also passed as such, adding a template parameter for the type of the input tuple was the simplest option to enable the overload without too much verbiage. However, for `MakeCBlockView` this adds a complications as the templated definitions are prone to overlap. This for now is avoided by just moving the arguments around for the descriptor version, which avoids the collision. It's not a great solution, so feel free to suggest a better one. |
||
|
|
4c626aeaa6 |
[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant (#4267) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered |
||
|
|
3af1a0aafc |
[rocm-libraries] ROCm/rocm-libraries#4355 (commit e7f6909)
[CK TILE] Refactor sequence_reverse_inclusive_scan ## Proposed changes Refactor ck tile `sequence_reverse_inclusive_scan` from recursive to for-loop. Tracking issue: #4229 This pull request introduces a new lightweight array type, `static_array`, and refactors the sequence utilities to use it for improved constexpr support and simplicity. The changes also include updates to the build system to add container-related tests. **Core Library Improvements:** * Added a new header `static_array.hpp` that defines the `static_array` type, a constexpr-friendly array with basic accessors and no custom constructors. * Updated includes in `core.hpp` and `sequence.hpp` to import `static_array`. [[1]](diffhunk://#diff-14b406eccf59794051a16c0c9c1a7e11234324bfdd107a5bbe0f173cd25bcddcR44) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76R7) **Refactoring to Use `static_array`:** * Refactored sequence utilities in `sequence.hpp` to use `static_array` instead of the previously forward-declared `array` type, including in histogram and array generation logic. [[1]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1108-R1133) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1130-R1146) * Rewrote the implementation of `sequence_reverse_inclusive_scan` to use `static_array` for intermediate storage, improving constexpr evaluation and clarity. **Build System and Testing:** * Added a new test subdirectory for container tests and a GoogleTest executable for `unit_sequence.cpp` to the CMake build configuration. [[1]](diffhunk://#diff-5d35ff7555d3f0b438d45cde06b661eb1332cdbec66287ac7ec3c478d688aae5R5) [[2]](diffhunk://#diff-1f54f0d2b431b7fc74f7b4ffb66e80c381c904c3383b1d27987467e3482d6d7aR1-R7) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
fc3180120e |
[rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum
## Motivation
Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.
## Technical Details
Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Reduction = 1u,
TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Linear = 1u,
Tree = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan
No new functionality was added, so no new tests were added; I just
validated existing tests and examples.
## Test Result
All tests passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
6aa1cd8212 |
[rocm-libraries] ROCm/rocm-libraries#4704 (commit 17662f9)
[CK_TILE] Fix FP8 MXGEMM numerical error in async load path (#4704) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Fixes FP8 MXGEMM producing half the expected result (e.g., 128 instead of 256 with all 1s input). **Bug introduced in:** `b7de1e14cea70681a23cd1a136df42910c776e4a` - "[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)" ## Root Cause In the `static_move_ys=true` code path in `tile_window.hpp`, the IMM optimization computes `lds_ys_offset` using a default-constructed tensor descriptor: ```cpp make_tensor_coordinate(decltype(tensor_descriptor){}, idx_ys_offset) ``` This default-constructed descriptor has different strides than the actual DRAM tensor descriptor used for dram_ys_offset. When these offsets are mixed in the address calculation: ```cpp imm_valid = lds_ys_offset % IMM_RANGE; // From wrong descriptor wave_offset = dram_ys_offset - imm_valid; // From correct descriptor ``` The final address wave_offset + imm_valid ≠ dram_ys_offset, causing incorrect memory accesses. Fix ```cpp Set imm_valid = 0 to bypass the IMM optimization and ensure the full offset is passed through wave_offset: constexpr auto imm_valid = 0; // Avoids inconsistency between lds_ys_offset and dram_ys_offset ``` This disables the 12-bit immediate field optimization in the buffer_load_lds instruction but guarantees correctness. A proper fix would require making the DRAM tensor descriptor constexpr, which is not feasible since tensor strides depend on runtime parameters (LDA, LDB). |
||
|
|
0d92fffedb |
[rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11 ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
1915cdfcc2 |
[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762) Reverts ROCm/rocm-libraries#4638 unfortunately, this PR interfered with the PR#4299 and caused build errors for gfx11: In file included from /rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7: In file included from /rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11: /rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21: error: no matching function for call to 'CheckValidity' 553 | if(!GridwiseGemm::CheckValidity( | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||
|
|
29781f2ac4 |
[rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3 (#4638) 1. Add GetEstimateVgprCount to estimate the VGPR usage in GridwiseGemm_wmma_cshuffle_v3 2. Add IsValidCompilationParameter to disable kernel which use too many vgprs. - Currently, the threashold is AvailableVgprCount * 1.25 3. Modify examples to avoid test is disabled on gfx11 It is port from internal repo PR[#192](https://github.com/ROCm/composable_kernel/issues/192) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
7689090739 |
[rocm-libraries] ROCm/rocm-libraries#4556 (commit 15730e7)
fix: correct ULP calculation in get_absolute_threshold for BF16 tolerance (#4556) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation BF16 grouped GEMM tests were failing on gfx1201 with errors like: ``` Error: Incorrect results! out[5457621] != ref[5457621]: -66 != -65.5 max err: 0.5, number of errors: 1 ``` The calculated absolute tolerance (atol ~0.26) was too small to account for legitimate hardware vs software BF16 conversion differences (0.5 ULP). ## Changes 1. **Discrete exponent calculation**: Changed from continuous `log2()` to `floor(log2())` to match actual IEEE 754 floating-point exponent levels 2. **Full ULP for output_error**: Changed from 0.5 to 1.0 ULP to account for hardware `__bf16` vs software `float_to_bf16()` conversion differences ## Calculation Example For the failing case with value ~66: **Before (incorrect):** ``` expo = log2(66) = 6.044... atol = 2^(6.044 - 7) * 0.5 = 2^(-0.956) * 0.5 ≈ 0.26 Error 0.5 > 0.26 → Test fails ❌ ``` **After (correct):** ``` discrete_expo = floor(log2(66)) = 6 atol = 2^(6 - 7) * 1.0 = 2^(-1) * 1.0 = 0.5 Error 0.5 ≤ 0.5 → Test passes ✓ ``` The ULP for values in [64, 128) is 2^(-1) = 0.5, and the error of 0.5 is exactly 1 ULP, which is the maximum expected difference between hardware and software BF16 conversions at tie cases. ## Rationale Hardware and software BF16 conversions can differ by up to 1 ULP at tie cases due to different rounding strategies (hardware vs IEEE 754 round-to-nearest-even). The discrete exponent ensures ULP is calculated correctly for all values within an exponent range. **Modified file**: `projects/composablekernel/include/ck_tile/host/check_err.hpp` |
||
|
|
7b97e197ef |
[rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using for WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered |
||
|
|
0a2b6c4bcd |
[rocm-libraries] ROCm/rocm-libraries#4297 (commit 5ff580c)
moe flatmm xcd remap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit co-authors: @Chi-Chu319 @juuso-oskari Added XCD remapping for flatmm moe <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> <style> <!--table {mso-displayed-decimal-separator:"\."; mso-displayed-thousand-separator:"\,";} @page {margin:.75in .7in .75in .7in; mso-header-margin:.3in; mso-footer-margin:.3in;} tr {mso-height-source:auto;} col {mso-width-source:auto;} br {mso-data-placement:same-cell;} td {padding-top:1px; padding-right:1px; padding-left:1px; mso-ignore:padding; color:black; font-size:11.0pt; font-weight:400; font-style:normal; text-decoration:none; font-family:Arial, sans-serif; mso-font-charset:0; mso-number-format:General; text-align:general; vertical-align:bottom; border:none; mso-background-source:auto; mso-pattern:auto; mso-protection:locked visible; white-space:nowrap; mso-rotate:0;} --> </style> </head> <body link="#467886" vlink="#96607D"> batch | Mixtral (tflops, wip_355) | Mixtral-7B (tflops, our branch) | perf boost -- | -- | -- | -- 64 | 865.424 | 995.455 | 15.0% 256 | 886.336 | 1020.96 | 15.2% 1024 | 890.808 | 1022.53 | 14.8% </body> </html> |
||
|
|
5cb8109535 |
[rocm-libraries] ROCm/rocm-libraries#4640 (commit 37b8c81)
Fix the Composable Kernel CI and versions incompatibility (#4640) ## Motivation This PR has 4 patches: 1. Fix the CI error of grouped gemm. 2. Fix the incompatibility of old linux version. 3. Fix the potential errors of flatmm. 4. Address the previous comments of abquant eight warps pipeline solution. |
||
|
|
9c2dd2941b |
[rocm-libraries] ROCm/rocm-libraries#4419 (commit e241f8b)
[CK] Work around staging compiler lifetime warning ## Motivation The staging compiler enables lifetime-safety warnings and we already worked around a few of them. This works around a few more instances that came up recently on gfx950 builds. The initial PR that resolved most issues: https://github.com/ROCm/composable_kernel/pull/3640 ## Technical Details This follows the pattern to locally ignore the newly added lifetime-safety warnings that were moved from experimental to production in upstream LLVM. As a result, CK turned them on and treats them as errors, which prevents the staging compiler from building CK. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
dae352e8dc |
[rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93)
add memsetasync for ck moe splitk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes add memsetasync for ck moe splitk to fix ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered |
||
|
|
e1e2f7ac2e |
[rocm-libraries] ROCm/rocm-libraries#4447 (commit 6d08a99)
[CK] Optimize multi-dimensional static for loop decomposition (#4447) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Recursive template implementations might initially seem attractive to minimize necessary coding. Unfortunately, this style is often affects readability and requires significant resources from the compiler to generate instantiation chains. In "high-traffic" code (e.g., used in many places + compilation units), this generally does not scale well and can bloat the overall compile times to unnecessary lengths. The aim of this PR is to take some of most high-traffic utility code and try our best to eliminate recursive templates in favor of fold expansions and constexpr function helpers. In local tests with clang build analyzer, device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp showed high hit-rates on slow template instantiations in static_for, dimensional static_for (static_ford), which are subsequently affected by implementation of the Sequence class and associated transforms. Example: **** Templates that took longest to instantiate: 70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1... (372 times, avg 188 ms) // **70 seconds!** The above is part of the implementation of static_for which uses Sequence classes.. ## Technical Details ### Summary of Optimization Techniques | Technique | Used In | Benefit | |-----------|---------|---------| | __Constexpr for-loop computation__ | sequence_reverse_inclusive_scan, sequence_map_inverse | Moves O(N) work from template instantiation to constexpr evaluation | | __Pack expansion with indexing__ | sequence_reverse, Sequence::Modify | Single template instantiation instead of recursive | | __Flat iteration + decomposition__ | ford, static_ford | O(1) template depth instead of O(N^D) | | __Pre-computed strides__ | index_decomposer | Enables O(1) linear-to-multi-index conversion | ### Impact on Compile Time These optimizations reduce template instantiation depth from O(N) or O(N^D) to O(1), which: 1. Reduces compiler memory usage 2. Reduces compile time exponentially for deep instantiation chains 3. Enables larger iteration spaces without hitting template depth limits ## Test Plan * Existing tests for Sequence are re-used to affirm correctness * Unit tests for ford and static_ford are added (dimensional looping) * 8 new regression tests specifically verify the fixes for the PR feedback: - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford - `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford consistency - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford - `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with non-trivial ordering ## Test Result ### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized) #### Commits in Range (8 commits) 1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and sequence_reverse 2. `7a7e3fdeef` - Optimize sequence_map_inverse 3. `92855c9913` - Optimize ford and static_ford calls to eliminate nested template recursion 4. `88a564032b` - Add unit tests for ford and static_ford 5. `1a0fb22217` - Fix clang-format 6. `8a0d26bddf` - Increase template recursion depth to 1024 7. `dc53bb6e20` - Address copilot feedback and add regression tests 8. `477e06861d` - Increase bracket depth to 1024 #### Build Timing Results | File | Base (8b72bc8759d9 | HEAD(a0438bd398) | Improvement | |------|------|------|-------------| | grouped_conv2d_fwd (f16) -j1 | 313.31s | 272.93s | __12.9% faster__ | | grouped_conv1d_fwd (bf16) -j1 | 79.33s | 68.61s | __13.5% faster__ | | grouped_conv1d_bwd_weight (f16) -j1| 15.77s | 14.31s | __9.2% faster__ | | device_grouped_conv2d_fwd_instance -j64 | s | s | __% faster__ | #### Key Optimizations 1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1) template depth 2. __sequence_map_inverse__: O(N) → O(1) template depth 3. __ford/static_ford__: O(N^D) → O(1) template depth using flat iteration with index decomposition 4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial orderings ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
04eddbc5ce |
[rocm-libraries] ROCm/rocm-libraries#4471 (commit 10fa702)
[CK] Optimize vector type build times **Supercedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI issues on import** ## Proposed changes Build times can be affected by many different things and is highly attributed to the way we write and use the code. Two critical areas of the builds are **frontend parsing** and **backend codegen and compilation**. ### Frontend Parsing The length of the code, the include header tree and macro expansions all affect the front-end parsing time. This PR seeks to reduce the parsing time of the dtype_vector.hpp vector_type class by reducing redundant code by generalization. * Partial specializations of vector_type for native and non-native datatypes have been generalized to one single class, consolidating all of the data initialization and AsType casting requirements into one place. * The class nnvb_data_t_selector (e.g., Non-native vector base dataT selector) class has been removed and replaced with scalar_type instantiations as they have the same purpose. Scalar type class' purpose is already to map generalized datatypes to native types compatible with ext_vector_t. ### Backend Codegen Template instantiation behavior can also affect build times. Recursive instantiations are very slow versus concrete instantiations. The compiler must make multiple passes to expand template instantiations so we need to be careful about how they are used. * Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>. ``` template <typename T> struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>> { using d1_t = T; typedef T d2_t __attribute__((ext_vector_type(2))); typedef T d4_t __attribute__((ext_vector_type(4))); using type = d4_t; union { d4_t d4_; StaticallyIndexedArray<d1_t, 4> d1x4_; StaticallyIndexedArray<d2_t, 2> d2x2_; StaticallyIndexedArray<d4_t, 1> d4x1_; } data_; ... }; ``` * Upon further inspection, StaticallyIndexedArray is built on-top of a recursive Tuple concatenation. ``` template <typename T, index_t N> struct StaticallyIndexedArrayImpl { using type = typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type, typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type; }; ``` This union storage has been removed from the vector_type storage class. * Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays. ``` template <typename T, index_t N> struct StaticallyIndexedArray_v2 { ... T data_[N]; }; ``` ### Fixes * Using bool datatype with vector_type was previously error prone. Bool, as a native datatype would be stored into bool ext_vector_type(N) for storage, which is a packed datatype. Meaning that for example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4> == 4. The union of these datatypes has incorrect data slicing, meaning that the bits location of the packed bool do not match with the StaticallyIndexedArray member. As such, vector_type will use C-Style array storage for bool type instead of ext_vector_type. ``` template <typename T, index_t Rank> using NativeVectorT = T __attribute__((ext_vector_type(Rank))); sizeof(NativeVectorT<bool, 4>) == 1 (1 byte per 4 bool - packed) element0 = bit 0 of byte 0 element1 = bit 1 of byte 0 element2 = bit 2 of byte 0 element3 = bit 3 of byte 0 sizeof(StaticallyIndexedArray[NativeVectorT<bool, 1>, 4] == 4 (1 byte per bool) element0 = bit 0 of byte 0 element1 = bit 0 of byte 1 element1 = bit 0 of byte 2 element1 = bit 0 of byte 3 union{ NativeVectorT<bool, 4> d1_t; ... StaticallyIndexedArray[NativeVectorT<bool,1>, 4] d4x1; }; // union size == 4 which means invalid slicing! ``` * Math utilities such as next_power_of_two addressed for invalid cases of X < 2 * Remove redundant implementation of next_pow2 ### Additions * integer_log2_floor to math.hpp * is_power_of_two_integer to math.hpp ### Build Time Analysis Machine: banff-cyxtera-s78-2 Target: gfx942 | Build Target | Threads | Frontend Parse Time (s) | Backend Codegen Time (s) | TotalTime (s) | commitId | |---------------|---------|-------------------------|--------------------------|---------------| |