Commit Graph

14 Commits

Author SHA1 Message Date
Aviral Goel
15c904b460 [rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a)
ck_tile: add FillUniformScaleDistribution and fix MX GEMM
 scale init (#7724)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Problem
MX GEMM pipeline tests were passing vacuously: scale bytes were drawn
from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far
below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so
numerical checks passed without exercising the GEMM.

### Changes

**`include/ck_tile/host/fill.hpp`** — new
`FillUniformScaleDistribution<ScaleType>` functor
- Accepts human-readable float bounds and maps them to the raw byte
range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE
754 exponent into the type's bias space
- Sampling is uniform over raw bytes → uniform over representable values
- Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to
avoid shifting negative signed integers (C++17 UB)
- Adds `assert(min_r <= max_r)` to catch inverted-range UB when both
bounds exceed the type's representable range
- Provides default member values (0.125f, 2.0f) and `std::optional` seed
consistent with sibling fillers
- `/** */` Doxygen style with `@note` on snapping asymmetry

**`test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`** — fix scale
initialization
- Replace manual byte-range distribution with
`FillUniformScaleDistribution<>{0.125f, 2.0f}`
- Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid
correlated scale tensors that were causing 60 test failures for
fp4+e5m3/e4m3 combinations

**`test/ck_tile/utility/test_fill.cpp`** — new unit tests for
`FillUniformScaleDistribution`
- 16 typed tests across e8m0, e4m3, e5m3: validity, range,
reproducibility, coverage, snapping, stress, nullopt seed, and range
overload
- Test helper `expected_raw_range` mirrors implementation clamping
exactly
2026-05-29 18:45:13 +00:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
JH-Leon-KIM-AMD
720ceb6500 [rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)
[CK Tile] Support multi-vector reads in static encoding patterns  (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.

Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.

This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked
specializations:

- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.

The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.

Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).

### 3.
`experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.

## Test Plan

The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.

For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd        [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data   [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |

|------|-------------|----------|------------|----------------|-------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir)|
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3
dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.

For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-20 17:25:22 +03:00
Christopher Millette
c8cf71f179 [rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650)
[CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)

## Problem

Each `static_for<0, N, 1>` instantiates its lambda N times (one per
`number<I>` type). When nested, intermediate lambdas capture the outer
loop variable (a different type per iteration), creating unique closure
types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32
= 52 IR functions, of which 20 are intermediate closures that get
inlined away but still cost frontend compile time.

ck_tile's `static_ford` was supposed to eliminate these intermediates
(as old CK's PR #5031 did successfully), but it used a **recursive**
`static_ford_impl` that recreated the same closure pattern plus added
`reorder_old_to_new`/`reorder_new_to_old` overhead.

Additionally, the sequence utility layer (`sequence_sort`,
`is_valid_sequence_map`) used recursive template metaprogramming that
generated O(N log N) intermediate types for every permutation validation
— called on every `reorder_new_to_old`/`reorder_old_to_new` invocation.

## Changes

### 1. Replace `sequence_sort` with constexpr insertion sort
Replace recursive merge sort (`sequence_sort_impl` +
`sorted_sequence_merge_impl`, O(N log N) intermediate type
instantiations) with constexpr insertion sort using `static_array`. O(1)
template depth, same `::type` and `::sorted2unsorted_map` API.

### 2. Replace `is_valid_sequence_map` with constexpr check
Replace sort-based permutation validation (which instantiated the full
`sequence_sort` chain) with a constexpr "seen array" loop. O(N)
constexpr steps instead of O(N log N) template instantiations.

### 3. Replace recursive `static_ford` with flat-loop `index_decomposer`
Replace `static_ford_impl` (recursive `static_for` nesting +
`pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat
`index_decomposer` using pre-computed strides. Add `decompose_reordered`
alias that folds reordering into decomposition, and `inverse_perm`
helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map`
→ `sequence_sort` chain.

### 4. Eliminate internal lambda via `ford_applier`
The flat-loop approach still used `static_for` with a lambda, creating
M×N internal lambda instantiations per call site. Replace with
`ford_applier` struct that calls `f(decompose<I>{})` directly via fold
expression — zero intermediate closures:

```cpp
// Before: 2×M×N function instantiations
static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); });

// After: M×N function instantiations (50% reduction)
ford_applier<Decomposer, make_index_sequence<M*N>>{}(f);
```

Also unified identity and non-identity order paths into a single
template with `constexpr if`.

### 5. Fix const-qualified sequence handling
Fix `is_valid_sequence_map` to handle const-qualified sequence types via
`remove_cvref_t` in callers (`tensor_adaptor.hpp`,
`tile_distribution_encoding.hpp`).

## Results (this PR only, without flattening)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5)

| Target | Base (s) | Treat (s) | Delta | % | Wins | Significant? |
|--------|----------|-----------|-------|---|------|-------------|
| **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES**
(W+=1, p<0.05) |
| universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4)
|

Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5]
Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6]

### IR Function Counts (device trace, gfx942)

| Target | Metric | Before | After | Delta | % |
|--------|--------|--------|-------|-------|---|
| **universal_gemm** | InstantiateFunction | 117,715 | 109,165 |
**-8,550** | **-7.3%** |
| **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** |
**-6.0%** |
| **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** |
**-5.8%** |
| **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** |
**-5.4%** |

Note: The `ford_applier` (commit 3) has minimal additional effect in
this PR since ck_tile code does not yet use `static_ford` extensively.
Its impact compounds when the follow-up flattening PR #5939 converts 124
`static_for` nests to `static_ford`. Combined results with #5939: flatmm
**-7.5%** wall time (p<0.01), CodeGen **-10.5%**.

### ASM Equivalence
7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 +
gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] `test_ck_tile_static_ford`: 13 behavioral tests
(identity/non-identity orders, 1D-4D, unit dimensions, edge cases)
- [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map,
is_valid_sequence_map edge cases, sequence_unique_sort map round-trip)
- [x] ASM equivalence verified (980K lines)
- [x] Wilcoxon timing verified (7 trials, flatmm p<0.05)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 15:25:14 -06:00
Christopher Millette
4932bc3b7b [rocm-libraries] ROCm/rocm-libraries#5028 (commit 5131491)
[CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028)

## Summary

### Rationale
`ck_tile::sequence` is the most fundamental metaprogramming type in
ck_tile — it underpins
tensor dimensions, strides, loop bounds, and index calculations. Six of
its metafunctions
use recursive template instantiation, producing O(N) to O(N²)
intermediate types that
the compiler must process. When these are used inside deeply nested GEMM
pipelines with
large dimension counts, the cumulative instantiation overhead becomes a
significant
contributor to frontend compile time.

Measurements on `test_gemm_pipeline_compv6` show 84,288
`InstantiateFunction` calls in
the frontend alone. Reducing template instantiation depth in these core
utilities has a
multiplicative effect because they are called from hundreds of sites.

### What changed

| Metafunction | Before | After |
|---|---|---|
| `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion
|
| `sequence_gen` | O(log N) recursive binary split | O(1) via
`__make_integer_seq` |
| `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via
`__make_integer_seq` |
| `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr
for-loop + pack expansion |
| `sequence_inclusive_scan` | Computed via reverse + flip | O(1)
constexpr for-loop (unified impl) |
| `sequence_exclusive_scan` | O(N) recursive merge chain | O(1)
constexpr for-loop + pack expansion |
| `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr
for-loop + pack expansion |

Supporting changes:
- Portable `__type_pack_element` fallback with `__has_builtin` guard
(hipRTC-safe, no `<tuple>` dependency)
- Renamed reserved `__integer_sequence` to `integer_sequence_wrapper`
- Adopted `static_array` from develop (PR #4355) for constexpr
computation
- Unified forward and reverse inclusive scan into a single
`sequence_inclusive_scan_impl` with `bool Reverse` template parameter
- Added `sequence_inclusive_scan` struct (new public API for forward
scan direction)
- Replaced recursive `sequence_exclusive_scan` (3 template
specializations) with `sequence_exclusive_scan_impl` using the same
constexpr for-loop pattern as inclusive scan
- Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new
impl
- Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and
`prefix_sum_sequence` to match sibling scan function annotations

### Technical debt and housekeeping
- Unified all `namespace impl` to `namespace detail` across sequence.hpp
for consistency
- Removed dead comment block (orphaned `integer_sequence` alternative)
- Added defensive `static_assert(sizeof...(Is) > 0)` in
`sequence_map_inverse::build_inverse`
- Converted all multi-line Doxygen blocks from `///` to `/** */` per
style guide
- Corrected `constexpr static` to `static constexpr` keyword ordering in
`static_array`
- Added blank line between `#pragma once` and first `#include` in
`static_array.hpp`
- Trimmed redundant 4-line comment on `sequence_gen_helper` to a
one-liner
- Moved `sequence_gen` Doxygen comment below `namespace detail` block so
it directly precedes the struct it documents
- Added Doxygen `@brief`/`@tparam`/`@pre` documentation for
`sequence_gen` and `sequence_map_inverse` public APIs
- Added `@brief` documentation to `static_array` explaining relationship
to `ck_tile::array`
- Added scope comment at `namespace detail` openings

**Note:** `private:`/`public:` access modifier indentation is enforced
at 4 spaces by
`.clang-format`. The style guide calls for left-alignment, but the
formatter overrides
this. Requires a `.clang-format` config change to resolve — not
addressable in code.

### `static_array` hardening (from develop's PR #4355)
- Added zero-length array guard (`T elems[N > 0 ? N : 1]`)
- Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()`
- Added `#include "ck_tile/core/config.hpp"` (IWYU for
`CK_TILE_HOST_DEVICE`)

### Value
Combined with the `static_ford` changes, measured impact on
`test_gemm_pipeline_compv6`:
- **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439)
- **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203)
- **Wall-clock: -16.3%** (611.6s → 512.2s)

### Files changed (4)
- `sequence.hpp`: Metafunction optimizations, namespace unification,
documentation, style fixes
- `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`,
documentation, style fixes
- `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions
(new file)
- `CMakeLists.txt`: Register new test target

## Test plan
- [x] 50 runtime unit tests covering all optimized and pre-existing
sequence APIs
- [x] Edge cases: empty sequences, single-element, larger sizes (N=8),
negative values, non-trivial init values
- [x] Both functor signatures tested (`operator()(index_t)` and
`operator()(number<I>)`)
- [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse,
inclusive, and exclusive directions
- [x] Exclusive scan: sum, product, single, empty, non-zero init
- [x] Prefix sum: N+1 output verification, single, empty
- [x] Permutation round-trip verification for `sequence_map_inverse`
- [x] Full sequence public API coverage: modify, gen, uniform_gen, scans
(inclusive, exclusive, prefix sum), map_inverse, make_index_sequence,
size/sum/product, push/pop, reverse, extract, merge, arithmetic
operators, equality, transform
- [x] Portable `__type_pack_element` fallback tested implicitly (same
`at_index_t` interface)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-03-11 14:24:54 -06:00
John Afaganis
ec23be0b9d Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
2026-01-02 22:16:41 -07:00
Aviral Goel
4dcc3e59c1 chore: update copyright header for misc files (#3402)
* chore: update copyright header for misc files

* fix: typo in kernel resulting in ci failure
2025-12-11 08:25:29 -08:00
Yi DING
c1c2e41a03 [CK_TILE] Generate random tensor values with multiple threads (#3324) 2025-12-09 11:02:33 +08:00
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Aviral Goel
f6c999bddb chore(copyright): update copyright header for test directory (#3265) 2025-11-22 19:38:27 -05:00
Aviral Goel
45be741586 fix: fix bug in print tile window when printing bf8/fp8 tiles (#3120)
* fix: fix bug in print tile window when printing bf8/fp8 tiles

* test(print_tile_window_range): add unit tests to maintain function integrity

* fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8
2025-11-01 15:28:07 -04:00
msaffari-amd
47d020a993 refactor: use snake_case naming in ck_tile/core components (#2766) 2025-09-03 09:34:11 +02:00
John Shumway
9751583f95 Replace auto with function template for c++17. (#2754)
In #2443, a helper function was added test new print functionality, but it used auto for the function parameter types.

We need to support c++17 for downstream libraries, so we cannot use auto there. This PR replaces it witht the equivalent function template implementation.
2025-08-27 17:10:39 -07:00
Sami Remes
ffdee5e774 [CK_TILE] Enable printing more structures in CK-Tile (#2443)
* Add more printing to core cktile

* Revert other changes in static encoding pattern

* Refactor to using a free print() function

* Remove loops and print just the containers

* Print tuple with better formatting, fix sequence compilation

* Add some tests for print utility

* Add print utility header

* Print for static_encoding_pattern

* add buffer_view printing

* Align vector_traits

* Fix formatting

* Lower-case enum strings

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Remove empty comment lines

* Fix test with lower-case too

* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y

* Add test_print_common.hpp

* add print.hpp in core.hpp

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-08-07 15:45:27 +03:00