678 Commits

Author SHA1 Message Date
lalala-sh
345a56c55e [rocm-libraries] ROCm/rocm-libraries#5086 (commit f4880d7)
[CK] Fix MOE FP8 SplitK buffer descriptor OOB

When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.

Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.

Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 02:43:30 +00:00
Christopher Millette
e5683e2290 [rocm-libraries] ROCm/rocm-libraries#5031 (commit 1d86a92)
[CK] Replace nested static_for with static_ford to reduce
 device IR function emissions [1B] (#5031)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Rationale
CK's GPU kernels are among the slowest files in the ROCm build, with a
single translation
unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified
nested `static_for`
loops as the root cause: each nesting level multiplies the number of
unique lambda IR
functions the compiler must process. A 2-level nest of `static_for<0, M,
1>` /
`static_for<0, N, 1>` produces M×N unique lambda types. With typical
GEMM dimensions
(M=16, N=4), a single nest generates 64 unique functions — and these
nests appear hundreds
of times across the codebase.

The LLVM backend's CGSCC (Call Graph Strongly Connected Components)
framework processes
each function independently, so reducing function count directly reduces
backend time.

### What changed
393 nested compile-time loop patterns across 73 files are converted to
`static_ford`, which
flattens multi-dimensional compile-time iteration into a single
`static_for` with index
decomposition. This eliminates 994 `static_for` nesting levels (42%
reduction).

Three pattern categories were converted:
- **Category A**: `static_for` wrapping `static_ford` — fold outer
dimension into ford
- **Category B**: nested `static_ford` — merge into single
higher-dimensional ford
- **Category C**: nested `static_for` chains — convert to single
`static_ford`

### Verification

**ASM equivalence: PASS — 51/51 device assembly files identical (gfx942
+ gfx1100)**

| Architecture | Files compared | Largest file | Result |
|---|---|---|---|
| gfx942 | 36 | 386,685 lines | ALL MATCH |
| gfx1100 | 15 | 47,769 lines | ALL MATCH |

**Build time (Wilcoxon signed-rank test, 7 paired trials):**

| Target | Pre (s) | Post (s) | Delta | p-value |
|---|---|---|---|---|
| bscale | 169 | 152 | **-9.8%** | 0.016 \* |
| xdl_v1234 | 207 | 194 | **-6.6%** | 0.016 \* |
| preshuffle | 275 | 264 | **-3.9%** | 0.016 \* |
| xdl_base | 142 | 137 | **-3.2%** | 0.031 \* |

**IR function counts (device backend, gfx942):**

| Target | InstFunc Δ | CodeGen Δ | Compiler Δ |
|---|---|---|---|
| bscale | -13,043 (-8.2%) | -2,103 (-3.5%) | -10.7% |
| xdl_v1234 | -9,431 (-5.7%) | +59 (+0.1%) | -5.2% |
| xdl_base | -6,162 (-4.9%) | -1,141 (-2.5%) | -2.2% |
| xdl_old | -3,234 (-3.7%) | -963 (-8.7%) | -3.3% |

### Value
- **994 fewer `static_for` nesting levels** (-42%) across 73 files
- **393 `static_ford` sites** created (from 4 pre-existing)
- **Up to 9.8% compile-time reduction** on representative targets
(statistically significant, p < 0.05)
- **Up to 13K fewer IR function instantiations** per translation unit
- Net -849 LOC from reduced indentation
- **Zero ASM changes** — identical device code output verified on gfx942
and gfx1100
- All scheduling barriers, `if constexpr` guards, and MFMA/WMMA
accumulation order preserved

### Files changed (73)
- `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle,
blockscale variants)
- `grid/`: 20 files (softmax, normalization, reduction, attention,
layernorm)
- `thread/`: 5 files (tensor slice transfer, contraction, GEMM dlops,
reduction)
- `tensor_description/`: 1 file (tensor_adaptor)

## Test plan
- [x] `static_ford` tested with 21 unit tests in
`test/util/unit_ford.cpp`
  (1D-4D, custom orders, compile-time verification)
- [x] All conversions preserve iteration order, `block_sync_lds()`
placement,
  `if constexpr` scheduling guards, and MFMA/WMMA accumulation order
- [x] ASM equivalence verified: 51 device `.s` files across gfx942 +
gfx1100
- [x] Build-time improvement statistically confirmed (Wilcoxon, p <
0.05, 4 targets)
- [x] IR function count reduction confirmed via `-ftime-trace` on 7
targets
- [x] Detection script reports 0 remaining safe patterns (180 blocked
with structural reasons)
- [x] Existing CI tests (GEMM, softmax, normalization, batch norm,
reduction,
  attention) exercise all converted code paths

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-18 14:46:50 +00:00
Bartłomiej Kocot
9c414d2e59 [rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31)
[CK][CK Tile] Grouped Convolution backward weight profiler
 flush cache (#5454)

## Motivation

Flush cache to get more stable results during profiling old ck and ck
tile.

## Technical Details

Flush cache before each kernel call and one more first run.

## Test Plan

test_grouped_conv_bwd_weight_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-16 17:47:07 +00:00
lalala-sh
a3ccd5dca1 [rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b)
[CK] fix moe memset size which is bigger than alloc

## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:57 +00:00
Bartłomiej Kocot
2169367735 [rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5)
[CK][CK Tile] Improvements for grouped conv fwd tile
 profiling (#5114)

## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details

- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 22:39:20 +00:00
JP-Fernando
d8ee107a47 [rocm-libraries] ROCm/rocm-libraries#4421 (commit 5bb5769)
[CK] Unify the grouped convolution gridwise Run() functions
 (#4421)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

There are currently three different grouped convolution related Run()
function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`.
These are used for the different types of grouped convolution: Forward,
Backward weights, and Backward data.
The functions are very similar and should be unified to a single `Run()`
function for all types of grouped convolution.

## Technical Details

The three old `Run<>()` functions were replaced with a single unified
function.
The new `Run<>()` function is run from device implementations:

-  DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3

-  DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeight_Wmma_CShuffleV3

The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor
implementation uses a different `Run<>()` overload and was therefore not
modified.

## Test Plan

Run the following grouped convolution tests on `gfx1201`, as this
architecture is WMMA-capable:

- `test_grouped_convnd_fwd`

- `test_grouped_convnd_bwd_weight`

- `test_grouped_convnd_bwd_data`

Compilation and testing were also executed on `gfx1100` to avoid CI
problems.

## Test Result

First part (unification of `Run<>()` function): All tests successful.

Second part (integration of single `Run<>()` function as a direct call):
All tests successful.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 16:40:12 +00:00
John Shumway
9f47b8a63d [rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15)
[CK_BUILDER] Add
 DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)

Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.

## Motivation

As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.

## Technical Details

This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.

## Test Plan

Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.

## Test Result

I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 23:43:03 +00:00
Márton Bidlek
683865895e [rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138)
Proof of concept for removing forward declarations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:

9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:

MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).

GetInstanceString() calls instance_string<MyDeviceOp>()

instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()

InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()

So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.

## Technical Details

C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.

## Test Plan

The results were verified by running the existing regression tests for
CK Builder

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-09 16:35:26 +00:00
Christopher Millette
e2ce0cad54 [rocm-libraries] ROCm/rocm-libraries#4673 (commit ec385da)
Compile-time optimize threadwise slice transfer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Profiling with `-ftime-trace` on representative translation units (e.g.,

`device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`)
revealed
that **92% of frontend time was spent in template instantiation**. The
primary
bottleneck was redundant instantiation of identical helper logic across
multiple
threadwise transfer class variants.

Each `ThreadwiseTensorSliceTransfer_v*` class independently contained
its own
copy of the same helper computations — serpentine traversal, coordinate
stepping, thread scratch descriptors, lambda-like functors, and
compile-time
constants — duplicated across 13 header files. When a typical GEMM or
convolution kernel TU includes blockwise operations (e.g.,
`blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants
simultaneously, causing the compiler to instantiate the same helper
logic
multiple times with the same template arguments.

This was compounded by the helpers being defined as members of the outer
`ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template
parameters.
Functions like `ComputeForwardSweep` depend only on their two argument
types,
but as inline members of the outer class, the compiler was forced to
create
separate instantiations for every unique combination of all outer
parameters
(data types, descriptors, vector widths, etc.) — even when most of those
parameters had no effect on the helper's output.

## Technical Details

### The Fix: Shared Helper Struct Hierarchy

Duplicated logic was extracted into a standalone helper hierarchy in
`threadwise_tensor_slice_transfer_util.hpp`:

```
ThreadwiseTransferHelper_Base          (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor,
|                                       ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple)
+-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex,
|                                       ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable)
+-- ThreadwiseTransferHelper_SFC       (ComputeSFCCoordinateResetStep)
```

Each helper method is now parameterized **only by what it actually
uses**:

- `ComputeForwardSweep(idx, lengths)` — parameterized only by the two
argument
  types, not by `SrcData`, `DstData`, `SrcDesc`, etc.
- `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by
the
  descriptor and access sequence types.
- `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector,
DimAccessOrder>()` — parameterized only by the four values it actually
needs.

This reduces template instantiation work in two ways:
1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather):
the
compiler reuses a single instantiation instead of creating one per
variant.
2. **Across different outer class instantiations** (fp16 vs bf16 vs
int8): the
compiler reuses the helper instantiation because the helper doesn't
depend
   on the data type at all.

### Refactored Headers

**13 headers** now delegate to the shared helpers instead of duplicating
logic:
- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1

### Additional Fixes Found During Refactoring

- Two latent bugs in v3r2 (`forward_sweep` indexing,
`GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)

### Net Code Change

+1,428 / -2,297 lines (~870 lines removed).

## Test Plan

### Unit Tests

28 host-side gtests in
`test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp`
covering the full helper hierarchy:

| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants,
inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false
in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep`
(even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete),
`ComputeDataIndex`, `ComputeCoordinateResetStep`,
`VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` —
single access, 2D row-major, 2D column-major, 3D batch, even/odd inner
access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive
from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`,
`lambda_scalar_step_in_vector`,
`lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |

### Semantic Equivalence

GPU ISA comparison using `--cuda-device-only -S` confirmed identical
assembly
output (modulo `__hip_cuid_*` metadata) between baseline and refactored
code.

## Test Results

All measurements on a 384-core machine, `-j64`, freshly rebooted,
near-idle.

### Targeted Builds (affected targets only)

| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|-----------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU |
6m 53s / 161m CPU | **-9.7%** | **-14.9%** |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU |
6m 42s / 182m CPU | **-31.8%** | **-10.0%** |
| **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** |
**-22.2%** | **-12.4%** |

### Full Project Build (8,243 targets)

| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | **-1.2%** |

\*Wall-clock inflated by external load spike during refactored build
(load 90 vs 66). CPU time is the reliable metric.

### Context

~15% of all build targets (1,262 / 8,243) transitively include the
modified
headers. These are primarily GEMM and convolution kernel instantiations
— the
core compute workloads. The 12-15% CPU savings on affected targets is
diluted
to 1.2% across the full project because 85% of targets are unaffected.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:27:59 +00:00
lalala-sh
b0c13f3124 [rocm-libraries] ROCm/rocm-libraries#5094 (commit d4548e6)
[CK] use int64 for ptr offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3),
the `expert_id * expert_stride` calculation in MOE GEMM kernels
overflows `int32` (`index_t`), causing the weight matrix (B) pointer to
wrap to an invalid address and triggering a GPU memory access fault.

For example, with `N=1024, K=7168, IsInputGemm=true`:
- `expert_stride = N * K * 2 = 14,680,064`
- `INT32_MAX / expert_stride ≈ 146`
- Any `expert_id >= 147` causes overflow → negative offset → illegal
memory access → GPU crash

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: amd-shiraz <shiraz.ali@amd.com>
2026-03-06 02:01:03 +00:00
Max Podkorytov
1dd47118e2 [rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas
 with named functors (#4828)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add `generate_identity_sequences<N>()` helper that returns
`Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>`
- Replace lambdas with named functors in `transform_tensor_descriptor`
- Add `unpack_and_merge_sequences` helper functor
- Reduces `transform_tensor_descriptor` instantiations from 388 to 32
(92% reduction)

## Motivation

Multiple call sites use `generate_tuple([](auto i) { return
Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda
instantiations.

Additionally, each lambda in `transform_tensor_descriptor` creates a
unique closure type, causing the function to be instantiated separately
for every call site. Named functors share a single type, so the compiler
reuses the same instantiation.

## Changes

### Part 1: generate_identity_sequences helper
- Replaces common lambda pattern for generating identity sequences
- Each lambda expression creates a unique closure type, causing separate
template instantiations at every call site
- Named helper shares a single type across all uses

### Part 2: Named functors in transform_tensor_descriptor
- Add `unpack_and_merge_sequences` helper to replace lambda in
`GetNumOfHiddenDimension`
- Use `generate_identity_sequences` in `matrix_padder.hpp`

## Test Plan

- [x] Added 7 unit tests:
  - 4 tests for `generate_identity_sequences`
  - 3 tests for `unpack_and_merge_sequences`
- [ ] Waiting for full CI

## Related PRs

This PR merges the functionality from:
- ROCm/composable_kernel#3588 (generate_identity_sequences helper)
- ROCm/composable_kernel#3589 (Named functors in
transform_tensor_descriptor)

Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times)

**Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and
ROCm/composable_kernel#3589, which can be closed once this is merged.
2026-02-28 20:11:11 +00:00
Bartłomiej Kocot
ef82340e05 [rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped
 kernels (#4875)

## Motivation

Port non-grouped convolution instances to the grouped kernels to
deprecated older non-grouped implementations.

## Technical Details

Add the same instances as non-grouped but using grouped kernel.

## Test Plan

test_grouped_convnd_fwd

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-724
2026-02-28 01:25:33 +00:00
kabrahamAMD
5e06874aae [rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory
 (#4582)

This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
2026-02-27 03:06:29 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for
 rdna4 (#4425)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Add support for grouped gemm multi ABD fixed NK. MR

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Illia Silin
1915cdfcc2 [rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in
 GridwiseGemm_wmma_cshuffle_v3" (#4762)

Reverts ROCm/rocm-libraries#4638
unfortunately, this PR interfered with the PR#4299 and caused build
errors for gfx11:

In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:

/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-02-20 22:41:34 +00:00
linqunAMD
29781f2ac4 [rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3
 (#4638)

1. Add GetEstimateVgprCount to estimate the VGPR usage in
GridwiseGemm_wmma_cshuffle_v3
2. Add IsValidCompilationParameter to disable kernel which use too many
vgprs.
- Currently, the threashold is AvailableVgprCount * 1.25
3. Modify examples to avoid test is disabled on gfx11

It is port from internal repo
PR[#192](https://github.com/ROCm/composable_kernel/issues/192)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-20 15:57:18 +00:00
Márton Bidlek
7b97e197ef [rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using for WMMA.

The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.

At this stage, the focus is functional correctness and compatibility,
not performance tuning.

## Technical Details

- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-19 08:13:46 +00:00
lalala-sh
dae352e8dc [rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93)
add memsetasync for ck moe splitk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

add memsetasync for ck moe splitk to fix

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-12 17:45:52 +00:00
Johannes Graner
e88f139c6c [rocm-libraries] ROCm/rocm-libraries#4271 (commit 6fce58e)
[Conv] Add NumGroupsToMerge to BwdWeight type string
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Add parameter to bwd weight V3 type string showing the number of groups
to merge. This is required for MIOpen to be properly tuned since it uses
type strings for performance database entries.

In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-11 09:08:38 +00:00
Ville Pietilä
57d26db844 [rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504)
[CK] Add fwd conv group merging to v3 conv instances
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channel per group is low.

On MI300 (`gfx942`) we get

| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1
| 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1
| 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1
| 11.7875 | 16.3657 |
2026-02-08 11:35:56 +00:00
Enrico Degregori
984a3d1828 [rocm-libraries] ROCm/rocm-libraries#4372 (commit 738ffd7)
[CK] Workaround blockscale wp test failure

## Motivation

Workaround to fix blockscale wp test failure for pipeline v3

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-07 00:09:58 +00:00
Illia Silin
4dd4869fbf [rocm-libraries] ROCm/rocm-libraries#4361 (commit 37a74ef)
[CK]  a bunch of CI fixes.

## Motivation

Fixing some of the CK CI issues

## Technical Details

fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
2026-02-06 01:07:34 +00:00
Illia Silin
569640dc70 Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)" (#3705)
This reverts commit 301eb5cf08.
2026-02-03 09:52:14 -08:00
Illia Silin
8b56ffb6ae Fix one more lifetimebound error. (#3703)
* fix staging compiler errors

* fix clang format
2026-02-02 18:25:56 -08:00
Zoltán Lakatos
301eb5cf08 Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)
* device struct implementation

* added xdl grouped multi abd fixed nk testing

* wmma implementation fixed

* avoid unnecessary device mem allocation and code cleanups

* cleanup instances definitions

* wmma examples added

* code cleanups

* fix clang format

* typo and compilation fixes related to reference gemm

* fix compilation error due to std::remove_cvref_t

* added missing hip_check_error includes

* correction to example instances

* review commentes addressed

* removed split-k from testing

* code formatting

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 13:58:11 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
ApoorvaKalyani
70d71b1514 Test fix for gemm_b_scale_xdl_v3. (#3674) 2026-01-30 10:34:54 -07:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Zoltán Lakatos
565fea2645 fix undefined behaviour in softmax kernel (#3683)
Co-authored-by: root <zoltan.lakatos@streamhpc.com>
2026-01-30 15:22:54 +08:00
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improviments to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Johannes Graner
fabac7e2c3 [Conv] Enable bwd weight splitk autodeduction with cap (#3656)
* Enable bwd weight splitk autodeduction with cap

* Fix error threshold calculations

* Add missing logic to wmma multiple d kernel

* Fix threshold calculation

* Update test with new applicability
2026-01-29 17:40:28 +00:00
Bartłomiej Kocot
83b58bb0c3 Grouped Conv Bwd Weight Direct Load (#3648)
* Grouped Conv Bwd Weight Direct Load

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

* Implement group merging for bwd_weight and add instances

* Link direct load instances

* builder fixes

* fix

* fixes

* fix

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
2026-01-28 15:31:54 -06:00
linqunAMD
23cefda140 [ck] add gridwise base class for in all xdl kernel (#186) (#3544)
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
2026-01-27 12:49:47 -08:00
Michał Kulikowski
b737f1dee5 [CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263)
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-27 10:48:16 -08:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
Enrico Degregori
2e49b6b2f7 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move check before writing storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks 8 bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format
2026-01-26 12:57:09 -08:00
chris-tsiaousis-hpc
917f35553a Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617)
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation

This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-26 10:20:30 -08:00
Ville Pietilä
7ac3794284 Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639)
Co-authored-by: Ville Pietilä <>
2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc
e1c46ff548 Remove code duplications in batched gemm wmma (#3580)
* Moved device struct for batched gemm wmma to a common file

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Use the common device struct in the scaled batched gemm wmma implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: Remove unused includes and ambiguous comment

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved pointer offset calculation and gridwise argument to common struct

This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common"

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-23 12:39:03 -08:00
Wojciech Laskowski
81ee19bd2c WMMA grouped conv fwd large tensor extra flavors (#3582)
* Additional flavors for WMMA conv fwd large tensor

- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accomodate extra tensors

* changed strategy to handle GemmArgs array

* Adding generic instance

* Added generic instance to clamp and bias_clamp ops
2026-01-23 12:19:51 +01:00
Erwin Terpstra
d5ae81b292 Implement batched gemm add relu gemm add for rdna4 (#3391)
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation

* wip: many fixes in implementation of batched gemm gemm multiple d

* wip: batched gemm gemm multiple d gridwise op compiling, not working yet

* fix: incorrect d0 grid indexing in batched gemm gemm multipled

* feat: add instances for batched gemm add relu gemm add

* chore: configure instance with low vector transfer size for odd sizes

* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense

* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes

* fix: disable odd size tests on XDL archs

* chore: removed temporary logging

* chore: update some references to C tensor to E tensor

* Tentative fix for example template params

* Tentative fix for non-multi-D batched gemm gemm device impl.

* Tentative fix for xdl example template params

* Tentative fix for profiler build on gfx90a

* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions

* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply

* fix: make the gemm1 data types match what happens in the device op

* feat: add d0s/d1s datatypes and layouts to the device op type string

* chore: change element-wise op so addition happens in fp32

* chore: add static asserts for gemm0/gemm1 calculated wave sizes

* chore: also updated other element-wise ops to use fp32 calculations

* chore: log number of supported instances

* chore: update instance comment

* chore: disable kernel timing in example by default

* fix: gemm1 wave size calculation

* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions

* chore: remove increased tolerance in batched gemm gemm multiple d example

* chore: add comment explaining that verification fails for certain input values

* chore: clarify instance comment

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2026-01-20 13:06:59 -08:00
music-dino
6300ad3c62 Batched gemm softmax gemm descriptor fix (#3564)
* Add rocm to prefix path for codegen

* Fix issue with c0_matrix_mask construction
2026-01-20 07:25:30 -08:00
Wojciech Laskowski
b09121f860 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests
2026-01-20 10:50:46 +01:00
Erwin Terpstra
fe40a5d139 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
2026-01-17 08:30:27 +01:00
Yung-sheng Tu
6df2d70143 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples
2026-01-15 07:19:31 -08:00
John Shumway
5122637215 [CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547)
* Factor helpers out of conv_traits.hpp

* Create a non-templated conv_traits struct

* Migrate to new instance-specific instance_to_conv_traits functions

* Clean up reflection concepts

* Clean up ConvTraits helpers

* Update testing for convolution traits

This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.

* Address reviewer comments and resolve merge conflict
2026-01-15 10:03:21 +01:00
Bartłomiej Kocot
a346cfa960 Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566) 2026-01-14 12:37:12 -08:00
Bartłomiej Kocot
a07c8e38bd Fix grouped conv bwd data wmma check (#3562) 2026-01-14 11:04:37 -08:00
Enrico Degregori
693ff3bbb3 Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465)
- Add support for direct store in epilogue instead of cshuffle
 - Add padding support for wave transfer without transpose
 - Add wave transfer with interleaved layout to support direct store
 - Enable new functionalities on GEMMs
 - Add optional new functionality support for grouped convolution fwd
 - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)
2026-01-14 11:02:19 +01:00