39 Commits

Author SHA1 Message Date
Bartłomiej Kocot
7c2b979de2 [rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d)
[CK][CK Tile] Drop profiler for experimental builder codegen
 (#8573)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Switch to dispatcher profiler for ck tile conv.

## Technical Details

- Switch to dispatcher profiler for ck tile conv.
- Drop profiler for experimental codegen
- Minor fixes for bwd data printing
- Minor fixes for 3d conv in dispatcher codegen

## Test Plan

test_grouped_conv*tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 09:38:44 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems

- The JSON files are verbose
- The JSON files get easily out of sync with the CK Builder .config
files from which they were generated from.
- The JSON file based configuration make it hard to list explicitly the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This is PR introduces six different rule sets for the CK Tile Dispatcher
code generation

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimensions corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This mainly defined by the data type and hardware as in general, we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations compose
approximately 25% of all configuration that the `full` rule set creates.
The additional configurations are valid, but they many not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule set
for now to avoid building an excessive amount configurations by default.
The `full` rule set can be taken into use by specifying CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing bahaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
jakpiase
65bef78383 [rocm-libraries] ROCm/rocm-libraries#8518 (commit 1ad69c3)
[CK] Add support for large tensor index handling into conv
 bwd data (#8518)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-17 15:51:36 +00:00
Bartłomiej Kocot
28f2966762 [rocm-libraries] ROCm/rocm-libraries#7734 (commit 03ffb9d)
[CK] Grouped Convolution Global Load/Store instances

## Motivation

Support global load and store in grouped convolutions using instance
factory.

## Technical Details

- add new instances for each direction
- add new tests for large cases

## Test Plan

New test for large cases

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1255
2026-06-06 22:52:59 +00:00
Brock Hargreaves
b2a3ffea5d [rocm-libraries] ROCm/rocm-libraries#5945 (commit 8f9a5fe)
[CK] [MIOPEN] Split convolution library by layout
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

# Split Composable Kernel convolution operations by data layout

TLDR:

1. This is a reorganization of files, folders, and CMakeLists for
convolution kernels and facilitates a splitting of the convolution
library into layouts.
2. The speedup can range anywhere between 15-40% depending on the target
architecture for miopen only builds of CK. For TheRock nightly builds of
CK, which includes both miopen and hip tensor kernel instances, this
constituted in a 10% decrease in compile time for gfx1100.
## Overview

Based on https://github.com/ROCm/composable_kernel/pull/3010/ (except
keeping 1 static library)

## What MIOpen Actually Uses

MIOpen **exclusively uses:
- **NHWGC** for all 2D convolutions
- **NDHWGC** for all 3D convolutions
This is because MIOpen's tensor descriptors natively use channel-last,
group-aware formats.
## Key Changes

### 1. Layout-Based Directory Structure
Reorganized convolution instance files from flat per-operation to
hierarchical layout-based structure. For example:

**Before:**
grouped_conv2d_fwd/
├── device_grouped_conv2d_fwd_xdl_nhwgc_*.cpp (MIOpen-required)
├── device_grouped_conv2d_fwd_xdl_gnhwc_*.cpp (optional)
└── device_grouped_conv2d_fwd_xdl_ngchw_*.cpp (optional)
**After:**
grouped_conv2d_fwd/
├── nhwgc/ ← MIOpen-required
│   ├── xdl/device_grouped_conv2d_fwd_xdl_*.cpp
│ └── wmma/device_grouped_conv2d_fwd_wmma_*.cpp
├── gnhwc/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY)
└── ngchw/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY)
### 2. Preserved Umbrella Library
As before, all convolution operations are consolidated into a single
static `device_conv_operations` library:
- Aggregates layout-specific instance object files via
`ADD_CONV_LAYOUT_INSTANCES` macro
- **Default build:** Includes all layouts (NHWGC + GNHWC + NGCHW +
NDHWGC + GNDHWC + NGCDHW)
- **MIOpen build (`MIOPEN_REQ_LIBS_ONLY=ON`):** Includes only NHWGC and
NDHWGC layouts
### 3. Binary Size Reduction
When building with `MIOPEN_REQ_LIBS_ONLY=ON`:
**Layouts Included (26 targets):**
- 7× NHWGC instances (2D operations + variants)
- 19× NDHWGC instances (3D operations + variants)

**Layouts Excluded (16 targets):**
- 3× GNHWC instances (2D operations)
- 3× NGCHW instances (2D operations)
- 3× GNDHWC instances (3D operations)
- 3× NGCDHW instances (3D operations)
- 2× GNWC instances (1D operations)
- 1× NWGC instance (1D operations)
- 1× additional NHWGC instance (grouped_conv1d_fwd, not needed by
MIOpen)
This represents a **~38% reduction in instance targets** (16 excluded
out of 42 total
layout-specific targets).

### Testing
-  All existing CK tests link against the umbrella library
-  MIOpen links successfully with the reduced umbrella library
-  Profiler builds with all layout-specific targets explicitly listed

Notes from the Author:

Since this refactor moved most of the convolution files further into
subdirectories, I concentrated on ensuring that no source files were
excluded, including sharded sources: Targets are correctly migrated — no
missing targets, no shard count mismatches.
2026-06-05 15:09:20 +00:00
Illia Silin
aef7b42883 [rocm-libraries] ROCm/rocm-libraries#7816 (commit f6324af)
[CK] Fix latest build issues with staging compiler.

## Motivation

Fixing new warnings with staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-04 17:41:09 +00:00
Illia Silin
c24e528481 [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)
[CK] suppress compiler warnings while building pytorch. (#7760)

## Motivation

Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 06:56:58 -07:00
JH-Leon-KIM-AMD
00e1d82ae7 [rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9)
[CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732)

## Motivation

Fix silent miscompute in the grouped convolution backward-data kernel
(`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride >
dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast
path that dropped all but the first sub-GEMM, producing zeroed slices of
`dx` on
the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness
without giving up the perf gains PR #6208 delivered on stride=1 shapes.

## Technical Details

- Tighten the flat-descriptor fast-path gate to require
`arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its
original purpose). For stride > 1, the implicit GEMM is split into
`gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly;
routing them through the flat path required dropping all but the first,
which was the source of the bug.
- Stride > 1 now falls through to the existing grouped CShuffle path,
which packs all sub-GEMMs into one descriptor array and walks them
on-device in a single kernel launch. This is the pre-PR-6208 production
path; correctness is established and per-dispatch launch count is
minimised.
- Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0)
intersection in
`test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with
`gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this
intersection (all stride>1 cases used G=2; all G=1 cases used stride=1),
which is why PR #6208's regression slipped past CI.

## Test Plan

- `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke
tier — runs on every PR via `smart_build_and_test.sh`).
- End-to-end verify (`verify=1`) via
`example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes
including the original ALMIOPEN-1959 case and a cross-bucket
(`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32`
buckets.
- ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate
via an environment variable: full kernel-family enumeration, winning
kernel + its avg_time reported under each gate. 33/41 shapes completed
before the sweep was stopped; the remaining 8 were the largest
i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape
enumeration budget (not relevant to the verdict).

## Test Result

### Correctness

| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D,
includes 3 new regression shapes) | **12/12 PASSED** in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | **PASSED** in
0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | **PASSED** |
| stride=1 K3 (`verify=1`) | **PASSED** |
| stride=3 K3 `gemms_count=9` (`verify=1`) | **PASSED** |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | **PASSED** |

### Performance (ckProfiler A/B on gfx942 / MI300X)

Comparing the **post-fix gate** (flat path only when `gemms_count_==1`,
column "B") vs the **inner-loop variant** that keeps the flat path on
stride>1 (column "A") across 25 stride>1 shapes where production picks
a `_v1` instance (so the gate actually fires):

| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate — A
== B as expected |
| > 1 (gate fires) | 25 | **0** | 11 | **14** | B wins +6% to +32%; A
never wins |

Highlights from the firing-gate cases:

| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | **B
+6%** |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | **B +17%**
|
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | **B +28%** |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | **B +32%** |

Why B wins everywhere the gate fires: for `gemms_count = N`, the flat
path needs N kernel launches (one per sub-GEMM), while the grouped path
loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) ×
launch-tax is a structural disadvantage A can't recover from.

### Diff

| File | Lines |
|---|---:|
|
`include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp`
| +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 09:59:14 +03:00
Bartłomiej Kocot
ebb97044f4 [rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1)
Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664)

## Motivation

Incorrect results has been introduced for some conv bwd cases.

## Technical Details

This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3.

## Test Plan

CI

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ALMIOPEN-1959
2026-05-22 12:28:49 +00:00
JH-Leon-KIM-AMD
9a5d1ea791 [rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6)
[CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208)

## Motivation

Improve performance of grouped convolution backward-data kernels to
match non-grouped kernel performance for G=1 cases.

## Technical Details

- Add NoShuffle epilogue path (direct VGPR→Global writes) by setting
`CDEBlockTransferScalarPerVector_NPerBlock = 1`
- Add nongrouped-match instances with optimized BBlockTransfer
parameters for better thread utilization
- Add packed (flat) descriptor path for G=1 2D convolutions, using
simpler tensor descriptors with fewer transform layers to reduce address
computation overhead in the GEMM main loop
- Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX
zeroing cost)

## Test Plan

- Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16)
- Verify correctness with existing conv bwd data tests

## Test Result

| Metric | Before | After |
|--------|--------|-------|
| Mean ratio (grouped/nongrouped) | 1.159 | **1.028** |
| Median ratio | 1.142 | **1.026** |
| Cases within 2% | 26 (4.4%) | **186 (31.8%)** |
| Cases >20% slower | 188 (32%) | **2 (0.3%)** |

NoShuffle + nongrouped-match instances achieve **~2.8% average gap**
with non-grouped kernels (down from ~16%).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com>
Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com>
Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 06:49:50 -07:00
Illia Silin
717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
2026-05-15 06:46:51 -07:00
Illia Silin
ac18460782 [rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70)
[CK] Suppress new staging compiler errors (#7384)

## Motivation

This should make new builds with staging compiler pass.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 12:51:08 -07:00
Illia Silin
22b9feb40f [rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f)
[CK] Fix latest batch of staging compiler warnings (#7111)

## Motivation

Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 07:14:14 -07:00
Bartłomiej Kocot
9348b8eb82 [rocm-libraries] ROCm/rocm-libraries#5646 (commit 05680a4)
[CK_TILE] Add conv bwd data tests (#5646)

## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error-detection.

## Technical Details
Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3
dim channel-last) layouts are being tested, since only they are
implemented in CK Tile. Current tests support FP16, BF16 and FP32
datatypes and various different convolutions scenarios. The tested
instances are listed in
`experimental/grouped_convolution_tile_instances` directory.

## Test Result
All implemented tests are working properly and passing.

---------

Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
2026-04-21 21:49:19 +00:00
Bartłomiej Kocot
9e42e3b57f [rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1)
[CK] Fix unused param mask (#5856)

## Motivation

Compiler error caused by unused param mask.

## Technical Details

Skip tests using param mask in test loop.

## Test Plan

Current test improvements.

## Test Result

Passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 03:57:22 +00:00
JP-Fernando
0bad76b1de [rocm-libraries] ROCm/rocm-libraries#4415 (commit b3b4af7)
[CK] Remove duplicated XDL/WMMA tests (#4415)

## Motivation

When we started the RDNA4 support, the XDL instances were not supporting
WMMA instructions, so we duplicated some tests.

In this issue, we simplified most of the duplicated test files into
common test files.

## Technical Details

The following tests were unified:

- `batched_gemm`

- `batched_gemm_gemm`

- `gemm_add`

- `gemm_universal`

- `grouped_convnd_bwd_data`

The following tests were duplicated exactly, and copied into two files
with `_xdl` and `_wmma` suffixes. Now they are unified in one single
file without suffix:

- `gemm_multi_abd`

- `gemm_b_scale`

There is still an apparent duplication which is a special case, namely
`test_grouped_convnd_bwd_weight_interface_{suffix}` where `{suffix}` is
`xdl` or `wmma`.
However, the WMMA code relies on an old implementation, and is expected
to be removed in the future. In addition, it differs from the XDL
implementation significantly.
Therefore, it was decided to keep both files separate instead of
attempting any unification.

## Test Plan

`CMakeLists.txt` files were modified to support the new, unified tests.
In particular, testing was done for `gfx90a`, `gfx1201` and `gfx11`
architectures.

## Test Result

All tests passed successfully on all three tested architectures.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-02-25 23:22:05 +00:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
ApoorvaKalyani
53a1e4f551 Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3.

* Added first instance of bwd_data_wmma_v3(f16).

* Add support for bwd data in gridwise implementation

Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight

* Initial device implementation of bwd data

* Remove unused template parameters in device impl

* Add one instance for different layout

initial check of device implementation

* Add tests for splitk and for different layouts

* Appended more instances to wmma_v3_f16.

* Added conv_2d bf16 wmma_v3 instances.

* Added conv_3d_bf16 wmma_v3_instances.

* Added conv_3d_f16_wmma_v3_instances.

* Added SplitN test cases for wmma.

* Conv3d_bwd_data_scale_wmma_v3 instances.

* Conv3d_bwd_data_bilinear_wmma_v3_instances

* Renaming the device level instances file to common name , since it is defined for different DataTypes.

* Renaming the instances and fixing typo

* Added the test cases to regression test list

* NCHW support for wmma_v3

* Examples for bf16 and f16 bwd_data_wmma_v3

* Added transpose conditons for device impl

* fixing bugs

* Added the gemm_args array implmentation

* WIP debug conv bwd

* fix splitk

* Grouped gemm fix

* Update CmakeLists with EOF

* Added more instances for tests

* Fixed the run time error in examples and removed 3d conv examples.

* Fixed a typo.

* Updated CmakeLists to removed the 3d convultion deleted files

* Added print error statements for unsupoorted argument

* Added the merge conflict related changes

* Fixed compilation error

* Fixed the InstanceFactory duplication error.

* Removed the print statements and added logs to Arg function

* All the merge conflict related errors resolved

* Added d_tensor tests.

* Added the missing example types of wmm_v3

* Merge error fix

* Corrected the instance name

* Reverted the bias relu change

* Revereted the transpose load local change

* Updated the regression test list with bwd_data_scale

* Revert "Revereted the transpose load local change"

This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.

* Revert "Merge error fix"

This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.

* Reverting the local change

* Added merge error fix

* Build error fix due to merge conflicts

* Added bias_relu example for wmma_v3

* Modified the main method in dtensor tests

* Updated the dtensor tests to pick all the shapes

* Updated the dtensor test shapes.

* Updated the mem operations in tests.

* Added reference func

* Fixed typos in device impl

* Added new header file and modified the include file for 3d tests

* Renamed the test file and added reference func call.

* clang format fix

* Added ignore params

* Modified device impl and tests

* Removed debug print statements and updated dtensor test shapes

* Fixing merge conflicts

* Fixing more merge conflicts

* Fixed copyrights

* Updated the tuned instances to bilinear and scale.

* Adding tuned instances to vanilla wmma_v3

* Removed all unused instances and modified test layouts.

* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.

* Fix clang format

* Updated tuned f16/-genric instances

* Formatting the instances file

* Fixed copyrights and clang issues

* Nonsense commit to force git to force

* Removed the transpose instances

* Added verified genric instances

* Fixing namespace errors

* Added todo for failing shapes

* Formatting instance file

* Fix instance list formatting

* Removing unnecessary formats

* Renamed the common file

* Unification of xdl and wmma bwd_data tests

* Updated Cmake

* Added all layout types and deleted code.

* Updated Cmake to add the condition to all tests.

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2025-12-30 16:25:08 +01:00
Johannes Graner
bb8445dca8 [CK] Integrate GPU reference into ckProfiler for convolutions (#3379)
Refactor and integrate CK GPU references into ckProfiler.
- All convolution layouts and groupings supported for all three directions
- Unit tests verifying GPU and CPU reference is the same
- Support added to profiler (do_verification = 2 enables GPU reference)
- One profiler-based test per direction changed to GPU reference to demonstrate usag

Closes AICK-427
2025-12-18 07:59:45 +01:00
Johannes Graner
fe35ba5dac Add grouped convnd dataset tests for bwd_data, bwd_weight and make them parallel (#3380)
* Parallelization in dataset generation

* Parallelizable tests for fwd, bwd data, bwd weight with datasets

* .gitignore generated datasets

* Test parallelization script with round-robin GPU scheduling

* Parallelization updates to test generation and running

* Dataset paths relative to executable

* Update output from test generation

* Default to one GPU in test generation

* Add small dataset tests to Jenkins

* Update copyright lines

* Update test_data/generate_test_dataset.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Move trap disable

* Common get path function

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-15 13:38:25 +01:00
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Aviral Goel
f6c999bddb chore(copyright): update copyright header for test directory (#3265) 2025-11-22 19:38:27 -05:00
Ville Pietilä
7e44b845b5 Fixed handling of split-K autodeduce argument for grouped convolution (#3024)
* Fix handling of split-K autodeduce argument.

* Fix clang formatting.

* Test fix.

* Fix clang formatting.
2025-10-17 15:36:39 +03:00
linqunAMD
e78a897ec0 [CK] Add command option instance_index and param_mask to run partial ck test (#2889)
* [CK] Add command option instance_index and param_mask to run partial ck test

Many CK test are instance test. it will loop all instance in the instance library. It causes test often out-of-time if we run test on simulator/emulator.
This PR add option instance_index and param_mask to reduce the workload of instance test

instance_index: only run test 1 available instance with specified index.
param_mask: filter the embedded parameter with specified mask

* fix CI error

* fix clang format

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-09-30 08:24:40 -07:00
linqunAMD
f22740df82 Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12

* update cmake file

* fix all instance build (cmake)

* fix batched_gemm_gemm(cmake)

* rebase cmake files

* fix cmake build error

* remve CK_ENABLE_DYNAMIC_WARP_SIZE

* update cmake build error2

* fix gfx11 build

CK_USE_XDL is enabled on gfx11 and gfx12

* fix gfx10 build

* fix gfx11 error

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-09-15 10:59:25 -07:00
linqunAMD
321627aec5 Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724)
* Fix example

* fix build error

* update pk_i4 & moe test case

* fix all instance build (examples)

* fix batched_gemm_gemm (example)

* disable example_gemm_bias_softmax_gemm_permute on gfx11

* remove unnecessary disable gfx11

* update tests

* update tests2
2025-09-12 08:17:07 -07:00
Bartłomiej Kocot
bb4f471b09 Grouped conv bwd weight with grouped gemm (#2304)
* Grouped conv bwd weight with grouped gemm

* fixes

* fix

* Fixes

* test comments

* restore atol

* fix
2025-06-12 10:15:07 +02:00
Bartłomiej Kocot
8c1ed6f4c1 Move SetZero functions inside the kernels for Grouped Conv (#2255)
* Disable SetZero before launch kernel for grouped conv fwd

* Move set zero to kernel

* wmma fix

* fix

---------

Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
2025-06-11 23:41:03 +02:00
Bartłomiej Kocot
4094ad158a Integrate universal gemm with conv bwd data and add SplitK (#1315)
* Integrate universal gemm with conv bwd data

* Fix multi d kernel

* Add splitK support

* instances refactor

* instances refactor

* refactor

* fixeS

* fixes

* 16x16 instnaces

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Fixes

* fix

* fix
2025-04-28 23:54:49 +02:00
Bartłomiej Kocot
8c0ab61ece Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
c2e4898b4b Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example
2025-03-17 13:32:00 +01:00
Bartłomiej Kocot
261f1759de Support large batch tensors in grouped conv bwd data (#1711)
* Support large batch tensors in grouped conv bwd data

* Fix multiD

* fixes

* fixes

* fixes
2024-12-06 10:55:23 +01:00
Jun Liu
959073842c Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372) 2024-07-03 23:34:38 -07:00
Illia Silin
ae57e5938e Split the instances by architecture. (#1223)
* parse examples inside the add_example_executable function

* fix the example 64 cmake file

* add xdl flag to the gemm_bias_softmax_gemm_permute example

* add filtering of tests based on architecture type

* enable test_grouped_gemm for gfx9 only

* enable test_transpose only for gfx9

* only linnk test_transpose if it gets built

* split the gemm instances by architectures

* split gemm_bilinear,grouped_conv_bwd_weight instances by targets

* split instances by architecture

* split grouped_conv instances by architecture

* fix clang format

* fix the if-else logic in group_conv headers

* small fix for grouped convolution instances

* fix the grouped conv bwd weight dl instances

* fix client examples

* only enable client examples 3 and 4 on gfx9

* set the gfx9 macro

* make sure the architecture macros are set by cmake

* use separate set of xdl/wmma flags for host code

* sinmplify the main cmake file

* add conv_fwd_bf8 instance declaration
2024-04-02 09:42:17 -07:00
zjing14
1837040a9c Navi3 rel (#1176)
* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

* navi3x_multipleD+example

* temp save

* workable

* batchedgemm[OK], groupconv[debug]

* groupconv: Sanity check[OK], Performance[Bad]

* navi3x_groupconv_need_optimization

* create necessary files

* save progress

* Add Inter-Row thread transfer

* save progress

* save debugging progress

* sanity check pass

* fix a host tensor bug and clean up flash-attn code

* format

* cancel unnecessary change

* cancel unnecessary change

* cancel unnecessary change

* temp save, add asm backend flag to amd_wmma

* Mat-A LDS Bypass sanity pass

* temp save

* gemm sanity fix

* Porting new blockwise gemm to flash attention

* Example branch provide to compiler team

* tempsave

* Fix a bug

* batched gemm ported

* conv A-skip lds ported

* Skip B-Lds real gemm

* Skip B Lds Gemm + MulD

* batched gemm, conv, skip b lds

* format

* Attn, skip b lds

* Change GridwiseOp nam

* fix a typo caused bug

* Skip A_Lds sanity pass, Skip B_Lds scratch occured

* Bug found, intra-row permute off caused

* bug found

* a fix

* disable buffer load due to incorrect 3rd dword

* update fmha config, no scratch generated

* update 3rd dword

* fmha config update

* FMHA, add support to gfx1101/gfx1102

* Merge origin dev (#2)

* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af91.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Ractor instance generation, split into multiple cpp file

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a marco to turn on/off denorm fix (off by default) (#673)

* add a marco to turn off denorm fix by default

* expose the marco

---------

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912

* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* add vector load check

* solve conflicts

---------

Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

* Disable SkipLDS & Align AIT api (#3)

* fix layernorm, reduction Ops (#4)

* [Navi3x] Fix Gridwise_multiple_d operation (#649)

* Add CMake Option "USE_OPT_NAVI3X"

* fix bug

* standardize docs (#655)

* Separate bibtex requirement from rocm-docs-core (#656)

* separate bibtex requirement from rocm-docs-core

* point requirements to source rocm-docs-core repo

* Add CMake Option "USE_OPT_NAVI3X" (#647)

* Add CMake Option "USE_OPT_NAVI3X"

* remove navi3x opt compile option from cmake script

* Conv + quantization + tanh  (#645)

* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>

* Add a denorm test fix (#603)

* Add type_convert implementations for bf16

* Add the fix for conv_fwd

* Add the fix for conv_bwd_data

* Add the fix for conv_bwd_weight

* Format

* Format

* Another format

* Add a macro to use workaround on MI200 only

* Format

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

* simplify karg in device/grid of split-k op (#644)

* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

* fix 3rd dword of buffer source descriptor (#659)

* add fp64 instances (#658)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

This reverts commit bb5530af91.

* Groupnorm + swish external api (#668)

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Ractor instance generation, split into multiple cpp file

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

* add a marco to turn on/off denorm fix (off by default) (#673)

* add a marco to turn off denorm fix by default

* expose the marco

---------

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* fixed quant example (#672)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>

* Add dependabot config and pin rocm-docs-core (#663)

* [gtest] suppress unsafe buffer warn (#670)

ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912

* Add memory index guard in wmma device ops (#667)

* Add more macros to turn on/off denorm fix (#678)

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

* Fix a typo (#676)

* Add (#677)

* Allow using ROCm release candidate compilers. (#679)

* enable use of rocm5.5 release candidate 4

* upgrade to ROCM5.5 RC5

* try fix the PUB_KEY error, remove the cmake-data package

* upgrade to latest cmake version

* use private dockerhub repo for rocm5.5 rc5

* add missing bracket

* Disable SkipLDS & Align AIT api

* Update dependabot config (#682)

Co-authored-by: samjwu <samjwu@users.noreply.github.com>

* update attn api

* solve type_convert bug + enable

---------

Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>

* fix typo

* Fix attention with causal mask

* multiple fix, try ait compile

* Add A/B not use LDS pipeline

* Clang format, Add gfx1101, gfx1102 support of FMHA example

* cancel change of format script

* 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
2. Enable FP16 accumulator blockwise_gemm

* clang-format

* 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
2. change kernel timing mode to 50 warmup + 50 timed repeat

* Update low level abstration of blockwise gemm wmma

* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (4/5) grouped conv pass

* (5/5) attention pass, todo: debug lds perf bug

* AIT Attention API refactor (#8)

* sanity pass

* sanity pass 2

* confirm significant performance regression.

* turn on all instances

* turn off instance format

* Fix bug & tunning & format

* DML meta, self_attn+cross_attn

* sanity pass

* remove useless flag

* update tile and problem size used in AIT attention

* bug fix in grouped conv supporting check

* deprecate inline asm wmma

* Bug fix: double lds skip

* clang-format

* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array

* part2 of previous commit

* clang format

* API fix of gridwisegemmpipeline

* separate array base and vector base attention tensor transformation

* fix gemm

* clang format

* add gemm fp16 instances

* Temp save

* fpAintB kernel compile pass

* Sanity pass.

* Temp save

* debug code enabled

* Fp16AInt8B_GEMM sanity

* MQA implementation

* GQA-4 example

* tempsave

* Compile pass

* New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm

* format

* Todo: fix gemm_bilinear_wmma instances compilation bug

* Solve a bug when K1=16

* remove unnecessary changes

* Remove tensor layout limitation to LDS usage in tesnor contraction

* update self-attention and cross-attention

* fix a typo of name

* Add arch limiter for fp8 gemm

* enable fp8 gemm_xdl for all gfx9 targets

* temporarily disable gemm_xdl_fp16_fp8 on MI100/200

* fix the cmake logic for gemm_xdl_fp16_fp8

* re-enable the gemm_xdl_fp16_fp8 on MI100/200

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-03-08 17:11:51 -08:00
Bartłomiej Kocot
cb53874002 Add grouped conv bwd data wmma (#950)
* Add grouped conv bwd data wmma

* Fix copyrights

* Add instances with smaller NPerBlock

* Update interface test

* Minor stylistic fixes

* Minor stylistic fixes
2023-09-28 23:10:18 +02:00
Bartłomiej Kocot
595d23be14 Fix transform and instances for grouped conv bwd data (#848)
* Fix transform and instances for grouped conv bwd data

* Add instances for small K and small C

* Remove workaround after fix

* Fix interface tests
2023-08-22 11:25:41 -05:00
Bartłomiej Kocot
49180fd60b Grouped 3d conv backward data support (#799)
* Grouped 3d conv backward data support

* Fix comments
2023-07-18 11:01:33 -05:00
Bartłomiej Kocot
63388e84ab Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757)
* Support bf16/f32/f16 and NHWGC conv2d_bwd_data

* Add interface test

* clang format

* Comment fixes

* Add more friendly error message
2023-06-21 08:20:31 -05:00