[CK_TILE] Grouped Convolution Backward Data Direct Load
(#6624)
## Proposed changes
Add Grouped Convolution Backward Data with Direct Load into
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.
Direct load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Currently correctness is checked only at
usage point, but should be extended to a standalone UT in the future.
[MIOpen][CK] Fix bwd weight conv test failures by disabling
one block-GEMM V5 instance for 3D convs (#6421)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Due to compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.
First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.
## Technical Details
The root cause for the test failures are the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have large tile size.
The V5 pipeline uses double register buffer that in combination with
large tile size causes high register pressure. The latest version of
compiler handles the register spillage incorrectly for `gfx90a`, which
cause the kernel to output incorrect results.
The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load for are divided into two groups
- Base instances
- Instances that result into high register usage (currently only one
instance - one that causes the test failures).
This division allows to disable only the V5 block-GEMM flavor of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.
Because MIOpen uses CK's type string (provided by method
`GetTypeString`) to identify the instances, the DB sync tests are
expected to unaffected since there are still the V2 block-GEMM instances
that result in the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests that are not part of the normal CK PR
build.
## Test Plan
Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.
## Test Result
Verified locally that the previously failing tests
`TestGroupedConvndBwdWeight3d/4.Test3D` and
`TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts
- 231 on `gfx90a`
- 233 on `gfx942`
and are currently passing. This confirms the expectation that two
instances per data type should be disabled on `gfx90a`.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Ville Pietilä <>
[MIOPEN] [CK] Revert "[CK] Disable test cases affected by
compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343
This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.
[CK] Disable compilation of problematic bwd weight conv
instances for gfx90a (#6343)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Due to compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
CK: Extract shared boilerplate from 47 gemm_quant test files
(#6323)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Depends on #6303
## Summary
Extract shared test boilerplate (includes, type aliases, test fixture
macros) from 47 `test_gemm_quant_*` files into a single
`test_gemm_quant_common.hpp` header. Each test file is reduced from ~50
lines of boilerplate to ~5 lines.
| Metric | Value |
|--------|-------|
| Files changed | 48 |
| Insertions | +413 |
| Deletions | −1,106 |
| **Net lines removed** | **−693** |
### What changed
| Before | After |
|--------|-------|
| 47 test files, each with ~50 lines of identical includes, type
aliases, and fixture macros | 1 shared header
(`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include +
params) |
### Readability assessment
A code realist review confirmed this change **improves readability**:
the 47 test files had identical boilerplate obscuring the only
meaningful content — the `GemmConfig` type alias and test dimensions.
After the refactoring, each file's unique configuration is immediately
visible, and adding a new test variant requires specifying only the
varying parameters instead of copying 50 lines.
### Cumulative cleanup series stats
| PR | Description | Net lines |
|----|-------------|-----------|
| #6300 | Remove 61 dead `#if 0` blocks | −2,648 |
| #6302 | Remove 41 commented-out dead code blocks | −2,861 |
| #6303 | Remove 4 orphaned files | −3,886 |
| This PR | Extract gemm_quant test boilerplate | −693 |
| **Total** | | **−10,088** |
CK: Remove 4 orphaned files with verified replacements
(~1,025 lines) (#6303)
Depends on #6302
## Summary
Remove 4 orphaned files that have verified replacements already in the
build.
| File | Reason | Replacement |
|------|--------|-------------|
| `test_gemm_pipeline_compiler.cpp` | Refactored into 13 smaller tests |
`_compv3`, `_compv4`, `_mem`, `_persistent`, etc. |
| `test_grouped_gemm_quant.cpp` | Refactored into 5 smaller tests |
`_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. |
| `..._f8_f8_f16_..._comp_default_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
Each deletion was verified:
- Original file is NOT in any CMakeLists.txt
- Replacement files ARE in CMakeLists.txt and actively compiled
- Content is fully covered by the replacement files
[CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used
in addition to MIOPEN_REQ_LIBS_ONLY (#5675)
## Motivation
When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.
## Technical Details
The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).
## Test Plan
Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.
## Test Result
The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Add BF16^3 support to grouped conv bwd weight: bilinear
and scale (#4591)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Until now, XDL grouped conv bwd weight for bilinear and scale only
supported bf16f32bf16.
Therefore, bf16bf16bf16 support should be added.
## Technical Details
Instances were added to the relevant files in
`library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/`
folder. In addition, `add()` functions were included in new files in
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/`
and
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/`
folders. The new .cpp files were also included in the `CMakeFiles.txt`
files of both folders.
## Test Plan
Execute `grouped_convnd_bwd_weight` tests to check execution on
different architectures.
The tests for bilinear and scale already include the tuple
`std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in
principle, there is nothing to modify in the tests themselves.
## Test Result
`gfx1201`: Tests passed.
`gfx1100`: Tests passed.
`gfx90a`: Tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Replace tuple value construction with tuple_element_t
type extraction [1A] (#5030)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
### Rationale
CK's device operation instance registration uses
`add_device_operation_instances` at ~1,850
call sites to register GPU kernel configurations. The existing
implementation constructs
`std::tuple` values just to extract their types via `decltype`, then
copy-constructs each
instance into `make_unique`. This is wasteful — only the types matter,
not the values — and
forces the compiler to instantiate the full `std::tuple` constructor and
`std::get` machinery
at every call site.
### What changed
- Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with
`std::tuple_element_t<i.value, TupleType>`, which extracts the type
directly without constructing any values
- Replace copy-from-default `make_unique<T>(value)` with direct default
construction `make_unique<T>()` — all CK device operation instances are
stateless structs with configuration encoded in template parameters
- Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to
enforce this contract at compile time with a clear error message
- Add Doxygen documentation for this high-traffic public API
### Value
- Eliminates unnecessary template instantiation of `std::tuple`
constructors and `std::get` across ~1,850 call sites
- Establishes a cleaner, more intention-revealing pattern for type-only
tuple usage
- The `static_assert` prevents silent breakage if a
non-default-constructible type is ever added
- No runtime behavior change — zero risk
### Files changed (9)
- `add_device_operation_instance.hpp`: Core pattern change
- 3 example files, 3 reduce instance headers, 1 convolution header, 1
profiler header
## Test plan
- [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce,
softmax, convolution)
- [ ] `static_assert` provides compile-time validation stronger than
runtime tests
- [ ] No runtime behavior change — stateless struct default construction
is identical to copy-from-default
- [ ] Compatible with both `std::tuple` and `ck::type_list` containers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Remove log spam for deprecated convolutions
## Motivation
The `CK_BUILD_DEPRECATED` flag guards legacy non-grouped convolution
instances, but both branches of every guard emit a `#pragma` message on
every build, adding noise without actionable information. According to
some recent testing, these non-grouped instances can outperform their
grouped replacements in certain configurations, so their continued
availability behind the flag remains valuable. This change removes only
the warning directives while preserving all guards and guarded code
paths.
## Technical Details
Removed all `#pragma` message lines referencing deprecated instances
from 25 convolution instance source files spanning conv1d_bwd_data,
conv2d_fwd, conv2d_bwd_data, conv2d_fwd_bias_relu,
conv2d_fwd_bias_relu_add, conv3d_bwd_data, grouped_conv3d_fwd,
grouped_conv3d_bwd_data, and grouped_conv3d_bwd_weight. The `#if
CK_BUILD_DEPRECATED` / `#else` / `#endif` preprocessor guards and all
guarded code remain unchanged.
## Test Plan
No functional change. The CK_BUILD_DEPRECATED conditional logic is
unmodified; only #pragma message directives were removed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK] Port non-grouped convolution instances to the grouped
kernels (#4875)
## Motivation
Port non-grouped convolution instances to the grouped kernels to
deprecated older non-grouped implementations.
## Technical Details
Add the same instances as non-grouped but using grouped kernel.
## Test Plan
test_grouped_convnd_fwd
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-724
Implement device_grouped_gemm_fixed_nk_bias for RDNA4
## Proposed changes
Summary:
- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests
**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Implement device grouped gemm fixed nk multi abd for
rdna4 (#4425)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Add support for grouped gemm multi ABD fixed NK. MR
## Technical Details
Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.
Fix epilogue selection in device implementation that caused unit test
failures
## Test Plan
Covered by added unit tests
## Test Result
CI successfully passing
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
173 implement device grouped gemm fixed nk for rdna4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using for WMMA.
The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.
At this stage, the focus is functional correctness and compatibility,
not performance tuning.
## Technical Details
- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.
## Checklist
Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run on all changed files
- [x] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
Revert "[CK] Add new fwd conv fp16/bf16 instances optimized
for unit group size." (#4652)
PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for
`gfx950` and they do not compile for other architectures such as
`gfx940`. To ensure that the optimized instances are compiled only for
`gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This
approach works correctly when we compile for a single architecture, but
when we compile simultaneously for multiple architectures, flag
`CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the
multi-arch compilation fails. The problem doesn't appear in the ROCm
libraries CI/CD pipeline since only one architecture is compiled at a
time. Hence, the CI/CD passed for the original PR.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK] Work around staging compiler lifetime warning
## Motivation
The staging compiler enables lifetime-safety warnings and we already
worked around a few of them.
This works around a few more instances that came up recently on gfx950
builds.
The initial PR that resolved most issues:
https://github.com/ROCm/composable_kernel/pull/3640
## Technical Details
This follows the pattern to locally ignore the newly added
lifetime-safety warnings that were moved from experimental to production
in upstream LLVM.
As a result, CK turned them on and treats them as errors, which prevents
the staging compiler from building CK.
## Test Plan
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Revert "[CK Conv] Add bwd weight instance for large-k shape"
(#4506)
Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be
investigated by @johannes-graner
[CK Conv] Add bwd weight instance for large-k shape
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
This instance improves the shape used in `./bin/ckProfiler
grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1
1 1 1 1 all` from 10.3 ms to 6.6 ms.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
* Adding remaining flavors for grouped conv fwd
As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu
* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.
* Do not build f8 / bf8 only flavor tests on RDNA3
* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.
* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.
* adding int8 and fp16 overloads to the elementwise operations
* fixed copilot nits
* Addressing review comments:
- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files
* clang-format
* reduced no of tests
---------
Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
* create a filter to build only libs required by hiptensor
* allow building libs for miopen and hiptensor at the same time
* tweak the lib filtering logic one more time
* Add multi AB support to wave transfer
* Improviments to multi ABD examples
* Add instances and use intrawave v1 instead of interwave
* Apply changes to other transfers
* Wave transfer: add support for multiple internal vgpr buffers
* Fix compilation error gfx11
* Add padding support with transpose
Also move check before writing storing is_src_valid during reading
* Add/modify instances to use wave transfer for gemm universal
Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer
* Fix clang format
* Modify example
* Fix bwd data
* Add restriction for wave transfer with padding and transpose
Add test case which shows this limitation
* Fix validity checks 8 bit types
* Add validity check gemm_bias_add_reduce
* Add validity check grouped gemm tile loop
* Fix validity checks new flavours
* Minor fixes
* Fix clang format
* Added bias_bnorm_clamp for WMMA conv fwd large tensor.
Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout.
- grouped_conv2d_fwd_bias_bnorm_clamp
- grouped_conv3d_fwd_bias_bnorm_clamp
* changed strategy to handle GemmArgs array
* Adding generic instance
* fixed last nits from reviewers and copilot
* Additional flavors for WMMA conv fwd large tensor
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accomodate extra tensors
* changed strategy to handle GemmArgs array
* Adding generic instance
* Added generic instance to clamp and bias_clamp ops
* Added bias_bnorm_clamp instances.
* fwd_bias_bnorm_clamp comp instances
* fwd_bias_bnorm_mem_inter and mem_intra instances
* fwd_bias_bnorm_merged_group_instances
* fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances
* Device level changes for fwd_bias_bnorm_clamp
* Added the test to the regression test list.
* Removed the part 2 and 2x instances
* Removed the irrelevant checks in wmma
* Refactored the instances to adapt to new device implementation
* Updated the reference and include files
* enabling tests
* Added missing profiler
* Added missing instance entry , deleted by mistake
* Reduce bias bnorm clamp instances to only a single generic one.
* Clean up cmakelists file
* clang-format
* Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up.
* Renaming some instance lists and add functions to be more standardized.
* Commented out non default instances.
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation
* wip: many fixes in implementation of batched gemm gemm multiple d
* wip: batched gemm gemm multiple d gridwise op compiling, not working yet
* fix: incorrect d0 grid indexing in batched gemm gemm multipled
* feat: add instances for batched gemm add relu gemm add
* chore: configure instance with low vector transfer size for odd sizes
* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense
* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes
* fix: disable odd size tests on XDL archs
* chore: removed temporary logging
* chore: update some references to C tensor to E tensor
* Tentative fix for example template params
* Tentative fix for non-multi-D batched gemm gemm device impl.
* Tentative fix for xdl example template params
* Tentative fix for profiler build on gfx90a
* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions
* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply
* fix: make the gemm1 data types match what happens in the device op
* feat: add d0s/d1s datatypes and layouts to the device op type string
* chore: change element-wise op so addition happens in fp32
* chore: add static asserts for gemm0/gemm1 calculated wave sizes
* chore: also updated other element-wise ops to use fp32 calculations
* chore: log number of supported instances
* chore: update instance comment
* chore: disable kernel timing in example by default
* fix: gemm1 wave size calculation
* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions
* chore: remove increased tolerance in batched gemm gemm multiple d example
* chore: add comment explaining that verification fails for certain input values
* chore: clarify instance comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions
Enables hipTensor to access the WMMA HW functionalities
for these combinations of datatype on gfx11 and gfx12.
* Fix change to contraction scale tests
* Fix clang-format
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)
* wip: device struct for WMMA batched contraction multiple d based on new gridwise op
* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases
* fix: failure to resolve template parameters when calling new function overload
* fix: passing reference type as parameter instead of underlying types
* fix: merge error caused duplicate definitions
* fix: make sure constness of template and parameters types match
* fix: don't compile batched contraction test on unsupported architectures
* feat: add example for new wmma implementation, and consolidate example code between platforms
* style: return inline instead of with branch
* chore: add extra assert on vector memory access sizes
* chore: clean up some unused variables
* fix: correct tail number calculation, added small cases and extra instances to the test
* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
- Add support for direct store in epilogue instead of cshuffle
- Add padding support for wave transfer without transpose
- Add wave transfer with interleaved layout to support direct store
- Enable new functionalities on GEMMs
- Add optional new functionality support for grouped convolution fwd
- Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)
* feat: grouped gemm tile loop support for RDNA4
* fix: removed extra parameter from grouped gemm example instance
* fix: FP8 check incorrectly enabling FP8 on RDNA3
* Implement grouped gemm fastgelu for RDNA4
* chore: some cleanup and minor inconsistencies in grouped gemm profiler
* chore: clarified logic and reporting of supported instance warnings
## Proposed changes
This source file won't build for gfx1101 on Windows. It builds successfully on other gfx110X architectures, and also builds successfully on gfx1101 on Linux.
This is the compile error:
```
[composable_kernel] FAILED: library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj
[composable_kernel] ccache B:\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_WMMA -DCK_USE_XDL -DDPP_KERNELS -DLLVM_MAIN_REVISION=524190 -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/include -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/include -IB:/build/ml-libs/composable_kernel/build/include -IB:/build/base/half/stage/include -isystem B:/build/core/clr/dist/include -DWIN32 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_WARNINGS -DNOMINMAX -fms-extensions -fms-compatibility -D_ENABLE_EXTENDED_ALIGNED_STORAGE -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class -Wno-ignored-attributes -Wno-unknown-attributes -Wno-duplicate-decl-specifier --hip-path=B:/build/core/clr/dist --hip-device-lib-path=B:/build/core/clr/dist/lib/llvm/amdgcn/bitcode -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=c++20 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -x hip --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 -MD -MT library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -MF library\src\tensor_operation_instance\gpu\grouped_conv3d_bwd_weight_bilinear\CMakeFiles\device_grouped_conv3d_bwd_weight_bilinear_instance.dir\wmma\device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj.d -o library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -c C:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp
[composable_kernel] error: Illegal instruction detected: Operand has incorrect register class.
[composable_kernel] V_CMP_NE_U32_e32 0, $src_private_base, implicit-def $vcc, implicit $exec
[composable_kernel] 1 error generated when compiling for gfx1101.
```
This appears to be a compiler bug and we'll follow up to get a proper fix landed, but for the purposes of landing some work to enable gfx1151 support in TheRock we'd like to disable building of these kernels on this architecture temporarily.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant after this pull request
- [X] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
* Added device level implementation for bwd_data_wmma_v3.
* Added first instance of bwd_data_wmma_v3(f16).
* Add support for bwd data in gridwise implementation
Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight
* Initial device implementation of bwd data
* Remove unused template parameters in device impl
* Add one instance for different layout
initial check of device implementation
* Add tests for splitk and for different layouts
* Appended more instances to wmma_v3_f16.
* Added conv_2d bf16 wmma_v3 instances.
* Added conv_3d_bf16 wmma_v3_instances.
* Added conv_3d_f16_wmma_v3_instances.
* Added SplitN test cases for wmma.
* Conv3d_bwd_data_scale_wmma_v3 instances.
* Conv3d_bwd_data_bilinear_wmma_v3_instances
* Renaming the device level instances file to common name , since it is defined for different DataTypes.
* Renaming the instances and fixing typo
* Added the test cases to regression test list
* NCHW support for wmma_v3
* Examples for bf16 and f16 bwd_data_wmma_v3
* Added transpose conditons for device impl
* fixing bugs
* Added the gemm_args array implmentation
* WIP debug conv bwd
* fix splitk
* Grouped gemm fix
* Update CmakeLists with EOF
* Added more instances for tests
* Fixed the run time error in examples and removed 3d conv examples.
* Fixed a typo.
* Updated CmakeLists to removed the 3d convultion deleted files
* Added print error statements for unsupoorted argument
* Added the merge conflict related changes
* Fixed compilation error
* Fixed the InstanceFactory duplication error.
* Removed the print statements and added logs to Arg function
* All the merge conflict related errors resolved
* Added d_tensor tests.
* Added the missing example types of wmm_v3
* Merge error fix
* Corrected the instance name
* Reverted the bias relu change
* Revereted the transpose load local change
* Updated the regression test list with bwd_data_scale
* Revert "Revereted the transpose load local change"
This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.
* Revert "Merge error fix"
This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.
* Reverting the local change
* Added merge error fix
* Build error fix due to merge conflicts
* Added bias_relu example for wmma_v3
* Modified the main method in dtensor tests
* Updated the dtensor tests to pick all the shapes
* Updated the dtensor test shapes.
* Updated the mem operations in tests.
* Added reference func
* Fixed typos in device impl
* Added new header file and modified the include file for 3d tests
* Renamed the test file and added reference func call.
* clang format fix
* Added ignore params
* Modified device impl and tests
* Removed debug print statements and updated dtensor test shapes
* Fixing merge conflicts
* Fixing more merge conflicts
* Fixed copyrights
* Updated the tuned instances to bilinear and scale.
* Adding tuned instances to vanilla wmma_v3
* Removed all unused instances and modified test layouts.
* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.
* Fix clang format
* Updated tuned f16/-genric instances
* Formatting the instances file
* Fixed copyrights and clang issues
* Nonsense commit to force git to force
* Removed the transpose instances
* Added verified genric instances
* Fixing namespace errors
* Added todo for failing shapes
* Formatting instance file
* Fix instance list formatting
* Removing unnecessary formats
* Renamed the common file
* Unification of xdl and wmma bwd_data tests
* Updated Cmake
* Added all layout types and deleted code.
* Updated Cmake to add the condition to all tests.
---------
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.
* Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build erros
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastegelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / workarounded i8 instances
* Modified the v3 code to added one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into camkelists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to ad BF16 instances as per review comment
* Added include file and cleaned up(as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix billinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated thge profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* multi_abd wmma support:
- Add multiple A and B support to multiple D implementation (gridwise level)
- Add multi_abd GEMM (device level)
- Add instances (xdl parity)
- Add tests (both xdl and wmma)
- Add examples
- Add ckProfiler support (both xdl and wmma)
* Fix bug in device print function
* Fix unused template parameter
* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.
* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.
* Make relevant profilers print the number of valid instances to aid testing.
* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.
* Reset output buffer after each run in profile_grouped_conv_fwd_impl().
* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.
* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.
* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.
* Add int8 instances for 2D vanilla grouped conv fwd all layouts.
* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd
* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.
* Implement multi-AB support for grouped conv fwd and add example.
* Add 1D instances
* Add D layout tests to IsSupportedArgument()
* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.
* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.
* Add some more tests for vanilla grouped conv fwd
* Add 2D bias clamp instances and tests
* Add 3D bias clamp instances and tests
* Add 2D and 3D clamp instances and tests
* Unify problem sizes across vanilla and clamp flavor tests
* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.
* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.
* Remove wmma fp8 and bf8 instances when not targetting gfx12
* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build
* Remove old years for newly created files.
* No need to time kernels for now.
* Fixup comments
* Pass struct args to Gridwise Run() function by reference.
* Don't use workspace memory in the case where A needs explicit transposition but B does not.
* Move calculation of rotating memory buffer sizes to Argument member functions.
* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.
* Unify xdl and wmma example code for grouped conv fwd scaleadd ab
* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.
* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.
* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.
* Small post-merge fixup, everything seems to work.
* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.
* Extend scaleadd_ab instance lists
* Extend merged groups instance lists, including adaptations of xdl "2x" instances.
* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.
* Extend "mem" instance lists.
* Extend regular instance lists.
* Fixup comments and ignored kernel arg name
* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.
* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.
* Re-enable sharding for wmma cshufflev3 instances
* Post merge fix to vanilla test
* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.
* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.
* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.
* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.
* Re-enable old wmma instances
* Re-enable Linqun's Xdl Wmma instances
* Small post-merge fixes
* Fix copyright headers
* Remove commented code snippet in gridwise
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.
* Adding tuned instace list for groupoed conv fwd (#3288)
Following flavors are updated with tuned instance list:
- grouped_conv2d_fwd
- grouped_conv2d_fwd_bias_clamp
- grouped_conv2d_fwd_clamp
- grouped_conv3d_fwd
- grouped_conv3d_fwd_bias_clamp
- grouped_conv3d_fwd_clamp
- grouped_conv3d_fwd_scaleadd_ab
Re-factored instance selection:
- removed all the unnecessary instance tuples (comp/mem/16x16/generic)
- removed all unnecessary layouts and data types
* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.
* Splitting grouped conv fwd instances (#3449)
* Disable unnecessary and failing tests related to experimental CK builder
* Disable unnecessary ck builder experimental tests fully
* Adding extra flavors for grouped conv fwd
As titled. Following variants are added:
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_scale
* fix cmake error
* Fix failing int8 test for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>