[CK] Optimize multi-dimensional static for loop decomposition
(#4447)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Recursive template implementations might initially seem attractive to
minimize necessary coding.
Unfortunately, this style is often affects readability and requires
significant resources from the compiler to generate instantiation
chains. In "high-traffic" code (e.g., used in many places + compilation
units), this generally does not scale well and can bloat the overall
compile times to unnecessary lengths.
The aim of this PR is to take some of most high-traffic utility code and
try our best to eliminate recursive templates in favor of fold
expansions and constexpr function helpers.
In local tests with clang build analyzer,
device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp
showed high hit-rates on slow template instantiations in static_for,
dimensional static_for (static_ford), which are subsequently affected by
implementation of the Sequence class and associated transforms.
Example:
**** Templates that took longest to instantiate:
70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1... (372 times, avg 188 ms) // **70 seconds!**
The above is part of the implementation of static_for which uses
Sequence classes..
## Technical Details
### Summary of Optimization Techniques
| Technique | Used In | Benefit |
|-----------|---------|---------|
| __Constexpr for-loop computation__ | sequence_reverse_inclusive_scan,
sequence_map_inverse | Moves O(N) work from template instantiation to
constexpr evaluation |
| __Pack expansion with indexing__ | sequence_reverse, Sequence::Modify
| Single template instantiation instead of recursive |
| __Flat iteration + decomposition__ | ford, static_ford | O(1) template
depth instead of O(N^D) |
| __Pre-computed strides__ | index_decomposer | Enables O(1)
linear-to-multi-index conversion |
### Impact on Compile Time
These optimizations reduce template instantiation depth from O(N) or
O(N^D) to O(1), which:
1. Reduces compiler memory usage
2. Reduces compile time exponentially for deep instantiation chains
3. Enables larger iteration spaces without hitting template depth limits
## Test Plan
* Existing tests for Sequence are re-used to affirm correctness
* Unit tests for ford and static_ford are added (dimensional looping)
* 8 new regression tests specifically verify the fixes for the PR
feedback:
- `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford
- `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford
- `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford
consistency
- `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford
- `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford
- `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford
- `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford
- `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with
non-trivial ordering
## Test Result
### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized)
#### Commits in Range (8 commits)
1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and
sequence_reverse
2. `7a7e3fdeef` - Optimize sequence_map_inverse
3. `92855c9913` - Optimize ford and static_ford calls to eliminate
nested template recursion
4. `88a564032b` - Add unit tests for ford and static_ford
5. `1a0fb22217` - Fix clang-format
6. `8a0d26bddf` - Increase template recursion depth to 1024
7. `dc53bb6e20` - Address copilot feedback and add regression tests
8. `477e06861d` - Increase bracket depth to 1024
#### Build Timing Results
| File | Base (8b72bc8759d9 | HEAD(a0438bd398) | Improvement |
|------|------|------|-------------|
| grouped_conv2d_fwd (f16) -j1 | 313.31s | 272.93s | __12.9% faster__ |
| grouped_conv1d_fwd (bf16) -j1 | 79.33s | 68.61s | __13.5% faster__ |
| grouped_conv1d_bwd_weight (f16) -j1| 15.77s | 14.31s | __9.2% faster__
|
| device_grouped_conv2d_fwd_instance -j64 | s | s | __% faster__ |
#### Key Optimizations
1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1)
template depth
2. __sequence_map_inverse__: O(N) → O(1) template depth
3. __ford/static_ford__: O(N^D) → O(1) template depth using flat
iteration with index decomposition
4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial
orderings
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Revert "[CK Conv] Add bwd weight instance for large-k shape"
(#4506)
Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be
investigated by @johannes-graner
[CK] Optimize vector type build times
**Supercedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI
issues on import**
## Proposed changes
Build times can be affected by many different things and is highly
attributed to the way we write and use the code. Two critical areas of
the builds are **frontend parsing** and **backend codegen and
compilation**.
### Frontend Parsing
The length of the code, the include header tree and macro expansions all
affect the front-end parsing time.
This PR seeks to reduce the parsing time of the dtype_vector.hpp
vector_type class by reducing redundant code by generalization.
* Partial specializations of vector_type for native and non-native
datatypes have been generalized to one single class, consolidating all
of the data initialization and AsType casting requirements into one
place.
* The class nnvb_data_t_selector (e.g., Non-native vector base dataT
selector) class has been removed and replaced with scalar_type
instantiations as they have the same purpose. Scalar type class' purpose
is already to map generalized datatypes to native types compatible with
ext_vector_t.
### Backend Codegen
Template instantiation behavior can also affect build times. Recursive
instantiations are very slow versus concrete instantiations. The
compiler must make multiple passes to expand template instantiations so
we need to be careful about how they are used.
* Previous vector_type classes declared a union storage class, which
aliases StaticallyIndexedArray<T,N>.
```
template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
using d1_t = T;
typedef T d2_t __attribute__((ext_vector_type(2)));
typedef T d4_t __attribute__((ext_vector_type(4)));
using type = d4_t;
union
{
d4_t d4_;
StaticallyIndexedArray<d1_t, 4> d1x4_;
StaticallyIndexedArray<d2_t, 2> d2x2_;
StaticallyIndexedArray<d4_t, 1> d4x1_;
} data_;
...
};
```
* Upon further inspection, StaticallyIndexedArray is built on-top of a
recursive Tuple concatenation.
```
template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
using type =
typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};
```
This union storage has been removed from the vector_type storage class.
* Further references to StaticallyIndexedArray have been replaced with
StaticallyIndexedArray_v2, which is a concrete implementation using
C-style arrays.
```
template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
...
T data_[N];
};
```
### Fixes
* Using bool datatype with vector_type was previously error prone. Bool,
as a native datatype would be stored into bool ext_vector_type(N) for
storage, which is a packed datatype. Meaning that for example,
sizeof(bool ext_vector_type(4)) == 1, which does not equal
sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4> == 4. The
union of these datatypes has incorrect data slicing, meaning that the
bits location of the packed bool do not match with the
StaticallyIndexedArray member. As such, vector_type will use C-Style
array storage for bool type instead of ext_vector_type.
```
template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));
sizeof(NativeVectorT<bool, 4>) == 1 (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0
sizeof(StaticallyIndexedArray[NativeVectorT<bool, 1>, 4] == 4 (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element1 = bit 0 of byte 2
element1 = bit 0 of byte 3
union{
NativeVectorT<bool, 4> d1_t;
...
StaticallyIndexedArray[NativeVectorT<bool,1>, 4] d4x1;
};
// union size == 4 which means invalid slicing!
```
* Math utilities such as next_power_of_two addressed for invalid cases
of X < 2
* Remove redundant implementation of next_pow2
### Additions
* integer_log2_floor to math.hpp
* is_power_of_two_integer to math.hpp
### Build Time Analysis
Machine: banff-cyxtera-s78-2
Target: gfx942
| Build Target | Threads | Frontend Parse Time (s) | Backend Codegen
Time (s) | TotalTime (s) | commitId |
|---------------|---------|-------------------------|--------------------------|---------------|
[CK][CK TILE] Add has hot loop check for pipeline v1
## Motivation
Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.
## Technical Details
Hot loop has been executed without num loop check.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
Passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
[Conv] Add NumGroupsToMerge to BwdWeight type string
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Add parameter to bwd weight V3 type string showing the number of groups
to merge. This is required for MIOpen to be properly tuned since it uses
type strings for performance database entries.
In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK TILE] fix numerical errors of preshuffle_b
This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.
# The numerical issue
## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```
# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.
**Key changes include:**
### Data layout and shuffling logic
* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)
### Test infrastructure enhancements
* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Add padding to cshuffle epilogue to avoid bank conflict
(#4274)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Added the padding to CShuffle Epilogue to avoid the bank conflicts of
64. Synced up and learned from the internal repo.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK][CK Tile] Temporary disable grouped conv fwd tile comp
async instances (#4457)
## Motivation
[CK][CK Tile] Temporary disable grouped conv fwd tile comp async
instances due to the failures
## Technical Details
disable configs to not comple these instances
## Test Plan
test_grouped_convnd_fwd_Tile
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[Azure External CI] Disable Azure CI on rocm-libraries
(#4460)
- Deleting all pipeline trigger files tied to Azure External CI from
top-level and project-level.
Add a README.md file to ck/library/util
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
I'm collecting information about our current testing (#3664). As part of
this work I a README to the directory to emphasize the GPU-first testing
strategy and our support for type-specific tolerances.
This readme contains internal code comments for CK developers and does
not need ROCm documentation review.
Adapt parser to monorepo
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Addressing issues found trying to run the dependency parser on MIOpen:
- Ninja is recording the full path, e.g.: [json]
```
"file_to_executables": {
"/home/rspauldi/repos/rocm-libraries/projects/miopen/include/miopen/miopen.h": [
```
- Running git in monorepo reports the full _relative_ path, e.g.:
```
"projects/miopen/include/miopen/miopen.h"
```
Of course, `git diff` also returns all files modified in every other
project's commits. These are filtered out as early as possible.
This solution searches for `rocm-libraries` in the `parsing` step, and
if found extracts the project name and stores it in
`enhanced_dependency_mapping.json`. Leading folders are truncated from
each file path, up to and including the project name. This allows
`_is_project_file` to remain unchanged.
The `selection` step then retrieves the project name from the json if it
is defined, and truncates the project folder from the `git diff` output
so the filenames exactly match the json entries.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
Successfully runs on rocm-libraries MIOpen PRs and produces a list of
tests. I haven't verified the results yet.
This version is not applicable to CI since it operates on a
per-executable level and MIOpen CI uses the single gtest binary. I'll be
working towards that in future PRs over the next few weeks.
```
/home/rspauldi/repos/rocm-libraries/projects/miopen# git checkout miopen/sgundabo_enable_ck_bwd_wrw_navi
<run CMake with TEST_DISCRETE=ON>
# ninja tests
# root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py parse build/build.ninja
Parsing ninja dependencies from: build/build.ninja
Parsing ninja build file...
Found 312 executables
Found 820 object-to-source mappings
Found 820 object files
Extracting detailed dependencies for all object files...
Processed 100/820 object files...
Processed 200/820 object files...
Processed 300/820 object files...
Processed 400/820 object files...
Processed 500/820 object files...
Processed 600/820 object files...
Processed 700/820 object files...
Processed 800/820 object files...
Completed dependency extraction for 820 object files
Building file-to-executable mapping...
Found rocm-libraries project: 'miopen'
Built mapping for 608 files
Files used by multiple executables: 216
Sample files with multiple dependencies:
build/include/miopen/config.h: 306 executables
build/include/miopen/export.h: 306 executables
build/include/miopen/export_internals.h: 304 executables
driver/InputFlags.hpp: 2 executables
driver/driver.hpp: 2 executables
=== Enhanced Dependency Mapping Summary ===
Total executables: 312
Total files mapped: 608
Total object files processed: 820
File types:
.cpp files: 310
.hpp files: 292
.h files: 6
Files used by multiple executables: 216
Top files with most dependencies:
build/include/miopen/config.h: 306 executables
build/include/miopen/export.h: 306 executables
include/miopen/miopen.h: 304 executables
src/include/miopen/config.hpp: 304 executables
build/include/miopen/export_internals.h: 304 executables
src/include/miopen/rank.hpp: 303 executables
src/include/miopen/errors.hpp: 302 executables
src/include/miopen/object.hpp: 302 executables
src/include/miopen/returns.hpp: 302 executables
src/include/miopen/sysinfo_utils.hpp: 302 executables
Exporting mapping to build/enhanced_file_executable_mapping.csv
Exporting complete mapping to build/enhanced_dependency_mapping.json
Results exported to:
CSV: build/enhanced_file_executable_mapping.csv
JSON: build/enhanced_dependency_mapping.json
root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py select build/enhanced_dependency_mapping.json 1b13d8b72d54e34bdc7ae70dd2b6e809dca8b10e 09e5965d55ebbfacfd1ed18e5092580c2ffae748
Identified 30 files modified in project 'miopen'
Exported 304 tests to run to tests_to_run.json
```
I don't know if clang-format applies to scripts. If so, could someone
show me how to run it in CK?
[CK Conv] Add bwd weight instance for large-k shape
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
This instance improves the shape used in `./bin/ckProfiler
grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1
1 1 1 1 all` from 10.3 ms to 6.6 ms.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant
pipeline (#4268)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation
## Motivation
This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers
to compile across multiple GPU architectures by introducing compile-time
availability gating and a new attribute tag mechanism for kernel
symbol/attribute specialization.
## Technical Details
- Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline
and propagate availability handling into `QuantGemmKernel`.
- Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling
per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and
unique symbols.
- Update the blockscale GEMM quant example to pass kernel attributes and
adjust gfx950 gating.
## Test Plan
- CI
- Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950'
-GNinja && ninja tile_example_gemm_quant`
- Local test with ROCm/aiter#1954
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tools] Auto-enable unbuffered output for Python commands
(#4265)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
ck-docker exec and ck-exec now automatically detect Python commands and
set PYTHONUNBUFFERED=1 to enable live output streaming. This eliminates
the need to manually set the environment variable when running Python
scripts that print progress updates.
The detection matches python, python3, or any .py file argument.
This helps in watching live terminal output when a python script is
running inside the container.
[CK] CK Tile grouped convolution direct load
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
Enable group mode (varlen) kernel generation for PyTorch
integration (#4292)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
This PR enables group mode (variable-length attention) kernel generation
for PyTorch's CK SDPA backend.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
The change is minimal (single line deletion) but enables a significant
feature: variable-length attention support for ROCm users via PyTorch's
torch.nn.attention.varlen API.
[CK] Fix grouped conv fwd transform for merged groups
## Motivation
[CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.
## Technical Details
After optimizations for 2d there is a lack of implementation for 1d and
3d
## Test Plan
test_grouped_convnd_fwd
## Test Result
pending CI
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] MICI: Disable failure pattern checking
## Motivation
- ck mici jobs hanging at end, possibly at failure pattern checking
## Technical Details
- Disable failure pattern checking to see if hanging goes away
## Test Plan
- Observe behavior after merge
[CK_TILE] Add blockscale GEMM support for EightWarps on
gfx950 (#4280)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
gemm blockscale eightwarps support
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK_TILE] Add support and tests for V6 pipeline in conv fwd
(#4357)
Added support for conv v6 pipeline in ck tile's convolution forward
kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline
and should be faster than other pipelines for some cases. This PR also
adds tests inside profiler that's currently inside experimental
directory, so now we should be able to detect regressions easier.
[CK_TILE] Fix MMA concepts compiler error
## Motivation
CK Tile is required to support certain older OSs; on these OSs, cpp 20
is not fully supported. For ROCm 7.2, compiler errors occur on one of
these older OSs. An example of this error is as follows:
```bash
/composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
34 | { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
|
```
The goal of this PR is to resolve these compiler errors.
## Technical Details
The existing guards around the mma concepts only check if the concepts
language feature is supported, as follows:
```cpp
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
// ...
template <typename CtrlFlags>
concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
// Flag members for Gfx9 MFMA instructions
{ CtrlFlags::Cbsz } -> std::convertible_to<int>;
{ CtrlFlags::Abid } -> std::convertible_to<int>;
{ CtrlFlags::Blgp } -> std::convertible_to<int>;
};
#endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
```
That said, in cases where functionality from the `<concepts>` header is
used (e.g., `std::convertible_to`), this guard fails to check whether
the `<concepts>` header is available.
This change adds an additional check to the concepts that make use of
functionality from the `<concepts>` header to ensure the header is
available.
## Test Plan
I tested the changes on the relevant docker for gfx90a, gfx950, and
gfx942 and the compiler issue is not present.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
=?UTF-8?q?Increase=20tolerance=20for=20FP16=20GEMM=20test?=
=?UTF-8?q?s=20to=20handle=20non-deterministic=20ro=E2=80=A6=20(#4335)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
…unding
Three tests were failing intermittently with small errors (0.01-1.5%)
due to non-deterministic FP16 accumulation order from GPU thread
scheduling:
- test_ck_tile_batched_gemm
- test_ck_tile_grouped_gemm_preshuffle
- test_ck_tile_grouped_gemm_multi_d
These tests use kbatch=1 (no split-K), so errors are from
order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
2e-3 (0.2%) to account for FP16 precision limits while still catching
real bugs.
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK] Workaround blockscale wp test failure
## Motivation
Workaround to fix blockscale wp test failure for pipeline v3
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] fix path for build filter
## Motivation
Fix the filter that determines whether CI builds are necessary.
## Technical Details
A script checks the files list returned by git diff and checks whether
any code source was modified. If not, if only documentation was changed,
it will allow skipping the builds. We make sure we only look at the
changes in projects/composablekernel/ folder.
[CK] a bunch of CI fixes.
## Motivation
Fixing some of the CK CI issues
## Technical Details
fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
[CK] MICI: Fix git diff in selective_test_filter.py
## Motivation
- git diff needs access to reference repo
## Technical Details
- mount reference repo path into docker for selective_test_filter.py to
access
## Test Plan
- tested in MICI
## Test Result
- launch_tests.sh ran successfully
[composablekernel] fix failure status
## Motivation
Pipelines were failing on Math CI status check.
## Technical Details
For the success case, we just changed the config in Jenkins to use a
proper app token and no code changes were required. However, the failure
case would not have worked as coded, so we needed to move that outside
of the `rocmnode()` block.
## Test Plan
I removed all of the CI in one of the commits to quickly test, and then
added it back. Got a successful "success" message and "failure" message
produced
[CK] MICI: Correct path for build trace script
## Motivation
- Corrects path to script due to superrepo migration
- Forces all tests to run by default
## Technical Details
- now in /projects/composablekernel
[CK] MICI: Use reference repo for checkout operations
## Motivation
- Maintain a reference repo on slave nodes that speeds up any
clone/checkout operations
## Technical Details
- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone
## Test Result
- Initial checkouts succeeded
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with
computations
## Proposed changes
Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
CK CI migration.
## Motivation
Enable the CK CI after migration from standalone repo.
## Technical Details
Modify the jenkinsfile in projects/composablekernel to update the CI
workflow.
## Test Plan
This is for CK internal testing only.
## Test Result
Set up new CK CI pipeline/dashboard.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* Added two new failure patterns to detect. Including test function to verify if the patterns are detected
* Modifying pattern match to detect docker login failure. Removed passing tests.
* Removing passing tests. Modifying docker pattern to detect failure
* Removed passing tests
* Removing test logging function
* Stream-K smoke test config file generation
This change converts the stream-k smoke tests to use tile engine. Since
the m, n, and k values dependent on the CU count of a device, the
configs are generated during the Configuration Phase.
* Compute GEMM reference on GPU
* Remove redundant Stream-K tests
Removing redundant tests that are now run via tile engine.
* Fix relative and absolute tolerance calculation
This change updates the Stream-K tile engine interface to ensure that
num_wgs_per_tile is propaged and passed into the compare_results
function to calculate the rel and abs tolerance. Before, split-k was
used, which is incorrect for Stream-K since the split-k value is
always 1.
* Cleanup imports, types, and other misc items
This commit makes the following changes:
- Uses Typing module for nested type hints
- Uses quotes around cu_count_arg argument in generate_configs.cmake in
if statements
- Adds explicit include for tuple in test_gemm_streamk_simple.cpp
- Adds a type for the tiles argument in argparser to check argument
validity
* Use CU count as return value for better parsing
* Add reduction tests for bf16, fp8, and bf8
Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.
Co-authored-by: Claude <noreply@anthropic.com>
* WIP: add splitk to bquant
* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types
* chore: remove temporary test script
* fix: incorrect tile window length for splitted bq tensor window
* chore: improve comments
* test: add unit tests to cover bquant splitk functionality
* fix: conflict resolution by renaming variables
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Change call to the intrinsics
* fix clang format
* Undo changes under include/ck/utility
* Use named variable as vector size
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>