[CK][CK Tile] Fix kbatch check in grouped conv and gemm
kernels (#5555)
## Motivation
Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.
## Technical Details
Round up K / Kperxdl and divide it by Kbatch to allow tail for K.
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Ck/joye/revert oob check
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Rename Stream-K grid function
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().
## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.
## Test Plan
Validated using the test suite that is already present.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Enable MXFP6 for MX GEMM op
## Motivation
Add support for MXFP6 in the MX GEMM op in CK-Tile.
Depends on https://github.com/ROCm/rocm-libraries/pull/4594
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] add tf32 support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.
## Checklist
Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run on all changed files
- [ ] Any dependent changes have been merged
## Discussion
CK Tile MX GEMM Packing Improvement
## Motivation
Reduce the scale loading size and also has better utilization of MFMA
scale selection.
## Technical Details
Add up the packing of mx scales.
## Test Plan
Use the existing test cases.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Eight Waves pipeline GEMM
## Motivation
Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM
## Technical Details
Summary:
- Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
- Use this block struct in ABQuant for encodings
- Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
- Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves
So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts
## Test Plan
Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).
Note: K padding test is disabled for this pipeline because it's not
implemented yet
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are
zero. (#5050)
## Motivation
It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.
## Technical Details
Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.
## Test Plan
Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.
## Test Result
All tests pass.
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Changed the include order of the new WMMA/MFMA unification
framework (#5241)
Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.
## Motivation
Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.
Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.
## Test Plan
The CI (that runs `remod.py`) should compile.
## Test Result
Existing CI should compile and be green.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on
gfx950 (#4368)
## Motivation
Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline
## Technical Details
The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.
Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
* both 32x32x64 and 16x16x128 scale MFMAs are supported
* Q and K scales are applied in hdim, V scales - in seqlen dimension
* column-major V only
* batch and group mode
* bias, Alibi (tested but no instances by default, just like fp8)
* masking etc.
Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008
## Test Plan
```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```
## Test Result
The tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] MX GEMM non-preshuffled RCR layout
## Motivation
Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification of sparse MFMA/WMMA policy structs
(#4837)
## Motivation
The existing unification work supports DENSE intrinsics. In this PR we
enable support for SPARSE as well as SCALE intrinsics and add an example
SPARSE implementation.
## Technical Details
Mostly trivial changes. One framework change is that the desired
`MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant
commit explains, we do not support a fallback family at the moment, but
it is something we can consider.
## Test Plan
Added a new test for the relevant sparse specializations.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add CK Tile bwd weight profiler
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.
## Technical Details
- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
[CK_TILE] Reduce Register Spills in Stream-K Reductions
(#4984)
## Motivation
In CK Tile Stream-K, kernels using one of two non-atomic reduction
strategies (i.e., linear, tree) have high register spill count, with the
tree reduction generally being worse. These changes act a first step to
help decrease the register spill count.
## Technical Details
### Problem 1: Unvectorized access to partials
In both the linear and tree reductions, workgroups write partials
results to a global buffer; another workgroup will later read this data.
When the initial logic to support reading and writing to the partials
buffer was added (see
https://github.com/ROCm/composable_kernel/pull/3107), the tile
distribution encoding used to read from and write to partials matches
the register layout for the accumulator of the mfma instruction used for
the kernel. Since we do not currently use the transposed register layout
for the accumulator, we end with an encoding that is not optimized for
writing to HBM.
For example: Consider the register layout of the
`v_mfma_f32_16x16x32_fp8_fp8` instruction.
```bash
./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix
```
<img width="1113" height="537" alt="image"
src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02"
/>
The above shows that threads are responsible for consecutive elements
down a column of the C tile. If we use this distribution to read and
write to partials with C in row major, then threads are unable to
perform vectorized reads and writes. Note: thread 0 is shown in red and
thread 1 is shown in green.
Since the C-shuffle Epilogue only supports C in row major, reading and
writing to partials is highly unoptimized.
### Problem 2: Missed opportunity for SPGR use in tree reduction loop
Since the reduction occurs between workgroups, all threads in the
workgroup follow the same execution paths in the tree reduction logic,
hence various variables should be using SGPRs, but they are not.
### Implemented Solutions
1. Add a new tile distribution encoding that is optimized for accessing
partials in HBM. This encoding does not change the data assignment to
threads, it merely changes the addresses to which they write/read in the
partials buffer. For example, continuing with the
`v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result
in threads writing in the following layout:
<img width="517" height="342" alt="image"
src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202"
/>
This layout ensures that each thread writes along a row, enabling
`buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses).
This helps reduce register usage due to requiring fewer offset
calculations.
2. To force SGPR usage in the tree reduction loop, I make use of CK
Tile's `amd_wave_read_first_lane` which is a wrapper around
`__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the
tree reduction.
_These changes do not fully eliminate register spills. Future work will
aim to further reduce spills. But these changes make good progress._
## Test Plan
Added tests for different warp tile sizes to validate that the new
encoding works with different `WarpGemm` variants.
## Test Result
All tests pass locally on all gfx9 architectures.
Some results for decreases in register spills on gfx942: (BL = baseline)
| Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % |
VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % |
|--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:|
| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 |
-100.0% | 21 | 20 | -1 | -4.8% |
| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3%
| 443 | 23 | -420 | -94.8% |
| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6%
| 12 | 6 | -6 | -50.0% |
| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9%
| 396 | 12 | -384 | -97.0% |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Cleanup and refactoring related to tile loading
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Cleanup and refactoring done while implementing mixed precision for
fp16/bf16 x fp8
Key changes:
- Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and
refactored the API to use consistent naming conventions
- Updated load_tile_transpose functions to use output parameters instead
of return values for consistency
- Removed unused variable declarations and simplified type deduction
logic
- Define load_tile_with_elementwise to use tuple types explicitly for
clarity
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK][CK Tile] Fix batched gemm kernel 2 lds
## Motivation
Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.
## Technical Details
Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.
## Test Plan
CI overall
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Tile Engine support for gfx950
## Motivation
This PR adds support for the gfx950 GPU architecture to the Tile Engine
in Composable Kernel library, focusing on GEMM operations with FP8 and
BF8 data types.
## Technical Details
Added gfx950-specific MFMA warp GEMM implementations with conditional
compilation.
Updated default GEMM configuration parameters for tile sizes and warp
configurations.
Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950
hardware.
## Test Plan
Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.
## Test Result
Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK TILE] Improve oob check
## Motivation
Improve OOB checks. Remove permutes which have been generated by thread
buffer zero clear. at now in assembly there is only condmask instead of
permute + condmask.
Change number of KPack for generated instances
## Technical Details
Remove permute instructions from assembly
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to
allow caller to pass desciptors directly (#4295)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create
tensor views from strides and sizes. This refactors the descriptor
creation out and add overloaded definitions, allowing descriptors to be
created separately by the caller instead of passing explicit strides,
with no functional changes.
This will enable further refactoring of `RunGemm` to do likewise,
enabling derived kernels like BatchedContractionKernel to avoid creating
separate versions (PR
[#3457](https://github.com/ROCm/composable_kernel/pull/3457)).
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
Since the logic within the MakeXBlockviews chains together operations on
tuples, and thus the descriptors are also passed as such, adding a
template parameter for the type of the input tuple was the simplest
option to enable the overload without too much verbiage. However, for
`MakeCBlockView` this adds a complications as the templated definitions
are prone to overlap. This for now is avoided by just moving the
arguments around for the descriptor version, which avoids the collision.
It's not a great solution, so feel free to suggest a better one.
[CK_TILE] Extend support of mix precision microscaling BQuant
(#4267)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Supported types combinations using BQuant=e8m0:
- A=bf16
- B=bf16,bf8,fp4
Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
- add support for scaling instructions in `DequantPack8`
- mx pipeline:
- extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
- block gemm:
- mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
- warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK_TILE] Update Stream-K Reduction Strategy Enum
## Motivation
Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.
## Technical Details
Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Reduction = 1u,
TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Linear = 1u,
Tree = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan
No new functionality was added, so no new tests were added; I just
validated existing tests and examples.
## Test Result
All tests passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Support gfx11
## Motivation
Add support of gfx11 architectures (RDNA3) to FMHA.
## Technical Details
Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).
Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.
Additional changes:
* FMHA tests are now build and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage resulting in incorrect
dispatching logic, sometimes tests report that there are no instance
available for current parameters.
* Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when
they are not requested (i.e. every time), likely has no effect on
performance but makes disassembly a bit clearer.
## Test Plan
```
ninja test_ck_tile_fmha
bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```
## Test Result
All tests must pass (some tests may be skipped).
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK TILE] Add has hot loop check for pipeline v1
## Motivation
Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.
## Technical Details
Hot loop has been executed without num loop check.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
Passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
[CK TILE] fix numerical errors of preshuffle_b
This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.
# The numerical issue
## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```
# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.
**Key changes include:**
### Data layout and shuffling logic
* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)
### Test infrastructure enhancements
* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK] CK Tile grouped convolution direct load
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
[CK_TILE] Add blockscale GEMM support for EightWarps on
gfx950 (#4280)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
gemm blockscale eightwarps support
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK_TILE] Add support and tests for V6 pipeline in conv fwd
(#4357)
Added support for conv v6 pipeline in ck tile's convolution forward
kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline
and should be faster than other pipelines for some cases. This PR also
adds tests inside profiler that's currently inside experimental
directory, so now we should be able to detect regressions easier.
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Change call to the intrinsics
* fix clang format
* Undo changes under include/ck/utility
* Use named variable as vector size
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Fix alignment issue in Stream-K workspace buffer
In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.
This change also re-enables the test_ck_tile_streamk_reduction tests.
* Compute reference GEMM on GPU for test verification to decrease testing time
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
* refactor: remove Default scheduler implementation as it not used anymore
* refactor: remove dead code from gemm universal kernel
* chore: add descriptive comments about amd intrinsic hardware sync instructions
* fix: label existing memory pipeline for aquant as intrawave
* [CK_TILE] unify double and single lds implementation (#108)
Unify LDS buffer management API for single and double buffering modes
This change consolidates the Local Data Store (LDS) buffer management by:
Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* update wp_pipeline
* fix a c++17 issue
* update for ci errors
* fix ci issues
* include a header to fix ci errors
* fix some rebase issues
* update with rebase
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
* fix for splitk if splitk < grid
* add different splitk implementation
* minor bugfix for streamk gemm
* Add test
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* CK Tile Stream-K Tree Reduction
This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.
Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.
Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.
* Move Stream-K kernel files to a subdirectory
* Cleanup Code Style & Handle Unsupported Reductions
This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable
* Run new copyright script on new files