[CK Tile] Add gemm universal preshuffle to MX GEMM (#5647)
## Motivation
Add gemm universal preshuffle support to existing MX GEMM pipeline.
The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.
## Technical Details
The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.
## Test Plan
Add `preshuffle` option to example: `tile_example_mx_gemm`.
Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.
## Test Result
Example and tests were successful on `gf950` architecture in the `Alola`
cluster.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)
## Motivation
Integrate Eight Waves pipeline in MX GEMM
## Technical Details
- EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
- Extend support of EightWaves policy for FP4 (packed types)
- Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
- CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
- Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
- Add configuration for EightWaves pipeline
- Tests:
- Unify existing FP8 and FP4 tests
- Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile
Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR
## Test Plan
Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Changed the include order of the new WMMA/MFMA unification framework (#5241)
Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.
## Motivation
Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.
Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.
## Test Plan
The CI (that runs `remod.py`) should compile.
## Test Result
Existing CI should compile and be green.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)
## Motivation
Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>