[CK Tile] Async support pipeline V3
## Motivation
Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)
## Technical Details
- Add `Async` bool to `Problem` struct to enable async pipeline in
existing one
- Add `static_move_ys` to load transpose. This generates offset in
assembly instructions saving registers
- Add `is_valid` to `async_get_vectorized_elements`. Before hard coded
to true. It allows to support padding
- Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr`
(wider use of lds load transpose on gfx950)
- Integrate async support in existing V3 pipeline (avoid pipelines
duplication)
- Create policy to support both async and default cases. This could be
used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows to
optimize ds_read instruction width for cases when one matrix is
transposed and the other is not. Before in such cases, `ds_read_b64` was
used instead of `ds_read_b128`
- Add test for V3 async. Currently only supporting cases with A and B
having the same type
## Test Plan
New test `test_ck_tile_gemm_pipeline_compv3_async`
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] EightWaves pipeline int8 support
## Motivation
EightWaves pipeline currently is supporting only FP types
## Technical Details
- Enable 16x16x64 int8 instruction for gfx950 in dispatcher
- Enable int8 in EightWaves pipeline
- Add tests
- Fix bug in `warp_gemm_attribute_mfma_impl.hpp`
## Test Plan
Tests have been added for int8 GEMM using EightWaves pipeline
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to
MX GEMM TDM pipeline (#8202)
Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM
pipeline, building on the scale16 warp-gemm layer already in develop.
- **warp gemm:** add the 32x32x128 f8f6f4 scale16 traits and alias (2x2
grid of 16x16x128 scale16 intrinsic calls with per-subtile
`SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's
`IsScale16` path.
- **default policy:** select the warp gemm via the dispatcher with
`IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each
pick the matching scale16 path; guard WarpTile M/N to 16 or 32;
scale-tile distribution for the scale16 layout.
- **pipeline V1/V2:** thread `Problem::ScaleBlockSize` through the
scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize`
for the kernel.
- **block gemm:** extract int64 (scale16) / int32 (scale32) scales by
width.
- **kernel:** scale16 descriptor order; reject unsupported
`BlockScaleSize`.
Test coverage for this path is in the stacked follow-up PR.
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
=?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
This PR adds two specialisations related to `tf32_t`.
## Technical Details
This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.
## Test Plan
All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Async support preshuffle GEMM
## Motivation
Add async support to existing preshuffle GEMM pipeline
## Technical Details
Notes:
the implementation avoids previous strategy of duplicating pipelines for
async support and instead add a switch `Async` to the ops Problem to
enable async pipeline. Then, integrate the async pipeline in the
existing one. This allows to avoid code duplication and facilitate the
integration of buffer load to lds in existing pipelines. In my opinion,
it should be used also for other pipelines which don't support buffer
load to lds yet and it would also be a good idea to refactor the
existing async GEMM pipelines with the same approach.
Summary:
- integrate buffer load to lds in existing pipeline
- add optimal tensor descriptors for vmem loading and lds reading. They
are currently optimized for 16x16 wave tiles but they also work for
32x32 wave tiles. Optimizations for 32x32 wave tile requires different
lds layout and it will be done in a follow-up issue
- Add async config to examples
- Add test (gfx950 only)
## Test Plan
New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async`
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC
## Motivation
Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.
## Technical Details
The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.
## Test Plan
Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.
To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.
## Test Result
The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
[CK Tile] Add gemm universal preshuffle to MX GEMM (#5647)
## Motivation
Add gemm universal preshuffle support to existing MX GEMM pipeline.
The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.
## Technical Details
The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.
## Test Plan
Add `preshuffle` option to example: `tile_example_mx_gemm`.
Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.
## Test Result
Example and tests were successful on `gf950` architecture in the `Alola`
cluster.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)
## Motivation
Integrate Eight Waves pipeline in MX GEMM
## Technical Details
- EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
- Extend support of EightWaves policy for FP4 (packed types)
- Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
- CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
- Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
- Add configuration for EightWaves pipeline
- Tests:
- Unify existing FP8 and FP4 tests
- Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile
Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR
## Test Plan
Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] add composable kernel support on gfx1250 (#6978)
## Motivation
Add composable kernel support on gfx1250.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209)
## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.
## Changes:
- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350
## Performance results
1. For attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by
qr_ks_vs_async)
2. Also on MI300, for attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than the
`qr_ks_vs_async` (which is supposed to be very high-efficient) on the
same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async)
3. For attention shapes which leads to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better
performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows
completely on-par performance on MI300
## Test/Verify
1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can
not be used by ck_tile fmha example so far
2. Use the following command-line for building/testing xformers
>```bash
> #> git clone -b test_whole_k_prefetch_n0loop
https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
>```
4. Any scripts which can run on xformers can be used to evaluate
qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to
switch from using different pipelines
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3
and qr_ks_vs_async_trload pipeline
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using
qr_ks_vs_async pipeline for comparing
> ```
## Discussion
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450)
## Motivation
`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.
The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.
## Technical Details
This PR improves the `qr_hpad` FMHA forward path in three parts:
- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.
To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile MX GEMM Packing Improvement (#5323)
## Motivation
Reduce the scale loading size and also has better utilization of MFMA
scale selection.
## Technical Details
Add up the packing of mx scales.
## Test Plan
Use the existing test cases.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
[CK Tile] Eight Waves pipeline GEMM (#4964)
## Motivation
Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM
## Technical Details
Summary:
- Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
- Use this block struct in ABQuant for encodings
- Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
- Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves
So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts
## Test Plan
Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).
Note: K padding test is disabled for this pipeline because it's not
implemented yet
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)
## Motivation
Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline
## Technical Details
The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.
Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
* both 32x32x64 and 16x16x128 scale MFMAs are supported
* Q and K scales are applied in hdim, V scales - in seqlen dimension
* column-major V only
* batch and group mode
* bias, Alibi (tested but no instances by default, just like fp8)
* masking etc.
Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008
## Test Plan
```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```
## Test Result
The tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)
## Motivation
Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
[CK_TILE] Add CK Tile bwd weight profiler (#4797)
## Motivation
To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.
## Technical Details
- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
Cleanup and refactoring related to tile loading (#4294)
## Proposed changes
Cleanup and refactoring done while implementing mixed precision for
fp16/bf16 x fp8
Key changes:
- Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and
refactored the API to use consistent naming conventions
- Updated load_tile_transpose functions to use output parameters instead
of return values for consistency
- Removed unused variable declarations and simplified type deduction
logic
- Define load_tile_with_elementwise to use tuple types explicitly for
clarity
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
---
🔁 Imported from
[ROCm/composable_kernel#3505](https://github.com/ROCm/composable_kernel/pull/3505)
🧑💻 Originally authored by @SamiAario-AMD
---------
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK_TILE] Extend support of mix precision microscaling BQuant (#4267)
## Proposed changes
Supported types combinations using BQuant=e8m0:
- A=bf16
- B=bf16,bf8,fp4
Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
- add support for scaling instructions in `DequantPack8`
- mx pipeline:
- extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
- block gemm:
- mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
- warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
---
🔁 Imported from
[ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689)
🧑💻 Originally authored by @EnricoDeg
---------
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* refactor: remove Default scheduler implementation as it not used anymore
* refactor: remove dead code from gemm universal kernel
* chore: add descriptive comments about amd intrinsic hardware sync instructions
* fix: label existing memory pipeline for aquant as intrawave
* [CK_TILE] unify double and single lds implementation (#108)
Unify LDS buffer management API for single and double buffering modes
This change consolidates the Local Data Store (LDS) buffer management by:
Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* update wp_pipeline
* fix a c++17 issue
* update for ci errors
* fix ci issues
* include a header to fix ci errors
* fix some rebase issues
* update with rebase
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* rebase to new develop
* fix clang format
* update code according to reviewer's comment
* Update README.md
* update code according to reviewer's comment
* update code according to reviewer's comment
* Update CMakeLists.txt
* Update README.md
* Update CMakeLists.txt
* Delete files
* Delete files
* Add unit tests
* Update test_gemm_quant_base.hpp
* merge bf16*fp4 example to develop branch
* fix clang format
* fix clang format
* Update CMakeLists.txt
* fix ci test
* fix clang format
* resolve conflicts
---------
Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Add missing copyright statements
* Use ck_tile::host_tensor_descriptor instead of a custom lambda
* Refactor use of check_data_type in test classes
* Use TEST_SUITE_NAME with TYPED_TEST_SUITE
* Remove an unused namespace
* Make dim3 const
* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile
* Add missing precision type combinations to CompV4 from CompV3
* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma
* Add missing precision type combinations to CompV3Wmma from CompV3
* Remove the basic and universal tests and their dependencies
* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B
* Use ADataType and BDataType instead of ComputeDataType for WarpGemm
* Explicitly set some return types to void
* Use more general typenames in InterleavedPKTypeLoader
* Add load_interleaved_pk_type.hpp to common.hpp
* Use std::is_same_v in load_int4_tile
* Add handling of LoadTranspose to load_int4_tile
* Factor out common code in several places using load_int4_tile
* Add support for pk_int4_t using load_int4_tile
* Fix formatting
* HasHotLoop is a constexpr
* Remove an unused function
* Remove some unused include statements
* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM
* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs
* Remove two barrier calls that HotLoopScheduler already calls
* No need to suppress a variable that hasn't been declared
* Replace six arg_parser arguments with constexpr literals
* Simplify run_gemm_test_prec_type
* The strides don't need to be passed via arg_parser as we use their default values
* The layouts don't need to be passed as arguments twice
* Pass M N and K as regular arguments, not using the argument parser
* We can now remove the argument parser
* Add a common file for precision types to be used in testing
* Convert basic and universal GEMM tests to use gtest
* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes
* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on
* Added a changelog entry
* Add missing copyright statements
* ifndef-define-endif is not needed with pragma once
* Fix a comment
* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp
* Disable the unreliable test MoeSortingCase4
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
32-bit dword) is used to store results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There are no transpose loading for RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging , the proper fix is in develop already.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output so this job looks hanging.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
* Factor out the three separate copies of load_interleaved_pk_type into a common utility class
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example
* Remove a duplicate function
* Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included
* I4 support introduced more failing test cases that mirror the existing ones for F8
* Simplify the check for which tests to skip (they all have F8 as A tensor type)
* Add a changelog entry
* add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* docs(README): update readme with new build instructions
* feat(grouped_gemm): add support back for non persistent kernel
* refactor(grouped_gemm): simplify tensor creation
* refactor(grouped_gemm): Persistance is now GemmConfig value for easier management
* chore(grouped_gemm): add print statements to ease debugging
* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration
* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity
* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines
* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle
* style: clang format work so far
* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug in for numerical errors in preshuffle pipeline
* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors
* refactor(gemm_presuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm
* chore(grouped_gemm): add/remove debug comments and debug print statements
* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes
* chore(gemm_profile): add new argument combinations
* fix: branch cleanup, formatting, refactoring
* fix: branch cleanup, formatting, refactoring
* chore(changelog): update changelog to reflect new featuer
* address review comments & nit
* Use read_tr in universal gemm
* Enable all instances back
* Revert example37 changes
* Resolve comments
* resolve comments 2
* Fix assertion msg
* fix the gemm basic
* change index_t to bool for preshuffle variable
* Solve the comment
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
* Initial commit
* Adding new tile partitioner to flatmm
* intermediate changes
* debugging kernels
* Updating flatmm example to universal gemm example
* updated flatmm kernel to run via gemmKernel
* update universal gemm to incorporate flatmm
* debug
* Fix flatmm call
* Fixing other kernels and tests for API changes
* clang formatted
* fixing gemm tests
* added test for flatmm and simplify kernel arguments
* adding flatmm test
* fix test for flatmm
* simplify gemm kernel with flatmm
* remove flatmm related files
* addressing review comments and code clean up
* resolving empty file
* resolving empty file
* clang formatted
* addressing review comments
* enable persistent kernel for flatmm
* reverted the removed files for flatmm
* reverted the removed files for flatmm
* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example
* some more renames
* clang formatted
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* comipling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernerl with pagedkv
* change names of file that are generated
* chang host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* updates to support int8 in 03_gemm example
* added comments, using aliases, helper functions
* test(gemm_universal): add test cases for int8 gemm pipeline
* fix(test_gemm): fix for failing test unit test for int8
* test(ck_tile): add int8 unit test for gemm universal
* refactor(gemm_universal): GPU reference verification for GEMM code improved
* style(gemm_universal): removed extra comments and did clang format
* merging recent changes to universal gemm to tile_engine
* ck tile engine integration work
* feat(tile_engine): add int8 support to tile engine ops/gemm
* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8
* style: Format code with clang-format-12
* refactor(tile_engine): address review comments
* style: removed unhelpful comments & unused variables.
* build: tile engine uses default config
* feat: add int8 support for CK_TILE GEMM
* style: added trailing commas to codegen_utils.py
* refactor: tile engine
* refactor: formatting and code review
* refactor: code formatting for python files
* fix: suppress build warning
* add support for gfx950
* refactor:KWarpTile size in gemms util
* Fix the branch and wrap up the k warp tile
* Add bf8 integration
* refactor: clang format and rebase
---------
Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
* comp v4 setup
* add a file
* Finished the coding of the feature, Compiler not in the way we supposed to have
* Update some of the code to better format
* get tback the restrict variable name, need to switch out to solve the transpose issue
* Solve the compiler issue on SHMEM conflict
* roll back to compute pipeline
* Add the changes from include/ck_tile
* Address the comments
* Merge from internal (#1857)
* enable batched_gemm_softmax_gemm_perm_wmma for gfx12
* disable instances with blocksize=256 in attention examples
* debuggging
* debug
* fixed lds_enabled
* debugging
* Fix and add limit to skiplds feature
* Enable skipLds feature and fix compilation bugs
* add ck_tile definitions for gfx12
* fix clang format and test/wmma_op
* updage instances cmake for gfx12
* disable the test_wmma_op on gfx12
* fix the builds for gfx950
* add gfx12 and gfx950 to default target list
* clean-up cmake file
* Initial introduction of OFP8 data types.
* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.
* Implementation of ConvertFP32Nearest in test_fp8_ocp.
* Remove dependence on possibly undeclared alias.
* Implement FP8OCP test for stochastic rounding mode.
* Implement FP8OCP tests for half_t type conversions.
* enable bf16 atomic add on gfx950
* Implement ConvertFP32Nearest test.
* Implement ConvertFP32Stochastic test.
* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.
* Refactoring. Move FP8 definitions into a separate header file.
* Enable easy switching between architectures.
* Fix compilation error for gfx942 architecture.
* Add fp4 type with constants
* only builf gfx950 branch for gfx950 target by default
* Enable OCP build of example_gemm_xdl_fp8.
* Fix formatting.
* fix the build logic for gfx950
* Improve GEMM example verbosity.
* Add constexpr where applicable.
* fix the logic of enabling XDL and WMMA instances
* Improve GEMM example verbosity.
* Enable build of example_gemm_xdl_fp8_bf8 test.
* Fix tests for gfx1101 architecture.
* Build DPP examples only on gfx103 and gfx11 architectures.
* Optionaly run either CPU or GPU verifications with GEMM examples.
* Extend GeneratorTensor_Sequential to produce values of prescribed data types.
* Add missing constructor.
* Add scale type and mxfp conversions
* Update conversions
* Add conversion tests
* Fix typo
* Improve infrastructure for OFP8 data type support.
* BUGFIX. Should not use FP8 as Compute/Accum data type.
* Add custom target for grouped_convnd_bwd_weight tests.
* Can build `tests` target on gfx950.
* Bugfixes on gfx1101 architecture.
* Fix dependencies.
* Add stochastic rounding tests
* Provide single point of truth for FP8 INF and NAN checks
* Prevent instantiation of operators that are not supported by FP8 data types
* Add FP8 type selection into client_axample CMakeLists.txt
* Prevent sccache server from shutting down during build
* Fix test success reporting logic
* Change default verification method to CPU.
GPU verification takes too much time to complete on the emulator.
* Add scale <-> float conversions
* Add scaled conversions with tests
* Add device conversions
* Make sure all tests and examples are built for gfx950
* Facilitate testing of FP8 data types on the emulator
* Introduce two new tensor generators
* Enable instances built for gfx94 to be built on gfx950
* Verify 35_splitk_gemm on floating point numbers.
splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.
* Format
* Verify 04_gemm_add_add_fastgelu on floating point numbers
* Verify 20_grouped_conv_bwd_weight on floating point numbers
* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers
* Verify more tests on floating point data
* Fix data types and improve testing verbocity.
* Add fp4 vectors
* Add debug tests
* Upgrade to NPI 573 build docker.
* Skip on gemm_universal tests.
The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.
* Add new mfma instructions and examples
* Add preprocessor directives for gfx950 specific code
* Fix gfx1101 build
* Document test availability
* Re-enable fp8 gemms for gfx94/95
* Cherry-pick GEMM Universal tests for FP8 data types
* Cleanup
* Add vector types and tests
* Add check_err function
* Add tensor generators
* CK_USE_GFX94 has already been set on this branch
* Fix
* Address formatting issues and leftovers
* Make fail/pass logic consistent within 01_gemm folder
Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.
* Fix GPU verification reporting logic.
* Update year in copyright notice.
* Cleanup
* Use `enum class` instead of `enum`
* Remove set_property for FP8 tests
* Add vector conversions
* Fix
* Fix linker errror
* Clean up
* Fix gfx950 conversions
* Clean up
* Fix more gfx950 conversions
* Fix even more gfx950 conversions
* Narrowing the scope of PR to OCP FP8 enablement only
* Add tests for OCP FP8 vector_type storage
* Fix client examples build
* Fix typo
* Update e8m0 casting
* Rename E8M0 type
* Update unpack method
* Cleanup merge artifacts
* Enable gemm kernel on all gfx9 architectures (#227)
* clean-up
* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)
* Enable support of 1, 2, 4, and 8-byte custom types in CK.
* Fix pool tests for OCP FP8 data type
* Fix build
* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350
* fix clang format
* Add new mfma instructions and examples
* Add preprocessor directives for gfx950 specific code
* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350
* fix clang format
* Fix clang format for the newly merged files
* Use the existing example instances for fp16 bf16 and int8
* Remove comment on new mfma instructions in MfmaInstr
* Update include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_gemm_xdl_cshuffle_v1.hpp
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* merge from public repo
* Fix ck build
* Fix ck build
* Use double for max_abs_in_val
* Move scaled_type_convert functions to a separate header (#251)
* re-enable building mha lib and gemm_universal_f8 instances for gfx950
* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* fix typo for CK_USE_OCP_FP8
* fix typo for CK_USE_OCP_FP8
* Add FP6 and BF6 types (#261)
* Add a rounding flag
* Add FP6 and BF6
* Add tests
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Clean up
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* fix one more typo
* Refactor E8M0 scale implementation (#262)
* Refactor E8M0 scale implementation
* Add MXFP6 and MXBF6 conversion methods (#270)
* Add conversions
* Add tests
* Add docstrings
* Add scaled conversions
* Add fp6/bf6 tests
* Remove misleading fp4 test case
* Add docstrings
* Clean up
* Address comments
* Set stricter tolerances for RNE tests
* Add missing tests
* Add native conversions to float
* Revert "Add native conversions to float"
This reverts commit 09467111f73b753c8cc3d597533b187940353dab.
* Update copyright years
* replace the fp6 with bf6 convert calls in test_bf6
* fix test_bf6
* enable smfmac test
* [MX FP8] Add Scaled Type Convert Functions for OCP FP8/BF8 data types (#271)
* Move scaled_type_convert functions to a separate header
* Introduce MX data tests
* Build MX tests only on relevant architectures
* Refactor E8M0 scale implementation
* Fix `config.h` typo
* Cleanup deprecated symbols
* Refactor `amd_ck_fp8.hpp`
* `scaled_type_convert` for `f8_ocp_t`
* Implement test for MX FP8 scaled type convert
* Implement test for MX BF8 scaled type convert
* Scaled type convert for vectors of 2 FP8 elements
* Scaled type convert for vectors of 16 FP8 elements
* Implementation of scaled conversion from F32 to F8
* Add tests for scaled conversions from FP32 to FP8
* Add documentation to the test functions
* Implementation of scaled conversion from F32x2 to F8x2
* Implementation of scaled conversion from F32x16 to F8x16
* Implementation of scaled conversion from F32x32 to F8x32
* Implementation of scaled conversion from F8x32 to F32x32
* Verified on the emulator
* MX FP GEMM - Example Template (#277)
Temporarily uses `DeviceGemmMultiD_ABScale_Xdl_CShuffle_V3` kernel and 128x128 scaling matrices.
Must be modified to use MX-native GEMM kernell with 16 or 32 component vectors per scale.
Verified on the emulator.
* Add vector support
* Add tests
* Add missing type aliases
* Fix test naming
* only build mx example for gfx950
* disable CK_USE_AMD_MFMA_GFX950 by default
* fic build for multiple archs
* fix typo
* fix typo
* Update unpack signature
* Fix merge
* Add size checks in pack function
* Add a flag
* Add conversions
* Fix build logic
* Update pack/unpack methods
* Remove unneeded AsType accessors
* Add docstrings
* Add a flag to config file
* Test the functionality of V_MFMA_F32_16X16X128_F8F6F4 and V_MFMA_F32_32X32X64_F8F6F4 instructions. (#293)
* Introduced MFMA tests
* Verified f8f6f4 MFMA Instructions
* Move flag logic to scaled_type_convert header
* Use pointers instead of array indices
* Fix a typo
* Update tests and pack functions
* Fix gemm gemm on gfx950
* Fix clang format
* restore the default gput target lists
* fix the jenkinsfile
* add missing ifdef
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>
* Finish the integration to develop and have the correct result
* Fix the gtest compilation error
* Fix the gemm_basic error
* clang format
* switch the default pipeline to V3
* restore cron trigger (#1863)
* fix the benchmark basic script
* add vectorloads on non-k dim for memory pipelines (#1856)
* Solving the Review comments
* Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845)
* Support bf16/fb8/bf8 datatypes for ck_tile/gemm
* remove commented out code.
* Addressing code review comments and enabling universal_gemm for all the supported data types.
* Merge conflict resolution.
* Solve the memory pipeline compilation error. Merge with the new change of CShuffle
* finish the feature, pass the tests
* Fix the pipeline and add the benchmark script for other data types
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* CK Tile - small fix to hotloop scheduler & KPack value. (#1867)
* Use SmemPack in HotLoop scheduler
* Additional debug print information
* Change KPack value.
Hardcode for now, as without AK1/BK1 there's no good way to determine
its value.
* Fix HotLoopScheduler MFMA instr parameters.
* address the new comments
* fix a small bug on the old
* Add a host mx gemm reference kernel (#1864)
* Add mx gemm reference kernel
* Update copyright year
* Update mx gemm example
* Use element-wise ops in the reference gemm
* External CI: enable amd-develop branch trigger (#1859)
* Solve FMHA error
* clang format
* Fix the memory pipleine
* sync with develop
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
Co-authored-by: kylasa <sudhir.kylasa@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Daniel Su <danielsu@amd.com>
* Use SmemPack in HotLoop scheduler
* Additional debug print information
* Change KPack value.
Hardcode for now, as without AK1/BK1 there's no good way to determine
its value.
* Fix HotLoopScheduler MFMA instr parameters.
* Refactor universal gemm policy.
* Adapt example to refactor changes.
* Introduce static encoding pattern
* Adding shuffled encoding patterns.
* Fix err in reverse tuple.
* Add transpose_tile2d
* Small refactoring + doc
* Enable reading on contiguous dimension in all layouts.
* Transpose A/B register tile if needed for comp v3 pipeline.
* Take contiguous dim size when calculating dram vector load size.
* A/B smem pack size taken from WarpGemm attributes
* Update B LDS layout and setup tile distribution pattern at class level.
* Fix static assert.
* Fix errors in examples.
* Formatting & fix IsTranspose
* Fix VectorSize & refactor.
* Add error loging messages.
* Fix VecLoadSize and TranspseC for mem pipeline.
* Update unit-tests & disable mem pipeline.
* Clang format
* Update include/ck_tile/core/tensor/tile_window.hpp
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
* Fix compilation and reviewers comments.
* Refactor unit-test. Fallback to non-universal gemm.
Need to use GemmPipelineAGmemBGmemCRegV1 for now,
since GemmKernel is now supporting also non-K major vector reads.
---------
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm
* Finished the 2x2 warp gemm policy and the block selection mechanism
* Clang format
* address poyen's comment
* Address feedbacks
* Fixed the compilation issue
* Change the function name