27 Commits

Author SHA1 Message Date
Johannes Graner
58475d3f45 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions
 (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 09:18:14 +00:00
SamiAario-AMD
95dc496d30 [rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)
Cleanup and refactoring related to tile loading
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Cleanup and refactoring done while implementing mixed precision for
fp16/bf16 x fp8

Key changes:

- Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and
refactored the API to use consistent naming conventions
- Updated load_tile_transpose functions to use output parameters instead
of return values for consistency
- Removed unused variable declarations and simplified type deduction
logic
- Define load_tile_with_elementwise to use tuple types explicitly for
clarity

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-03-02 12:21:44 +00:00
Enrico Degregori
4c626aeaa6 [rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant
 (#4267)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Supported types combinations using BQuant=e8m0:
 - A=bf16
 - B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
 - add support for scaling instructions in `DequantPack8`
 - mx pipeline:
   - extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
 - block gemm:
   - mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
 - warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-24 17:57:02 +00:00
Emily Martins
fc3180120e [rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum

## Motivation

Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.

## Technical Details

Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic        = 0u,
    Reduction     = 1u,
    TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic = 0u,
    Linear = 1u,
    Tree   = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan

No new functionality was added, so no new tests were added; I just
validated existing tests and examples.

## Test Result

All tests passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 06:41:15 +00:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distrubution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* reslove comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
Emily Martins
22b945e06e [CK_TILE] Stream-K Tree Reduction and Cache Skipping Integration (#3371)
* CK Tile Stream-K Tree Reduction

This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.

Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.

Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.

* Move Stream-K kernel files to a subdirectory

* Cleanup Code Style & Handle Unsupported Reductions

This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable

* Run new copyright script on new files
2025-12-14 14:49:49 -07:00
eliotwang
715671e419 Bf16*fp4 gemm (#2801)
* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* rebase to new develop

* fix clang format

* update code according to reviewer's comment

* Update README.md

* update code according to reviewer's comment

* update code according to reviewer's comment

* Update CMakeLists.txt

* Update README.md

* Update CMakeLists.txt

* Delete files

* Delete files

* Add unit tests

* Update test_gemm_quant_base.hpp

* merge bf16*fp4 example to develop branch

* fix clang format

* fix clang format

* Update CMakeLists.txt

* fix ci test

* fix clang format

* resolve conflicts

---------

Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-11 07:20:29 -08:00
Aviral Goel
6cb0bc2d11 feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* feat(gemm_quant): add RRR and CRR layout support for aquant gemm

* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm

* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests

* fix: test cases compilation failure due to PR# 2095

* fix: make condition to filter out tests for gfx950 more explicit

* need to support the gfx950

* fix: add layout suppot for gfx950

* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)

* WIP: add support for pk_int4_t for aquant mode layouts RR and CR

* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant

* fix: compile time error for gfx950

* fix: minor bug where is_a_load_tr_v() was mising

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* docs: update changelog with new layout support info

* Update CHANGELOG.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* refactor: break test instances into multiple cpp files to reduce build time (#3319)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* refactor: break test instances into multiple cpp files to reduce build time

* chore: rename file for better code readability

* fix: merge conflict resolution

* fix: remove memory pipeline because new layout is not compatible

* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()

* refactor: address review comments

* solve the conflict

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-02 14:59:07 -08:00
arai713
24d88d2472 [CK_TILE] Move DataTypeTraits into a Common File (#3146)
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-11-27 09:09:54 -08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
SamiAario-AMD
f2cfc6b94e Remove "basic" and universal GEMM tests, and incorporate their test cases into the GEMM pipeline tests (#3094)
* Add missing copyright statements

* Use ck_tile::host_tensor_descriptor instead of a custom lambda

* Refactor use of check_data_type in test classes

* Use TEST_SUITE_NAME with TYPED_TEST_SUITE

* Remove an unused namespace

* Make dim3 const

* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile

* Add missing precision type combinations to CompV4 from CompV3

* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma

* Add missing precision type combinations to CompV3Wmma from CompV3

* Remove the basic and universal tests and their dependencies

* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B

* Use ADataType and BDataType instead of ComputeDataType for WarpGemm

* Explicitly set some return types to void

* Use more general typenames in InterleavedPKTypeLoader

* Add load_interleaved_pk_type.hpp to common.hpp

* Use std::is_same_v in load_int4_tile

* Add handling of LoadTranspose to load_int4_tile

* Factor out common code in several places using load_int4_tile

* Add support for pk_int4_t using load_int4_tile

* Fix formatting
2025-11-13 11:01:27 -08:00
Ville Pietilä
9ee9f4d2a3 [CK_TILE] Improve grouped conv kernel name generation (#3028)
* Improve the grouped conv kernel name generation in CK Tile.

* Fix building CShuffle epilogue tests.

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-10-30 13:19:07 +01:00
Emily Martins
352dee5225 Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.
2025-10-17 09:33:38 -07:00
damien-lejeune
46c10c316d Update include path to break the remod's cyclic dep issue (#2978)
* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-13 13:24:47 +02:00
SamiAario-AMD
47cd0d5cff Add gemm weight preshuffle pk_int_t support (#2858)
* Factor out the three separate copies of load_interleaved_pk_type into a common utility class

* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example

* Remove a duplicate function

* Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included

* I4 support introduced more failing test cases that mirror the existing ones for F8

* Simplify the check for which tests to skip (they all have F8 as A tensor type)

* Add a changelog entry

* add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-09-18 21:26:10 -07:00
linqunAMD
b7a806f244 [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f… (#2837)
* [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f3d7b4 )

WarpPerBlock_M * WarpPerBlock_N are not equal with ThreadPerBlock_M * ThreadPerBlock_N /warpSize. we should calculate BlockSize from WarpPerBlock_M * WarpPerBlock_N

To compatible with wave32, function GetBlockSize is added to calculate correct size in host side.

* fix blocksize for all kernel related with generic2dblockshap

* remove constexpr for blocks
2025-09-16 08:47:55 -07:00
linqunAMD
c254f3d7b4 [CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795)
BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side.

Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape
2025-09-10 08:29:20 +08:00
arai713
0282d98412 [CK TILE] Stream-K tile partitioner (#2708)
* initial commit for skeleton code

* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code

* fixed up code to match CK Tile convention: data type changes, naming changes, etc.

* change for num_sk_blocks data type

* formatting fix

* minor fixes

* moved reduction argument to template

* resolved comments from PR review: standardizing naming, pruning unneeded code

* resolve errors from merge of device op PR: moved enum to common file

* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems

* unsigned type fix

* add const qualifier

* added documentation for template parameters

* documentation edit
2025-09-03 13:38:17 -07:00
Eisuke Kawashima
4e586ca958 chore: unset executable permission (#2303)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2025-06-10 09:13:59 -07:00
BingYuan.Zhou
5a0d693b86 fix ck_tile/basic_gemm build error (#1988) 2025-03-20 22:01:14 -07:00
aledudek
2312f4aaf9 [CK_TILE] Add GetName for GEMM kernels (#1791)
* [CK_TILE] Add GetName functions for Gemm Kernels

* [CK_TILE] Add GetName for grouped gemm

* [CK_TILE] Add GetName for gemm - review changes

* [CK_TILE] Print also gemm problem pipeline and shape

* [CK_TILE] Print also GemmPipelineScheduler

* [CK_TILE] GetName - fixed Scheduler <<operator visibility

* [CK_TILE] GetName info adjustments

* [CK_TILE] GetName post-merge fix

* [CK_TILE] GetName - add general concat function

* [CK_TILE] GetName - small adjustments, format change

* post merge develop fix

* Remove commented code

* Merge from internal (#1857)

* enable batched_gemm_softmax_gemm_perm_wmma for gfx12

* disable instances with blocksize=256 in attention examples

* debuggging

* debug

* fixed lds_enabled

* debugging

* Fix and add limit to skiplds feature

* Enable skipLds feature and fix compilation bugs

* add ck_tile definitions for gfx12

* fix clang format and test/wmma_op

* updage instances cmake for gfx12

* disable the test_wmma_op on gfx12

* fix the builds for gfx950

* add gfx12 and gfx950 to default target list

* clean-up cmake file

* Initial introduction of OFP8 data types.

* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.

* Implementation of ConvertFP32Nearest in test_fp8_ocp.

* Remove dependence on possibly undeclared alias.

* Implement FP8OCP test for stochastic rounding mode.

* Implement FP8OCP tests for half_t type conversions.

* enable bf16 atomic add on gfx950

* Implement ConvertFP32Nearest test.

* Implement ConvertFP32Stochastic test.

* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.

* Refactoring. Move FP8 definitions into a separate header file.

* Enable easy switching between architectures.

* Fix compilation error for gfx942 architecture.

* Add fp4 type with constants

* only builf gfx950 branch for gfx950 target by default

* Enable OCP build of example_gemm_xdl_fp8.

* Fix formatting.

* fix the build logic for gfx950

* Improve GEMM example verbosity.

* Add constexpr where applicable.

* fix the logic of enabling XDL and WMMA instances

* Improve GEMM example verbosity.

* Enable build of example_gemm_xdl_fp8_bf8 test.

* Fix tests for gfx1101 architecture.

* Build DPP examples only on gfx103 and gfx11 architectures.

* Optionaly run either CPU or GPU verifications with GEMM examples.

* Extend GeneratorTensor_Sequential to produce values of prescribed data types.

* Add missing constructor.

* Add scale type and mxfp conversions

* Update conversions

* Add conversion tests

* Fix typo

* Improve infrastructure for OFP8 data type support.

* BUGFIX. Should not use FP8 as Compute/Accum data type.

* Add custom target for grouped_convnd_bwd_weight tests.

* Can build `tests` target on gfx950.

* Bugfixes on gfx1101 architecture.

* Fix dependencies.

* Add stochastic rounding tests

* Provide single point of truth for FP8 INF and NAN checks

* Prevent instantiation of operators that are not supported by FP8 data types

* Add FP8 type selection into client_axample CMakeLists.txt

* Prevent sccache server from shutting down during build

* Fix test success reporting logic

* Change default verification method to CPU.

GPU verification takes too much time to complete on the emulator.

* Add scale <-> float conversions

* Add scaled conversions with tests

* Add device conversions

* Make sure all tests and examples are built for gfx950

* Facilitate testing of FP8 data types on the emulator

* Introduce two new tensor generators

* Enable instances built for gfx94 to be built on gfx950

* Verify 35_splitk_gemm on floating point numbers.

splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.

* Format

* Verify 04_gemm_add_add_fastgelu on floating point numbers

* Verify 20_grouped_conv_bwd_weight on floating point numbers

* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers

* Verify more tests on floating point data

* Fix data types and improve testing verbocity.

* Add fp4 vectors

* Add debug tests

* Upgrade to NPI 573 build docker.

* Skip on gemm_universal tests.

The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.

* Add new mfma instructions and examples

* Add preprocessor directives for gfx950 specific code

* Fix gfx1101 build

* Document test availability

* Re-enable fp8 gemms for gfx94/95

* Cherry-pick GEMM Universal tests for FP8 data types

* Cleanup

* Add vector types and tests

* Add check_err function

* Add tensor generators

* CK_USE_GFX94 has already been set on this branch

* Fix

* Address formatting issues and leftovers

* Make fail/pass logic consistent within 01_gemm folder

Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.

* Fix GPU verification reporting logic.

* Update year in copyright notice.

* Cleanup

* Use `enum class` instead of `enum`

* Remove set_property for FP8 tests

* Add vector conversions

* Fix

* Fix linker errror

* Clean up

* Fix gfx950 conversions

* Clean up

* Fix more gfx950 conversions

* Fix even more gfx950 conversions

* Narrowing the scope of PR to OCP FP8 enablement only

* Add tests for OCP FP8 vector_type storage

* Fix client examples build

* Fix typo

* Update e8m0 casting

* Rename E8M0 type

* Update unpack method

* Cleanup merge artifacts

* Enable gemm kernel on all gfx9 architectures (#227)

* clean-up

* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)

* Enable support of 1, 2, 4, and 8-byte custom types in CK.

* Fix pool tests for OCP FP8 data type

* Fix build

* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350

* fix clang format

* Add new mfma instructions and examples

* Add preprocessor directives for gfx950 specific code

* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350

* fix clang format

* Fix clang format for the newly merged files

* Use the existing example instances for fp16 bf16 and int8

* Remove comment on new mfma instructions in MfmaInstr

* Update include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_gemm_xdl_cshuffle_v1.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* merge from public repo

* Fix ck build

* Fix ck build

* Use double for max_abs_in_val

* Move scaled_type_convert functions to a separate header (#251)

* re-enable building mha lib and gemm_universal_f8 instances for gfx950

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* fix typo for CK_USE_OCP_FP8

* fix typo for CK_USE_OCP_FP8

* Add FP6 and BF6 types (#261)

* Add a rounding flag

* Add FP6 and BF6

* Add tests

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Clean up

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* fix one more typo

* Refactor E8M0 scale implementation (#262)

* Refactor E8M0 scale implementation

* Add MXFP6 and MXBF6 conversion methods (#270)

* Add conversions

* Add tests

* Add docstrings

* Add scaled conversions

* Add fp6/bf6 tests

* Remove misleading fp4 test case

* Add docstrings

* Clean up

* Address comments

* Set stricter tolerances for RNE tests

* Add missing tests

* Add native conversions to float

* Revert "Add native conversions to float"

This reverts commit 09467111f73b753c8cc3d597533b187940353dab.

* Update copyright years

* replace the fp6 with bf6 convert calls in test_bf6

* fix test_bf6

* enable smfmac test

* [MX FP8] Add Scaled Type Convert Functions for OCP FP8/BF8 data types (#271)

* Move scaled_type_convert functions to a separate header

* Introduce MX data tests

* Build MX tests only on relevant architectures

* Refactor E8M0 scale implementation

* Fix `config.h` typo

* Cleanup deprecated symbols

* Refactor `amd_ck_fp8.hpp`

* `scaled_type_convert` for `f8_ocp_t`

* Implement test for MX FP8 scaled type convert

* Implement test for MX BF8 scaled type convert

* Scaled type convert for vectors of 2 FP8 elements

* Scaled type convert for vectors of 16 FP8 elements

* Implementation of scaled conversion from F32 to F8

* Add tests for scaled conversions from FP32 to FP8

* Add documentation to the test functions

* Implementation of scaled conversion from F32x2 to F8x2

* Implementation of scaled conversion from F32x16 to F8x16

* Implementation of scaled conversion from F32x32 to F8x32

* Implementation of scaled conversion from F8x32 to F32x32

* Verified on the emulator

* MX FP GEMM - Example Template (#277)

Temporarily uses `DeviceGemmMultiD_ABScale_Xdl_CShuffle_V3` kernel and 128x128 scaling matrices.
Must be modified to use MX-native GEMM kernell with 16 or 32 component vectors per scale.

Verified on the emulator.

* Add vector support

* Add tests

* Add missing type aliases

* Fix test naming

* only build mx example for gfx950

* disable CK_USE_AMD_MFMA_GFX950 by default

* fic build for multiple archs

* fix typo

* fix typo

* Update unpack signature

* Fix merge

* Add size checks in pack function

* Add a flag

* Add conversions

* Fix build logic

* Update pack/unpack methods

* Remove unneeded AsType accessors

* Add docstrings

* Add a flag to config file

* Test the functionality of V_MFMA_F32_16X16X128_F8F6F4 and  V_MFMA_F32_32X32X64_F8F6F4 instructions. (#293)

* Introduced MFMA tests

* Verified f8f6f4 MFMA Instructions

* Move flag logic to scaled_type_convert header

* Use pointers instead of array indices

* Fix a typo

* Update tests and pack functions

* Fix gemm gemm on gfx950

* Fix clang format

* restore the default gput target lists

* fix the jenkinsfile

* add missing ifdef

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>

* restore cron trigger (#1863)

* add vectorloads on non-k dim for memory pipelines (#1856)

* Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845)

* Support bf16/fb8/bf8 datatypes for ck_tile/gemm

* remove commented out code.

* Addressing code review comments and enabling universal_gemm for all the supported data types.

* Merge conflict resolution.

* Solve the memory pipeline compilation error. Merge with the new change of CShuffle

* finish the feature, pass the tests

* Fix the pipeline and add the benchmark script for other data types

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

* Extract prec_str and add separator to concat

* GetName add

* CK Tile - small fix to hotloop scheduler & KPack value. (#1867)

* Use SmemPack in HotLoop scheduler

* Additional debug print information

* Change KPack value.

Hardcode for now, as without AK1/BK1 there's no good way to determine
its value.

* Fix HotLoopScheduler MFMA instr parameters.

* Resolve merge issues

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
Co-authored-by: kylasa <sudhir.kylasa@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-02-11 09:49:48 +01:00
dummycoderfe
686a58a912 [Ck tile] layernorm2d fwd optimize (#1637)
* optimze small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix stype in review

---------

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-11-08 12:28:23 +08:00
carlushuang
c3a4800c5f [CK_TILE] layernorm support fused-quant/fused-add (#1604)
* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment
2024-10-31 14:54:53 +08:00
carlushuang
db376dd8a4 introducing ck_tile! (#1216)
* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950

* add code

* fix build wip

* fix xx

* now can build

* naming

* minor fix

* wip fix

* fix macro for exp2; fix warpgemm a/b in transposedC

* unify as tuple_array

* Update the required Python version to 3.9

* Update executable name in test scripts

* re-structure tuple/array to avoid spill

* Merge function templates

* Fix format

* Add constraint to array<> ctor

* Re-use function

* Some minor changes

* remove wrong code in store_raw()

* fix compile issue in transpose

* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'

* let more integral_constant->constant, and formating

* make sure thread_buffer can be tuple/array

* temp fix buffer_store spill

* not using custom data type by default, now we can have ISA-level same code as opt_padding

* fix compile error, fp8 not ready now

* fix fp8 duplicated move/shift/and/or problem

* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode

* fix scratch in fp8 kernel

* update some readme

* fix merge from upstream

* sync with upstream

* sync upstream again

* sync 22

* remove unused

* fix clang-format

* update README of ck_tile example

* fix several issue

* let python version to be 3.8 as minimal

* remove ck_tile example from default cmake target like all/install/check

* remove mistake

* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg

* fix some bug in group-mode masking and codegen. update README

* F8 quantization for FMHA forward (#1224)

* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline

* Add element function to fmha api

* Adjust P elementwise function

* Fix bug of elementwise op, our elementwise op is not inout

* Add some elementwise op, prepare to quantization

* Let generate.py can generate different elementwise function

* To prevent compiler issue, remove the elementwise function we have not used.

* Remove f8 pipeline, we should share the same pipeline even in f8

* Remove remove_cvref_t

* Avoid warning

* Fix wrong fp8 QK/KV block gemm setting

* Check fp8 rounding error in check_err()

* Set fp8 rounding error for check_err()

* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode

* 1. codgen the f8 api and kernel
2. f8 host code

* prevent warning in filter mode

* Remove not-in-use elementwise function kargs

* Remove more not-in-use elementwise function kargs

* Small refinements in C++ source files

* Use conditional_t<> to simplify code

* Support heterogeneous argument for binary function types

* Re-use already-existing scales<> functor template

* Fix wrong value produced by saturating

* Generalize the composes<> template

* Unify saturates<> implementation

* Fix type errors in composes<>

* Extend less_equal<>

* Reuse the existing template less_equal<> in check_err()

* Add equal<float> & equal<double>

* Rename check_err() parameter

* Rename check_err() parameter

* Add FIXME comment for adding new macro in future

* Remove unnecessary cast to void

* Eliminate duplicated code

* Avoid dividing api pool into more than 2 groups

* Use more clear variable names

* Use affirmative condition in if stmt

* Remove blank lines

* Donot perfect forwarding in composes<>

* To fix compile error, revert generate.py back to 4439cc107d

* Fix bug of p element function

* Add compute element op to host softmax

* Remove element function in api interface

* Extract user parameter

* Rename pscale and oscale variable

* rename f8 to fp8

* rename more f8 to fp8

* Add pipeline::operator() without element_functor

* 1. Remove deprecated pipeline enum
2. Refine host code parameter

* Use quantization range as input

* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description

* Refine description

* prevent early return

* unify _squant kernel name in cpp, update README

* Adjust the default range.

* Refine error message and bias range

* Add fp8 benchmark and smoke test

* fix fp8 swizzle_factor=4 case

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-04-15 19:27:12 -05:00