mirror of https://github.com/ROCm/composable_kernel.git synced 2026-04-19 22:39:03 +00:00

Files

Anton Gorenko 52b4860a30 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Amending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding is not supported when BDataType is pk_i4_t. Fixed the check and removed the padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the missing parameters

* Fixed review comments and added executable to CMakeLists

* Fixing clang format

* Fixing build errors

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

The permutation required by the conversion from pk_i4_t to half_t does not
depend on PermuteB; the two can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>

2025-06-04 12:22:33 +06:00

CMakeLists.txt

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

common.hpp

Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

gemm_dl_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dl_fp32.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_dl_int8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dpp_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_wmma_bf16_pk_i4_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_bf16_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_bf16.cpp

Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671 )

2024-11-18 14:07:04 -08:00

gemm_wmma_fp8_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_fp8_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_pk_i4_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_wmma_int8.cpp

Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671 )

2024-11-18 14:07:04 -08:00

gemm_xdl_bf16_pk_i4_v3.cpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

gemm_xdl_bf16_streamk_v3.cpp

BF16 GEMM Stream-K (#1541 )

2025-01-02 10:30:04 -08:00

gemm_xdl_bf16_v3.cpp

[GEMM] UniversalGemm update (#1262 )

2024-04-26 12:56:07 -05:00

gemm_xdl_bf16.cpp

BF16 GEMM Stream-K (#1541 )

2025-01-02 10:30:04 -08:00

gemm_xdl_fp8_bf8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp8_pk_i4_bpreshuffle_v3.cpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

gemm_xdl_fp8_pk_i4_v3.cpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

gemm_xdl_fp8_streamk_v3.cpp

universal streamk fp8 changes (#1665 )

2024-11-21 08:21:37 -08:00

gemm_xdl_fp8_v3.cpp

[GEMM] F8 GEMM, performance optimized. (#1384 )

2024-07-19 22:06:52 +08:00

gemm_xdl_fp8.cpp

Use new mfma instructions for FP8 on gfx950 (#2202 )

2025-05-19 17:29:51 -07:00

gemm_xdl_fp16_fp8_streamk_v3.cpp

f8/bf16 GEMM Stream-K (#1879 )

2025-03-31 20:30:17 -06:00

gemm_xdl_fp16_fp8_v3.cpp

Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 )

2025-01-02 11:48:06 +08:00

gemm_xdl_fp16_fp8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp16_pk_i4_v3_b_scale.cpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

gemm_xdl_fp16_pk_i4_v3.cpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

gemm_xdl_fp16_streamk_v3.cpp

universal streamk fp8 changes (#1665 )

2024-11-21 08:21:37 -08:00

gemm_xdl_fp16_v2.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp16_v3.cpp

Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 )

2025-01-02 11:48:06 +08:00

gemm_xdl_fp16.cpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

gemm_xdl_fp64.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_xdl_int8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_lds_direct_load_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_lds_direct_load_fp32.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_skip_b_lds_fp16.cpp

Disable XDL kernels on unsupported HW Add ck::is_xdl_supported (#768 )

2023-07-26 07:19:55 -07:00

gemm_xdl_streamk.cpp

Remove CK_USE_AMD_MFMA_GFX950 (#1935 )

2025-03-04 10:32:25 -08:00

gemm_xdl_wavelet_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

README.md

Universal streamk with atomics (#1360 )

2024-07-05 21:40:30 -07:00

run_gemm_example_streamk_v2.inc

Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

run_gemm_example_streamk.inc

Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

run_gemm_example_v2.inc

Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

run_gemm_example.inc

Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

README.md

Instructions for `example_gemm_xdl`

Run `example_gemm_xdl`

#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_gemm_xdl 0 1 5

Instructions for `example_gemm_xdl_fp16_streamk_v3`

Run `example_gemm_xdl_fp16_streamk_v3`

arg1: verification (0=no, 1=yes)
arg2: initialization (0=no init, 1=integer value, 2=decimal value)
arg3: time kernel (0=no, 1=yes)
arg4 to 9: M (256x), N(128x), K(32x), StrideA, StrideB, StrideC
arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)
arg11: Grid_size(-1 for max occupancy)
bin/example_gemm_xdl_fp16_streamk_v3 1 2 1 3840 4096 4096 4096 4096 4096 1 -1
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
problem {M:3840, N:4096, K:4096, SA:4096, SB:4096, SC:4096, MP:4032, NP:4096, KRead:4096, KP:4096, AK0:512, BK0:2048, MBlock: 18, NBlock: 16, Stream-K Selection:1, Grid size:-1}
Perf: 0.292022 ms, 441.23 TFlops, 330.348 GB/s, DeviceGemmXdlUniversal<MNPadding, RRR> BlkSize: 256, BlkTile: 224x256x64, WaveTile: 16x16, WaveMap: 7x8, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
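
The figures in the sample output above can be cross-checked by hand. The following is a minimal sketch, not code from the repository: it assumes the usual GEMM cost model (2·M·N·K floating-point operations; A and B read once and C written once), 2-byte elements for A, B and C as the fp16 example name suggests, and padding that rounds M and N up to the block tile reported as `BlkTile: 224x256x64`. All function names are hypothetical and used only for illustration.

```python
def gemm_flops(m, n, k):
    # One multiply and one add per (m, n, k) triple.
    return 2 * m * n * k


def gemm_bytes(m, n, k, bytes_a=2, bytes_b=2, bytes_c=2):
    # Assumed traffic model: read A (M x K) and B (K x N) once, write C (M x N) once.
    # The 2-byte default element size is an assumption based on the fp16 example.
    return bytes_a * m * k + bytes_b * k * n + bytes_c * m * n


def pad_to_tile(length, tile):
    # Round length up to a multiple of tile; returns (padded_length, tile_count).
    blocks = (length + tile - 1) // tile
    return blocks * tile, blocks


# Values taken from the sample run above.
M, N, K = 3840, 4096, 4096
time_ms = 0.292022
seconds = time_ms * 1e-3

tflops = gemm_flops(M, N, K) / seconds / 1e12
gb_per_s = gemm_bytes(M, N, K) / seconds / 1e9
mp, mblock = pad_to_tile(M, 224)   # M dimension of BlkTile
np_, nblock = pad_to_tile(N, 256)  # N dimension of BlkTile

print(f"{tflops:.2f} TFlops, {gb_per_s:.3f} GB/s")
print(f"MP={mp}, MBlock={mblock}, NP={np_}, NBlock={nblock}")
```

Under these assumptions the results agree with the `Perf:` line (441.23 TFlops, 330.348 GB/s) and with the `MP`, `NP`, `MBlock` and `NBlock` fields of the `problem` line. The other fields (`AK0`, `BK0`, `KRead`, `KP`) depend on kernel parameters that the output does not show, so they are not reproduced here.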
