mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 02:05:50 +00:00

Files

emezh 3c207a18b0 Verify HostTensorDescriptor when it is created (#2829 )

* add proper GEMM layout verification

* Handle "auto" strides.

CalculateStrides is only called when the tensor's strides are empty or all of them are <=0 (auto strides).
CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims.
ValidateStrides throws if any of the tensor's strides is <=0.
profile_gemm_multiply_add is updated to support "auto" strides for tensors.

Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes)
auto-strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1
Note: -1 should be deprecated (use 0 instead).

explicit strides (same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128

explicit strides (not the same as auto)
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
	bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138

mix of explicit and auto strides
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0

invalid stride
	bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64
	terminate called after throwing an instance of 'std::runtime_error'
	  what():  Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1
	Aborted (core dumped)
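
The stride rules described in this commit message can be illustrated with a small standalone sketch. It is not CK's implementation: `Layout`, `is_auto`, `packed_strides` and `resolve_strides` are hypothetical names, and the validation rule (unit stride on the contiguous inner dim, leading stride at least as large as the contiguous length) is an assumption inferred from the "invalid stride" run above. Unlike the later change that handles partially initialized strides, this sketch rejects a mix of explicit and auto values.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the GEMM layout tags in ck::tensor_layout.
enum class Layout { RowMajor, ColumnMajor };

// Strides are "auto" when none are given or every given stride is <= 0.
bool is_auto(const std::vector<long>& strides)
{
    for(long s : strides)
        if(s > 0)
            return false;
    return true;
}

// Packed strides. The layout only decides the order of the inner two dims;
// outer (batch) dims are stacked on top of the packed inner matrix.
std::vector<long> packed_strides(const std::vector<long>& lens, Layout layout)
{
    assert(lens.size() >= 2); // the sketch covers (batched) matrices only
    const std::size_t n = lens.size();
    std::vector<long> strides(n, 1);
    if(layout == Layout::RowMajor)
        strides[n - 2] = lens[n - 1]; // row step = number of columns
    else
        strides[n - 1] = lens[n - 2]; // column step = number of rows
    long step = lens[n - 2] * lens[n - 1];
    for(std::size_t i = n - 2; i-- > 0;)
    {
        strides[i] = step;
        step *= lens[i];
    }
    return strides;
}

// Returns usable strides: packed ones for "auto" input, otherwise the
// caller's strides after validation. Throws on anything inconsistent.
std::vector<long> resolve_strides(const std::vector<long>& lens,
                                  const std::vector<long>& strides,
                                  Layout layout)
{
    if(is_auto(strides))
        return packed_strides(lens, layout);

    const std::size_t n = lens.size();
    bool ok             = n >= 2 && strides.size() == n;
    for(std::size_t i = 0; ok && i < n; ++i)
        ok = strides[i] > 0; // no non-positive stride once any is explicit
    if(ok)
    {
        // Assumed rule: contiguous inner dim, padded leading dim allowed.
        ok = layout == Layout::RowMajor
                 ? (strides[n - 1] == 1 && strides[n - 2] >= lens[n - 1])
                 : (strides[n - 2] == 1 && strides[n - 1] >= lens[n - 2]);
    }
    if(!ok)
        throw std::runtime_error("invalid strides for the requested layout");
    return strides;
}
```

With lengths {128, 128}, empty or all-zero strides resolve to {128, 1} for RowMajor and {1, 128} for ColumnMajor, a padded {130, 1} is accepted, and {64, 1} is rejected, which mirrors the "invalid stride" run above.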

* add more names to ck::tensor_layout for easier namespace hierarchy checking

* updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp

* added handling of partially initialized strides for GEMM. fixed more tests.

* clang-format and more fixes

* replace long dash with a simple hyphen; the long dash causes build failure in CK codegen.

* increase size of input, otherwise output size becomes zero or negative with large filter size

* select stride based on layout

* specify layout explicitly to avoid errors in HostTensorDescriptor creation

* add validation for higher GEMM tensor dimensions; add docstring to `HostTensorDescriptor`

* Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors

* fix test (incl removing invalid config)

* fix moe examples:
- (in .cpp) add layout argument to non-2D tensors
- (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index)

* fix moe_gemm2 example.

* fix profile and wmma examples

* clean-up early mods for ckprofile. verified with:
```
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0
ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138
ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138
#
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0
ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128
#
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0
# ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0    # not implemented
# ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0    # not implemented
ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128
#
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0
ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138
#
example_gemm_add_multiply_dl_fp16
example_gemm_add_multiply_xdl_fp16
#
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0
ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128
```

* temporarily skip first 8 test configs - they throw errors

* temporarily skip first 8 test configs in wmma too - they throw errors

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: db2524be2d]

2025-09-25 18:22:13 -07:00

CMakeLists.txt

TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 )

2025-09-17 14:50:15 -07:00

common.hpp

TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 )

2025-09-17 14:50:15 -07:00

gemm_dl_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dl_fp32.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_dl_int8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_dpp_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_wmma_bf16_pk_i4_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_bf16_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_bf16.cpp

Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671 )

2024-11-18 14:07:04 -08:00

gemm_wmma_fp8_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_fp8_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_pk_i4_v3_b_scale.cpp

Support b_scale: (#2350 )

2025-07-24 18:49:58 -07:00

gemm_wmma_fp16_pk_i4_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16_v3.cpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

gemm_wmma_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_wmma_int8.cpp

Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671 )

2024-11-18 14:07:04 -08:00

gemm_xdl_bf16_pk_i4_v3.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_bf16_streamk_v3.cpp

chore: unset executable permission (#2303 )

2025-06-10 09:13:59 -07:00

gemm_xdl_bf16_v3.cpp

[GEMM] UniversalGemm update (#1262 )

2024-04-26 12:56:07 -05:00

gemm_xdl_bf16.cpp

chore: unset executable permission (#2303 )

2025-06-10 09:13:59 -07:00

gemm_xdl_fp8_bf8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp8_pk_i4_bpreshuffle_v3.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_fp8_pk_i4_v3.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_fp8_streamk_v3.cpp

chore: unset executable permission (#2303 )

2025-06-10 09:13:59 -07:00

gemm_xdl_fp8_v3.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_fp8.cpp

Use new mfma instructions for FP8 on gfx950 (#2202 )

2025-05-19 17:29:51 -07:00

gemm_xdl_fp16_fp8_streamk_v3.cpp

f8/bf16 GEMM Stream-K (#1879 )

2025-03-31 20:30:17 -06:00

gemm_xdl_fp16_fp8_v3.cpp

Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 )

2025-01-02 11:48:06 +08:00

gemm_xdl_fp16_fp8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp16_pk_i4_v3_b_scale.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_fp16_pk_i4_v3.cpp

Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 )

2025-09-12 08:17:07 -07:00

gemm_xdl_fp16_streamk_v3.cpp

universal streamk fp8 changes (#1665 )

2024-11-21 08:21:37 -08:00

gemm_xdl_fp16_v2.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_fp16_v3.cpp

Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 )

2025-01-02 11:48:06 +08:00

gemm_xdl_fp16.cpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

gemm_xdl_fp64.cpp

upgrade from clang-format-12 to clang-format-18 (#2568 )

2025-07-28 11:34:07 -07:00

gemm_xdl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_xdl_int8.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_lds_direct_load_fp16.cpp

Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 )

2025-06-12 09:25:59 +08:00

gemm_xdl_lds_direct_load_fp32_tf32.cpp

TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 )

2025-09-17 14:50:15 -07:00

gemm_xdl_lds_direct_load_fp32.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

gemm_xdl_skip_b_lds_fp16.cpp

[CK] Fix misc issues in CK examples (#2890 )

2025-09-24 11:28:20 -07:00

gemm_xdl_streamk.cpp

Remove CK_USE_AMD_MFMA_GFX950 (#1935 )

2025-03-04 10:32:25 -08:00

gemm_xdl_wavelet_fp16.cpp

Add a gpu gemm reference kernel (#1528 )

2024-10-08 11:05:28 -05:00

README.md

Universal streamk with atomics (#1360 )

2024-07-05 21:40:30 -07:00

run_gemm_example_streamk_v2.inc

Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) (#2145 )

2025-06-11 10:59:44 -07:00

run_gemm_example_streamk.inc

Add 0 as an acceptable arguement for strides in CK GEMM example (Issue 2037) (#2268 )

2025-06-03 07:26:58 -07:00

run_gemm_example_v2.inc

Mirchen/gemm blockscale wp segfault fix (#2638 )

2025-08-19 01:19:17 -07:00

run_gemm_example.inc

Verify HostTensorDescriptor when it is created (#2829 )

2025-09-25 18:22:13 -07:00

README.md

Instructions for `example_gemm_xdl`

Run `example_gemm_xdl`

#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_gemm_xdl 0 1 5

Instructions for `example_gemm_xdl_fp16_streamk_v3`

Run `example_gemm_xdl_fp16_streamk_v3`

arg1: verification (0=no, 1=yes)
arg2: initialization (0=no init, 1=integer value, 2=decimal value)
arg3: time kernel (0=no, 1=yes)
arg4 to 9: M (256x), N (128x), K (32x), StrideA, StrideB, StrideC
arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)
arg11: Grid_size (-1 for max occupancy)
bin/example_gemm_xdl_fp16_streamk_v3 1 2 1 3840 4096 4096 4096 4096 4096 1 -1
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
problem {M:3840, N:4096, K:4096, SA:4096, SB:4096, SC:4096, MP:4032, NP:4096, KRead:4096, KP:4096, AK0:512, BK0:2048, MBlock: 18, NBlock: 16, Stream-K Selection:1, Grid size:-1}
Perf: 0.292022 ms, 441.23 TFlops, 330.348 GB/s, DeviceGemmXdlUniversal<MNPadding, RRR> BlkSize: 256, BlkTile: 224x256x64, WaveTile: 16x16, WaveMap: 7x8, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
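
The figures in the `Perf:` line can be cross-checked by hand. The sketch below is not taken from the example's source: `gemm_perf` and `padded_len` are hypothetical helpers, and it assumes the usual 2*M*N*K flop count, one read each of A and B plus one write of C, and 2-byte (fp16) elements for all three matrices. Under those assumptions it reproduces the printed TFlops, GB/s, MP and NP values for the run above.

```cpp
#include <cassert>
#include <cstddef>

// Derived throughput figures for one GEMM run.
struct GemmPerf
{
    double tflops;
    double gb_per_s;
};

// Assumes C = A * B with A (MxK), B (KxN), C (MxN), all elements of the same
// size, each matrix touched exactly once. time_ms is the measured kernel time.
GemmPerf gemm_perf(std::size_t M, std::size_t N, std::size_t K,
                   std::size_t bytes_per_elem, double time_ms)
{
    assert(time_ms > 0.0);
    const double flop  = 2.0 * M * N * K; // one multiply and one add per MAC
    const double bytes = static_cast<double>(bytes_per_elem) * (M * K + K * N + M * N);
    // flop / (time_ms * 1e-3 s) / 1e12 == flop / 1e9 / time_ms
    return {flop / 1e9 / time_ms, bytes / 1e6 / time_ms};
}

// Rounds len up to a multiple of tile, as the padded sizes MP/NP suggest.
std::size_t padded_len(std::size_t len, std::size_t tile)
{
    return (len + tile - 1) / tile * tile;
}
```

For M=3840, N=4096, K=4096 and 0.292022 ms this gives roughly 441.23 TFlops and 330.35 GB/s. Padding M to the 224-row block tile yields 4032 (18 blocks), while N is already a multiple of 256 (16 blocks), consistent with `MP:4032`, `NP:4096`, `MBlock: 18` and `NBlock: 16` in the problem dump.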
