mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Haocong WANG 8c90f25be3 [GEMM] F8 GEMM, performance optimized. (#1384)

* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of project at f60fd77

* update copyright

* clang format

* update copyright

---------

Co-authored-by: root <jizhan@amd.com>

2024-07-19 22:06:52 +08:00

| File | Last commit | Date |
| --- | --- | --- |
| CMakeLists.txt | Universal streamk with atomics (#1360) | 2024-07-05 21:40:30 -07:00 |
| common.hpp | Universal streamk with atomics (#1360) | 2024-07-05 21:40:30 -07:00 |
| gemm_dl_fp16.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_dl_fp32.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_dl_int4.cpp | Fixing most of the cppcheck errors. (#1142) | 2024-01-24 13:47:48 -08:00 |
| gemm_dl_int8.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_dpp_fp16.cpp | Redesign the DPP8 GEMM kernel to use warp-wise component (#863) | 2023-09-06 11:44:09 -05:00 |
| gemm_wmma_fp16.cpp | Merging the gfx12 code into public repo. (#1362) | 2024-06-27 00:33:34 -07:00 |
| gemm_xdl_bf16_rtn.cpp | add an example of customized type convert - bfp16_rtn (#869) | 2023-08-29 12:31:24 -05:00 |
| gemm_xdl_bf16_v3.cpp | [GEMM] UniversalGemm update (#1262) | 2024-04-26 12:56:07 -05:00 |
| gemm_xdl_bf16.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_xdl_fp8_bf8.cpp | Fix example_gemm_xdl_fp8 (#1183) | 2024-03-01 16:42:15 -08:00 |
| gemm_xdl_fp8_v3.cpp | [GEMM] F8 GEMM, performance optimized. (#1384) | 2024-07-19 22:06:52 +08:00 |
| gemm_xdl_fp8.cpp | Fix example_gemm_xdl_fp8 (#1183) | 2024-03-01 16:42:15 -08:00 |
| gemm_xdl_fp16_fp8_v3.cpp | [GEMM] Gemm universal device operation (#1154) | 2024-04-13 21:03:18 -05:00 |
| gemm_xdl_fp16_fp8.cpp | Refactor tolerances for correctness check in gemm op (#1188) | 2024-03-08 12:05:05 -08:00 |
| gemm_xdl_fp16_streamk_v3.cpp | Universal streamk with atomics (#1360) | 2024-07-05 21:40:30 -07:00 |
| gemm_xdl_fp16_v2.cpp | [GEMM] Optimization for MI200/300. (#1135) | 2024-01-19 07:02:22 -06:00 |
| gemm_xdl_fp16_v3.cpp | [GEMM] Gemm universal device operation (#1154) | 2024-04-13 21:03:18 -05:00 |
| gemm_xdl_fp16.cpp | Fix cluster length arrange order in fp16 GEMM example (#1055) | 2023-11-27 11:31:14 +01:00 |
| gemm_xdl_fp64.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_xdl_int4.cpp | Fixing most of the cppcheck errors. (#1142) | 2024-01-24 13:47:48 -08:00 |
| gemm_xdl_int8.cpp | update copyright headers (#726) | 2023-05-31 18:46:57 -05:00 |
| gemm_xdl_lds_direct_load_fp16.cpp | Add basic support for direct loads from global to LDS (#999) | 2023-11-25 13:35:22 +01:00 |
| gemm_xdl_lds_direct_load_fp32.cpp | Add basic support for direct loads from global to LDS (#999) | 2023-11-25 13:35:22 +01:00 |
| gemm_xdl_skip_b_lds_fp16.cpp | Disable XDL kernels on unsupported HW Add ck::is_xdl_supported (#768) | 2023-07-26 07:19:55 -07:00 |
| gemm_xdl_streamk.cpp | initial stream-k implementation with example (#699) | 2023-07-26 14:18:15 -05:00 |
| gemm_xdl_wavelet_fp16.cpp | Move Device Ops implementations into impl directory. (#777) | 2023-07-06 16:15:51 +02:00 |
| README.md | Universal streamk with atomics (#1360) | 2024-07-05 21:40:30 -07:00 |
| run_gemm_example_streamk_v2.inc | Universal streamk with atomics (#1360) | 2024-07-05 21:40:30 -07:00 |
| run_gemm_example_v2.inc | [GEMM] UniversalGemm update (#1262) | 2024-04-26 12:56:07 -05:00 |
| run_gemm_example.inc | Merging the gfx12 code into public repo. (#1362) | 2024-06-27 00:33:34 -07:00 |

README.md

Instructions for `example_gemm_xdl`

Run `example_gemm_xdl`

#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: number of times to run the kernel (>1)
./bin/example_gemm_xdl 0 1 5
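
The three positional arguments above could be validated along the following lines. This is a minimal sketch, not the example's actual parser: `GemmExampleArgs` and `parse_gemm_example_args` are hypothetical names, and the range checks (including treating arg3 as any positive count) are assumptions read off the argument descriptions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the three positional arguments listed above.
struct GemmExampleArgs
{
    bool do_verification; // arg1: 0=no, 1=yes
    int init_method;      // arg2: 0=no init, 1=integer value, 2=decimal value
    int run_count;        // arg3: number of kernel runs
};

// Parses arguments such as {"0", "1", "5"}.
// Throws std::invalid_argument on a wrong count or an out-of-range value.
inline GemmExampleArgs parse_gemm_example_args(const std::vector<std::string>& args)
{
    if(args.size() != 3)
        throw std::invalid_argument("expected exactly 3 arguments");

    const int verify = std::stoi(args[0]);
    if(verify != 0 && verify != 1)
        throw std::invalid_argument("arg1 must be 0 or 1");

    const int init = std::stoi(args[1]);
    if(init < 0 || init > 2)
        throw std::invalid_argument("arg2 must be 0, 1 or 2");

    // Assumption: any positive run count is accepted here.
    const int runs = std::stoi(args[2]);
    if(runs < 1)
        throw std::invalid_argument("arg3 must be positive");

    return GemmExampleArgs{verify == 1, init, runs};
}
```

For the command line shown above, `parse_gemm_example_args({"0", "1", "5"})` would yield no verification, integer initialization, and five kernel runs.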

Instructions for `example_gemm_xdl_fp16_streamk_v3`

Run `example_gemm_xdl_fp16_streamk_v3`

arg1: verification (0=no, 1=yes)
arg2: initialization (0=no init, 1=integer value, 2=decimal value)
arg3: time kernel (0=no, 1=yes)
arg4 to 9: M (256x), N (128x), K (32x), StrideA, StrideB, StrideC
arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)
arg11: Grid_size (-1 for max occupancy)
bin/example_gemm_xdl_fp16_streamk_v3 1 2 1 3840 4096 4096 4096 4096 4096 1 -1
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
problem {M:3840, N:4096, K:4096, SA:4096, SB:4096, SC:4096, MP:4032, NP:4096, KRead:4096, KP:4096, AK0:512, BK0:2048, MBlock: 18, NBlock: 16, Stream-K Selection:1, Grid size:-1}
Perf: 0.292022 ms, 441.23 TFlops, 330.348 GB/s, DeviceGemmXdlUniversal<MNPadding, RRR> BlkSize: 256, BlkTile: 224x256x64, WaveTile: 16x16, WaveMap: 7x8, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
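
The padded sizes and throughput figures in the log above can be cross-checked with a few lines of arithmetic. The sketch below assumes the usual GEMM conventions: 2·M·N·K floating-point operations, one read of A and B plus one write of C, 2 bytes per fp16 element, and padding of M and N up to the next multiple of the block tile (224x256 in this run). The helper names `ceil_div`, `GemmPerf`, and `gemm_perf` are hypothetical and are not taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of tiles needed to cover `size` elements with tiles of `tile` elements.
constexpr std::int64_t ceil_div(std::int64_t size, std::int64_t tile)
{
    return (size + tile - 1) / tile;
}

struct GemmPerf
{
    double tflops;   // 1e12 floating-point operations per second
    double gb_per_s; // 1e9 bytes per second
};

// Assumption: flop = 2*M*N*K and traffic = A + B + C, each touched once.
inline GemmPerf gemm_perf(std::int64_t M,
                          std::int64_t N,
                          std::int64_t K,
                          std::size_t bytes_per_elem,
                          double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    const double m = static_cast<double>(M);
    const double n = static_cast<double>(N);
    const double k = static_cast<double>(K);

    const double flop  = 2.0 * m * n * k;
    const double bytes = static_cast<double>(bytes_per_elem) * (m * k + k * n + m * n);

    // flop / 1e9 / ms equals TFLOP/s; bytes / 1e6 / ms equals GB/s.
    return GemmPerf{flop / 1.0e9 / elapsed_ms, bytes / 1.0e6 / elapsed_ms};
}
```

With M=3840, N=4096, K=4096 and 0.292022 ms, these formulas give roughly 441.23 TFlops and 330.35 GB/s, in line with the `Perf` line. Likewise `ceil_div(3840, 224)` is 18 and 18 x 224 = 4032, which agrees with `MBlock` and `MP` in the `problem` line.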
