mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Files

Haocong WANG f83e9701e9 [GEMM] Gemm universal device operation (#1154 )

* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3. Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM  mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B split schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>

2024-04-13 21:03:18 -05:00

CMakeLists.txt

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

common.hpp

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

gemm_dl_fp16.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_dl_fp32.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_dl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_dl_int8.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_dpp_fp16.cpp

Redesign the DPP8 GEMM kernel to use warp-wise component (#863 )

2023-09-06 11:44:09 -05:00

gemm_wmma_fp16.cpp

Navi3 rel (#1176 )

2024-03-08 17:11:51 -08:00

gemm_xdl_bf16_rtn.cpp

add an example of customized type convert - bfp16_rtn (#869 )

2023-08-29 12:31:24 -05:00

gemm_xdl_bf16.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_xdl_fp8_bf8.cpp

Fix example_gemm_xdl_fp8 (#1183 )

2024-03-01 16:42:15 -08:00

gemm_xdl_fp8_v3.cpp

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

gemm_xdl_fp8.cpp

Fix example_gemm_xdl_fp8 (#1183 )

2024-03-01 16:42:15 -08:00

gemm_xdl_fp16_fp8_v3.cpp

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

gemm_xdl_fp16_fp8.cpp

Refactor tolerances for correctness check in gemm op (#1188 )

2024-03-08 12:05:05 -08:00

gemm_xdl_fp16_v2.cpp

[GEMM] Optimization for MI200/300. (#1135 )

2024-01-19 07:02:22 -06:00

gemm_xdl_fp16_v3.cpp

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

gemm_xdl_fp16.cpp

Fix cluster length arrange order in fp16 GEMM example (#1055 )

2023-11-27 11:31:14 +01:00

gemm_xdl_fp64.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_xdl_int4.cpp

Fixing most of the cppcheck errors. (#1142 )

2024-01-24 13:47:48 -08:00

gemm_xdl_int8.cpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

gemm_xdl_lds_direct_load_fp16.cpp

Add basic support for direct loads from global to LDS (#999 )

2023-11-25 13:35:22 +01:00

gemm_xdl_lds_direct_load_fp32.cpp

Add basic support for direct loads from global to LDS (#999 )

2023-11-25 13:35:22 +01:00

gemm_xdl_skip_b_lds_fp16.cpp

Disable XDL kernels on unsupported HW Add ck::is_xdl_supported (#768 )

2023-07-26 07:19:55 -07:00

gemm_xdl_streamk.cpp

initial stream-k implementation with example (#699 )

2023-07-26 14:18:15 -05:00

gemm_xdl_wavelet_fp16.cpp

Move Device Ops implementations into impl directory. (#777 )

2023-07-06 16:15:51 +02:00

README.md

Compile for gfx908 and gfx90a (#130 )

2022-03-31 12:33:34 -05:00

run_gemm_example_v2.inc

[GEMM] Gemm universal device operation (#1154 )

2024-04-13 21:03:18 -05:00

run_gemm_example.inc

Navi3 rel (#1176 )

2024-03-08 17:11:51 -08:00

README.md

Instructions for `example_gemm_xdl`

Run `example_gemm_xdl`

#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_gemm_xdl 0 1 5
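
The three positional arguments above could be read and validated along the following lines. This is only a sketch under stated assumptions, not the code used by `example_gemm_xdl`: the `ExampleArgs` struct and the `parse_example_args` function are hypothetical names introduced for illustration, and the range checks simply mirror the comments in the command listing.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical container for the three positional arguments (not part of the library).
struct ExampleArgs
{
    bool do_verification = false; // arg1: 0 = no, 1 = yes
    int init_method      = 0;     // arg2: 0 = no init, 1 = integer value, 2 = decimal value
    int num_runs         = 0;     // arg3: number of times the kernel is run
};

// Hypothetical parser: expects exactly three arguments after the program name.
// Throws std::invalid_argument when the count or a value is not acceptable.
inline ExampleArgs parse_example_args(int argc, const char* const* argv)
{
    if(argc != 4)
        throw std::invalid_argument("expected: <verification> <initialization> <runs>");

    const int verify = std::stoi(argv[1]);
    const int init   = std::stoi(argv[2]);
    const int runs   = std::stoi(argv[3]);

    if(verify != 0 && verify != 1)
        throw std::invalid_argument("arg1 (verification) must be 0 or 1");
    if(init < 0 || init > 2)
        throw std::invalid_argument("arg2 (initialization) must be 0, 1 or 2");
    if(runs <= 1) // the listing above asks for more than one run
        throw std::invalid_argument("arg3 (runs) must be greater than 1");

    ExampleArgs args;
    args.do_verification = (verify == 1);
    args.init_method     = init;
    args.num_runs        = runs;
    return args;
}
```

Called with the arguments from the command above (`0 1 5`), the sketch yields verification off, integer initialization, and five runs.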

Result (MI100 @ 1087 MHz, 133.5 TFlops peak FP16)

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.19685 ms, 107.657 TFlops, 78.8501 GB/s
