mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 17:26:00 +00:00

Files

Haocong WANG 0cfda84d05 [Navi3x] Add Device Operations (#567)

* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* format
* Add arch limitation to all wmma examples
* fix bug: example30 input conv args

2023-02-15 11:50:51 -06:00

| File | Last commit | Date |
| --- | --- | --- |
| CMakeLists.txt | [Navi3x] Add Device Operations (#567) | 2023-02-15 11:50:51 -06:00 |
| common.hpp | Add examples of Gemm (data type: int4) (#367) | 2022-08-23 18:25:05 -05:00 |
| gemm_dl_fp16.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_dl_fp32.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_dl_int4.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_dl_int8.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_wmma_fp16.cpp | [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) | 2023-01-16 20:06:01 -06:00 |
| gemm_xdl_bf16.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_xdl_fp16.cpp | Wavelet (inter-wave consumer-producer) GEMM (#310) | 2023-01-18 12:00:02 -06:00 |
| gemm_xdl_fp64.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_xdl_int4.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_xdl_int8.cpp | Refactor device op implementations into impl subdirectory. (#420) | 2022-10-13 09:05:08 -05:00 |
| gemm_xdl_skip_b_lds_fp16.cpp | Rangify constructor of HostTensorDescriptor & Tensor<> (#445) | 2022-11-11 11:36:01 -06:00 |
| gemm_xdl_wavelet_fp16.cpp | Wavelet (inter-wave consumer-producer) GEMM (#310) | 2023-01-18 12:00:02 -06:00 |
| README.md | Compile for gfx908 and gfx90a (#130) | 2022-03-31 12:33:34 -05:00 |
| run_gemm_example.inc | Rangify constructor of HostTensorDescriptor & Tensor<> (#445) | 2022-11-11 11:36:01 -06:00 |

README.md

Instructions for `example_gemm_xdl`

Run `example_gemm_xdl`

#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: number of times to run the kernel (>1)
./bin/example_gemm_xdl 0 1 5
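
The three positional arguments above could be validated along the following lines. This is only a sketch under stated assumptions, not the parsing code used by the example itself: `ExampleArgs`, `parse_int`, and `parse_example_args` are hypothetical names introduced here, and the accepted ranges simply mirror the comments above. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical container for the three positional arguments listed above.
struct ExampleArgs
{
    bool do_verification; // arg1: 0 = no, 1 = yes
    int init_method;      // arg2: 0 = no init, 1 = integer values, 2 = decimal values
    int num_runs;         // arg3: number of kernel runs
};

// Parses the whole string as a base-10 int; returns std::nullopt on any error.
inline std::optional<int> parse_int(const std::string& text)
{
    try
    {
        std::size_t consumed = 0;
        const int value      = std::stoi(text, &consumed);
        if(consumed != text.size())
            return std::nullopt; // trailing characters such as "5x"
        return value;
    }
    catch(const std::logic_error&) // std::invalid_argument or std::out_of_range
    {
        return std::nullopt;
    }
}

// Returns the parsed arguments, or std::nullopt if the count or any value is invalid.
inline std::optional<ExampleArgs> parse_example_args(int argc, const char* const* argv)
{
    assert(argv != nullptr);
    if(argc != 4)
        return std::nullopt; // program name + 3 arguments

    const auto verify = parse_int(argv[1]);
    const auto init   = parse_int(argv[2]);
    const auto runs   = parse_int(argv[3]);
    if(!verify || !init || !runs)
        return std::nullopt;

    if(*verify < 0 || *verify > 1)
        return std::nullopt;
    if(*init < 0 || *init > 2)
        return std::nullopt;
    if(*runs <= 1)
        return std::nullopt; // the usage text asks for more than one run

    return ExampleArgs{*verify == 1, *init, *runs};
}
```

A caller would typically check the result in `main`, for instance `if(const auto args = parse_example_args(argc, argv)) { ... }`, and print the usage text otherwise.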

Result (MI100 @ 1087 MHz, 133.5 TFLOPS peak FP16)

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.19685 ms, 107.657 TFlops, 78.8501 GB/s
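
Several numbers in this log can be cross-checked from the problem size alone. The sketch below assumes the usual convention of counting a GEMM as 2·M·N·K floating-point operations; all helper names are hypothetical. The 256x128 output tile per workgroup is also an assumption: it is one shape consistent with `grid_dim {480, 1, 1}`, and the log does not state the tile size.

```cpp
#include <cassert>
#include <cstdint>

namespace gemm_log {

// Problem size, K1 and timing copied from the log above.
constexpr std::int64_t M  = 3840;
constexpr std::int64_t N  = 4096;
constexpr std::int64_t K  = 4096;
constexpr std::int64_t K1 = 8;
constexpr double avg_time_ms = 1.19685;
constexpr double peak_tflops = 133.5; // peak FP16 figure quoted in the heading

// The K dimension is split as K = K0 * K1 in the grid descriptors.
constexpr std::int64_t k0_length(std::int64_t k, std::int64_t k1) { return k / k1; }

// Number of workgroups if each one computes a tile_m x tile_n block of C.
constexpr std::int64_t num_blocks(std::int64_t m, std::int64_t n,
                                  std::int64_t tile_m, std::int64_t tile_n)
{
    return ((m + tile_m - 1) / tile_m) * ((n + tile_n - 1) / tile_n);
}

// One multiply and one add per (m, n, k) triple.
constexpr double gemm_flop(std::int64_t m, std::int64_t n, std::int64_t k)
{
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// flop / (time_ms * 1e-3 s) / 1e12 simplifies to flop / 1e9 / time_ms.
inline double tflops(double flop, double time_ms)
{
    assert(time_ms > 0.0);
    return flop / 1.0e9 / time_ms;
}

static_assert(k0_length(K, K1) == 512, "matches the K0 length of both grid descriptors");
static_assert(num_blocks(M, N, 256, 128) == 480, "matches grid_dim under the assumed tile");

} // namespace gemm_log
```

With the values from the log, `gemm_log::tflops(gemm_log::gemm_flop(M, N, K), avg_time_ms)` reproduces the reported throughput to within rounding, which is roughly 80% of the quoted 133.5 TFLOPS peak.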
