mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-11 17:00:18 +00:00
This PR optimizes the fp16 instances of the direct load GEMM kernel introduced in #999 and #1052. The new instances were benchmarked on a CDNA2 GPU against the best non-direct-load GEMM instances, using 76 different GEMM problems. On average, this change improves performance on the tested problems by 47%. For problems known to be latency-bound, the speedup is around 126%.