composable_kernel/library/src
Bartlomiej Wroblewski 9f2d90a8b6 Optimize fp16 direct load GEMM instances (#1086)
This PR optimizes the fp16 instances of the direct load GEMM kernel introduced in #999 and #1052.

Measured the performance of the new instances on a CDNA2 GPU and compared it against the best non-direct-load GEMM instances across 76 different GEMM problems.
On average, this change improves performance on the tested problems by 47%. For latency-bound cases, the speedup is around 126%.

[ROCm/composable_kernel commit: ad0a8e4cd2]
2023-12-18 11:09:10 +01:00