mirror of
https://github.com/amd/blis.git
synced 2026-06-05 20:23:58 +00:00
Problem statement: Improve the performance of the zgemm kernel for inputs with k=1 by fine-tuning its previous implementation.

In the previous implementation, applying SIMD parallelism along the m and n dimensions instead of the k dimension gave the zgemm kernel better performance. This patch improves that code further along the following lines:

- The case alpha=0 and beta!=0 (i.e. only scaling of C) is handled separately at the beginning, using the bli_zscalm API.
- Register blocking was improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added, with the offset added to the pointer chosen empirically. Overall, it gave a mild performance improvement.
- Conditional statements were removed from the kernel loop, using logic that allows their removal without affecting the output.

The single-threaded implementation also proved competitive with the default implementation running on multiple threads, as long as m and n are under 128.

A possible improvement to this patch would be to find a suitable feature that relates the number of threads to the input size constraints, giving a distinct size constraint for each thread count.

AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
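
For reference, with k=1 the operation C := beta*C + alpha*A*B reduces to a rank-1 update of C by one column of A and one row of B. The scalar sketch below illustrates that reduction and the early alpha=0 path only. It is not the kernel from this patch: the name zgemm_k1_ref, the column-major layout and the argument list are assumptions made for illustration, and it contains no SIMD, register blocking or prefetching.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical scalar reference for zgemm with k = 1 (not the BLIS kernel).
 * Assumed layout: C is m x n, column-major, leading dimension ldc;
 * a holds the m entries of the single column of A,
 * b holds the n entries of the single row of B. */
static void zgemm_k1_ref(size_t m, size_t n,
                         double complex alpha,
                         const double complex *a,
                         const double complex *b,
                         double complex beta,
                         double complex *c, size_t ldc)
{
    /* alpha == 0: the product term vanishes, so only scale C.
     * The patch delegates this case to bli_zscalm; a plain loop
     * stands in for it here. */
    if (alpha == 0.0) {
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
                c[i + j * ldc] *= beta;
        return;
    }

    for (size_t j = 0; j < n; ++j) {
        /* Hoist alpha * b[j] so the inner loop carries no branches. */
        const double complex ab = alpha * b[j];
        for (size_t i = 0; i < m; ++i)
            c[i + j * ldc] = beta * c[i + j * ldc] + a[i] * ab;
    }
}
```

The inner loop runs along m and the outer loop along n, which are the two dimensions the patch vectorizes, since k=1 leaves nothing to parallelize along k. This sketch does not special-case beta=0.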