blis/kernels at fde812015fc04c213eeb00a8fb8d1472eff32bea - blis

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-06-05 20:23:58 +00:00

Vignesh Balasubramanian 808d79a610 Implemented efficient ZGEMM algorithm when k=1

Problem statement:
Improve the performance of the zgemm kernel for inputs with k=1 by fine-tuning its previous implementation.
The previous implementation showed that applying SIMD parallelism along the m and n dimensions, rather than the k dimension, gives better zgemm performance. This patch improves that code further along the following lines:

- The case alpha=0 and beta!=0 (i.e. only scaling C) is handled separately at the beginning, using the bli_zscalm API.
- Register blocking was improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added, with the offset added to the pointer chosen empirically. Overall, it gave a mild performance improvement.
- Conditional statements were removed from the kernel loop, using logic that allows their removal without affecting the output.
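
For orientation, with k=1 the product A*B reduces to an outer product of one column of A and one row of B, which is why vectorizing along m and n is the natural choice. The following is a minimal scalar sketch of that computation, not the kernel from this commit: the function name `zgemm_k1_ref`, its argument layout (column-major C, row of B with stride `ldb`), and the plain loop standing in for bli_zscalm are all assumptions made for illustration. It omits register blocking, SIMD, and prefetching.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of BLIS): computes
 * C := beta*C + alpha * a * b^T for the k = 1 case, where
 *   a is the single column of A (m elements, contiguous),
 *   b is the single row of B (n elements, stride ldb),
 *   C is m x n, column-major, with leading dimension ldc. */
void zgemm_k1_ref(size_t m, size_t n,
                  double complex alpha,
                  const double complex *a,
                  const double complex *b, size_t ldb,
                  double complex beta,
                  double complex *c, size_t ldc)
{
    assert(ldc >= m);

    /* alpha == 0: the product term vanishes, so only scale C.
     * Handling this up front keeps the branch out of the main loop.
     * (The commit uses bli_zscalm here; this loop is a stand-in.) */
    if (alpha == 0.0) {
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
                c[i + j * ldc] *= beta;
        return;
    }

    /* Rank-1 update: every element of C is independent of the others,
     * so both the i (m) and j (n) loops can be vectorized. */
    for (size_t j = 0; j < n; ++j) {
        /* alpha * b[j] is constant down a column; hoist it. */
        const double complex ab = alpha * b[j * ldb];
        for (size_t i = 0; i < m; ++i)
            c[i + j * ldc] = beta * c[i + j * ldc] + a[i] * ab;
    }
}
```

The sketch does not special-case beta=0, so NaN or Inf values already present in C would propagate. A production kernel would typically overwrite C in that case.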

This single-threaded implementation also proved competitive with the default multithreaded implementation, as long as m and n are under 128. A possible improvement to this patch would be to find a heuristic relating the number of threads to the input size constraints, thereby giving a distinct size constraint for each thread count.
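
The size constraint described above can be pictured as a simple dispatch predicate. This is a sketch under stated assumptions: the name `zgemm_k1_path_applies` and the constant `K1_MAX_DIM` are hypothetical, and the strict "under 128" bound is taken from the message text rather than from the source code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold, taken from the commit message: the k = 1
 * path stays competitive while m and n are both under 128. */
#define K1_MAX_DIM 128

/* Hypothetical predicate (not a BLIS function): returns true when a
 * problem of size m x n x k would be routed to the k = 1 kernel. */
bool zgemm_k1_path_applies(size_t m, size_t n, size_t k)
{
    return k == 1 && m < K1_MAX_DIM && n < K1_MAX_DIM;
}
```

A thread-aware variant, as suggested in the message, would replace the fixed constant with a bound that depends on the thread count. The message does not specify what that relationship would be.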

AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847

2022-07-28 02:09:45 -04:00

armsve

New kernel set for Arm SVE using assembly (#396)

2020-05-21 11:56:45 +05:30

armv7a

Squash-merge 'pr' into 'squash'. (#457)

2020-11-14 09:39:48 -06:00

armv8a

avoid loading twice in armv8a gemm kernel (#403)

2020-05-21 12:37:53 +05:30

bgq

Replaced use of bool_t type with C99 bool.