blis/kernels at f96e20b8940dad53c2cfc2097d4eda444ee1b04e - blis

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-12 10:05:38 +00:00

Files

Harsh Dave e437469a99 Optimized AVX2 DGEMM SUP edge kernels

- Edge kernels handle the corner cases, and in particular cases where
there is very little computation to be done, so executing FMAs
efficiently is crucial.

- In the previous implementation, edge kernels reused the same limited
number of vector registers to hold FMA results, which indirectly created
a dependency: each FMA had to wait for the previous one to complete
before the CPU could issue it.

- This commit addresses the issue by using the other available vector
registers to hold FMA results.

- FMA results are now held in two sets of vector registers, so that a
subsequent FMA does not have to wait for the previous FMA to complete.

- At the end of the unrolled K loop, the two sets of vector registers are
added together to store the correct result in the intended vector registers.
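
The idea in the points above can be illustrated with a minimal scalar sketch. This is not the BLIS kernel, which is written in AVX2 assembly; `dot_two_acc` is a hypothetical helper that assumes a plain dot product over `k` elements. It only shows how two independent accumulators break the dependency chain and are combined once after the unrolled loop.

```c
#include <stddef.h>

/* Hypothetical illustration, not part of BLIS: dot product of a and b
 * over k elements, unrolled by two with two independent accumulators. */
double dot_two_acc(const double *a, const double *b, size_t k)
{
    double acc0 = 0.0; /* first accumulator (stands in for one register set) */
    double acc1 = 0.0; /* second accumulator (stands in for the other set)   */
    size_t p = 0;

    /* Unrolled loop: the two updates do not depend on each other, so the
     * second multiply-add can issue before the first one has completed.
     * On targets with FMA support a compiler may fuse each update. */
    for (; p + 2 <= k; p += 2) {
        acc0 += a[p] * b[p];
        acc1 += a[p + 1] * b[p + 1];
    }

    /* Leftover iteration when k is odd. */
    if (p < k) {
        acc0 += a[p] * b[p];
    }

    /* Combine the two partial sums once, after the loop. */
    return acc0 + acc1;
}
```

Because the summation order differs from a single-accumulator loop, the result can differ in the last bits for general floating-point inputs.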

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54

2023-09-22 03:43:47 -04:00

armsve

New kernel set for Arm SVE using assembly (#396)

2020-05-21 11:56:45 +05:30

armv7a

Squash-merge 'pr' into 'squash'. (#457)

2020-11-14 09:39:48 -06:00

armv8a

Avoid loading twice in armv8a gemm kernel (#403)

2020-05-21 12:37:53 +05:30

bgq

Replaced use of bool_t type with C99 bool.