blis/kernels
Harsh Dave e437469a99 Optimized AVX2 DGEMM SUP edge kernels
- Edge kernels handle the corner cases. In these cases, and especially
where only a small amount of computation is done, executing FMAs
efficiently becomes crucial.

- In the previous implementation, the edge kernels reused the same limited
number of vector registers to hold FMA results. This creates a dependency:
each FMA must wait for the previous FMA to complete before the CPU can
issue it.

- This commit addresses this issue by using the other available vector
registers to hold FMA results.

- That way FMA results are held in two sets of vector registers, so that a
subsequent FMA does not have to wait for the previous FMA to complete.

- At the end of the unrolled K loop, these two sets of vector registers are
added together so that the correct result is stored in the intended vector
registers.
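
The idea described above can be sketched in plain scalar C. This is an illustration only, not the kernel code from this commit: the function names dot_one_acc and dot_two_acc are hypothetical, and the real kernels use AVX2 vector registers and FMA instructions rather than scalar doubles. The first function keeps one accumulator, so each multiply-add depends on the previous one. The second unrolls the K loop by two and keeps two independent accumulators that are added together once at the end.

```c
#include <stddef.h>

/* Single accumulator: every multiply-add depends on the previous one,
   so the chain is limited by the latency of each operation. */
double dot_one_acc(const double *a, const double *b, size_t k)
{
    double acc = 0.0;
    for (size_t p = 0; p < k; ++p)
        acc += a[p] * b[p];
    return acc;
}

/* Two accumulators (hypothetical sketch): the K loop is unrolled by 2.
   Even and odd iterations feed independent dependency chains, which are
   summed once after the loop. */
double dot_two_acc(const double *a, const double *b, size_t k)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t p = 0;
    for (; p + 1 < k; p += 2) {
        acc0 += a[p] * b[p];         /* chain 0 */
        acc1 += a[p + 1] * b[p + 1]; /* chain 1, independent of chain 0 */
    }
    if (p < k)                       /* leftover iteration when k is odd */
        acc0 += a[p] * b[p];
    return acc0 + acc1;              /* combine the two partial sums */
}
```

Splitting the sum changes the order of the floating-point additions, so in general the two versions may differ in the last bits of the result. For inputs whose products and partial sums are exactly representable, both return the same value.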

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
2023-09-22 03:43:47 -04:00