mirror of
https://github.com/amd/blis.git
synced 2026-05-12 10:05:38 +00:00
- Edge kernels handle the corner cases, often with very little computation to be done, so executing FMA instructions efficiently is crucial there.
- In the previous implementation, the edge kernels accumulated FMA results in the same limited set of vector registers. Each FMA therefore depended on the previous one, and the CPU could not issue a new FMA until the previous one had completed.
- This commit addresses the issue by using additional available vector registers to hold FMA results.
- FMA results are now held in two sets of vector registers, so a subsequent FMA does not have to wait for the previous FMA to complete.
- At the end of the unrolled K loop, the two sets of vector registers are added together so that the correct result is stored in the intended vector registers.

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
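
The idea described above can be illustrated with a minimal scalar sketch in C. It is not the kernel code from this commit: the function names `dot_one_acc` and `dot_two_acc` are hypothetical, and plain `double` multiply-adds stand in for the vector FMA instructions and vector registers used by the real edge kernels.

```c
#include <stddef.h>

/* Single accumulator: every multiply-add reads the result of the
 * previous one, so the operations form one serial dependency chain. */
double dot_one_acc(const double *a, const double *b, size_t k)
{
    double acc = 0.0;
    for (size_t i = 0; i < k; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Two accumulators: the K loop is unrolled by 2 and each half of the
 * unrolled body writes to its own accumulator, so the two multiply-adds
 * in one iteration are independent of each other. */
double dot_two_acc(const double *a, const double *b, size_t k)
{
    double acc0 = 0.0;
    double acc1 = 0.0;
    size_t i = 0;

    for (; i + 1 < k; i += 2) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }

    /* Leftover iteration when k is odd. */
    if (i < k)
        acc0 += a[i] * b[i];

    /* Combine the two partial sums after the unrolled loop. */
    return acc0 + acc1;
}
```

Both functions compute the same dot product; the second only changes the order in which partial sums are accumulated. With floating-point data this reordering can change the rounding of the final result slightly, which is the usual trade-off of splitting an accumulation across several registers.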