mirror of
https://github.com/amd/blis.git
synced 2026-05-13 10:35:38 +00:00
-As part of an earlier optimization, the memcpy function calls in the
 k fringe ((k % 4) != 0 case, to utilize the vpdpbusd instruction) and
 the n fringe (n < 16, beta scale and C store) were replaced with copy
 macros specifically optimized for fewer than 4 and fewer than 16
 elements, respectively. However, further analysis showed that masked
 load/broadcast and masked store perform better on average than the
 copy macros. The copy macros contain more if conditions, which leads
 to more branching and, in turn, to performance variations. Code
 generation also varied considerably across compilers when using the
 copy macros, because of the extra conditional code.

-As part of this change, the copy macros are completely replaced with
 masked load/broadcast/store. Performance was observed to be better
 and less prone to variation for the k fringe and n fringe (n < 16)
 cases.

AMD-Internal: [CPUPL-3173]
Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503
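
For illustration only, the n fringe idea described above (beta scale and C store for n < 16) might look roughly like the sketch below. It is not the code from this change: the function names, the int32 element type for C, and the runtime fallback are all assumptions made for the example. A single 16-bit mask built from the remainder replaces the per-size branches, so the load, the arithmetic, and the store are branch-free.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#else
#define SKETCH_HAVE_X86 0
#endif

/* Scalar reference: c[i] = beta * c[i] + acc[i] for the n_rem (< 16)
 * leftover columns. All names here are hypothetical. */
static void beta_scale_store_ref(int32_t *c, const int32_t *acc,
                                 int32_t beta, int n_rem)
{
    assert(n_rem >= 0 && n_rem < 16);
    for (int i = 0; i < n_rem; ++i)
        c[i] = beta * c[i] + acc[i];
}

#if SKETCH_HAVE_X86
/* Masked variant: one mask drives the loads and the store. Lanes that
 * are masked off are neither read from nor written to memory. */
__attribute__((target("avx512f")))
static void beta_scale_store_masked(int32_t *c, const int32_t *acc,
                                    int32_t beta, int n_rem)
{
    assert(n_rem >= 0 && n_rem < 16);
    /* Low n_rem bits set, e.g. n_rem = 5 gives 0x001F. */
    __mmask16 mask = (__mmask16)((1u << n_rem) - 1u);

    __m512i c_v    = _mm512_maskz_loadu_epi32(mask, c);
    __m512i acc_v  = _mm512_maskz_loadu_epi32(mask, acc);
    __m512i beta_v = _mm512_set1_epi32(beta);

    __m512i res = _mm512_add_epi32(_mm512_mullo_epi32(c_v, beta_v), acc_v);
    _mm512_mask_storeu_epi32(c, mask, res);
}
#endif

/* Dispatcher for the sketch: use the masked path when the CPU supports
 * AVX-512F, otherwise fall back to the scalar reference. */
static void beta_scale_store(int32_t *c, const int32_t *acc,
                             int32_t beta, int n_rem)
{
#if SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx512f")) {
        beta_scale_store_masked(c, acc, beta, n_rem);
        return;
    }
#endif
    beta_scale_store_ref(c, acc, beta, n_rem);
}
```

Calling beta_scale_store(c, acc, beta, n % 16) on the last partial column block updates only the first n % 16 elements of c and leaves the rest untouched, which is what the copy macros achieved with per-size conditionals.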