mirror of
https://github.com/amd/blis.git
synced 2026-05-11 17:50:00 +00:00
- Currently, memcpy is used in the u8s8s32 micro-kernel for copying in the k fringe loop ((k % 4) != 0) and in the NR' < 16 fringe kernels. However, for small k/n dimensions, the memcpy invocation has high overhead.
- This is fixed by replacing memcpy with a macro-based copy routine that is specifically optimized for the sizes encountered in the fringe cases (k < 4, NR' < 16).

AMD-Internal: [CPUPL-3008]
Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc
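
For illustration only, a minimal sketch of what a macro-based copy for the k fringe case (fewer than 4 bytes) could look like. The names COPY_LT4_BYTES and pack_k_fringe are hypothetical and are not taken from this change; the zero-padded 4-byte group is likewise an assumption about how the leftover uint8 values are consumed. The idea is that a switch with fall-through expands inline to at most three byte stores, so no function call is made for these tiny sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro (name not from the commit): copies n bytes,
 * 0 <= n < 4, using at most three byte stores and no memcpy call. */
#define COPY_LT4_BYTES(dst, src, n)                      \
    do {                                                 \
        uint8_t       *d_ = (uint8_t *)(dst);            \
        const uint8_t *s_ = (const uint8_t *)(src);      \
        switch (n) {                                     \
        case 3: d_[2] = s_[2]; /* fall through */        \
        case 2: d_[1] = s_[1]; /* fall through */        \
        case 1: d_[0] = s_[0]; /* fall through */        \
        default: break;                                  \
        }                                                \
    } while (0)

/* Hypothetical helper: gathers the k_rem = k % 4 leftover uint8 values
 * into a 4-byte group, with the unused trailing bytes set to zero
 * (assumed layout, chosen for this sketch). */
static void pack_k_fringe(uint8_t out[4], const uint8_t *a, size_t k_rem)
{
    assert(k_rem < 4);
    out[0] = out[1] = out[2] = out[3] = 0;
    COPY_LT4_BYTES(out, a, k_rem);
}
```

The same pattern could be extended to the NR' < 16 case, for example by splitting the count into 8-, 4-, 2- and 1-byte pieces; that variant is not shown here.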