mirror of
https://github.com/amd/blis.git
synced 2026-05-12 18:15:37 +00:00
This change contains the following:
1. Downscale optimization fix
a. Similar to the downscale optimizations made for s32 and s16 gemm,
the following optimizations improve the downscale performance
of BF16 gemm.
b. The store to temporary float buffer can be avoided when k < KC
since intermediate accumulation will not be required for the
pc loop (only 1 iteration). The downscaled values (bf16) are
written directly to the output C matrix.
c. Within the micro-kernel when beta != 0, the bf16 data from the
original C output matrix is loaded to a register, converted to
float and beta scaling is applied on it at register level.
This removes the need, present in the previous design, to copy the
bf16 values to the temporary float buffer inside the jc loop.
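A minimal scalar sketch of the register-level beta scaling described in 1c, together with the direct bf16 write from 1b. This is not the kernel code from this change, which works on vector registers; the names `bf16_t`, `bf16_to_f32`, `f32_to_bf16` and `store_row_bf16` are hypothetical and used only for illustration. It assumes bf16 is stored as the upper 16 bits of an IEEE-754 binary32 value and uses round-to-nearest-even when narrowing.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t; /* hypothetical storage type for one bf16 value */

/* Widen bf16 to float: bf16 is the upper 16 bits of a binary32 value. */
static float bf16_to_f32(bf16_t v)
{
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float to bf16 with round-to-nearest-even.
   NaN handling is omitted to keep the sketch short. */
static bf16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)(bits >> 16);
}

/* acc holds the float accumulators for n outputs of one row of C.
   When beta != 0, the existing bf16 value of C is widened, scaled and
   added to the accumulator; the result is written straight back to C
   as bf16, without going through a temporary float buffer. */
static void store_row_bf16(bf16_t *c, const float *acc, float beta, int n)
{
    for (int j = 0; j < n; ++j) {
        float v = acc[j];
        if (beta != 0.0f)
            v += beta * bf16_to_f32(c[j]);
        c[j] = f32_to_bf16(v);
    }
}
```

For example, with C holding 1.0 and 2.0 in bf16, accumulators 0.5 and 1.0, and beta = 2, the row becomes 2.5 and 5.0 in bf16.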
2. Alpha scaling
a. Applying alpha scaling (a multiply instruction) unconditionally
caused a performance regression in the bf16 micro-kernels when the
k dimension is small and alpha = 1.
b. Alpha scaling is now done only when alpha != 1.
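The conditional alpha scaling can be sketched in scalar form as follows. The function name `scale_acc` is hypothetical; the real micro-kernels apply the multiply to vector accumulators.

```c
#include <stddef.h>

/* Scale the accumulators by alpha, skipping the multiply entirely
   when alpha == 1 so the common case costs nothing extra. */
static void scale_acc(float *acc, size_t n, float alpha)
{
    if (alpha == 1.0f)
        return; /* multiplying by 1 would not change the result */
    for (size_t i = 0; i < n; ++i)
        acc[i] *= alpha;
}
```

The saving matters most for small k, where the multiply is a larger share of the work done per output element.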
3. K Fringe optimization
a. Previously, memcpy was used in the K fringe case to load elements
from the A matrix in the micro-kernels.
b. Now, masked loads are used to load these elements without the
need for memcpy.
4. N LT 16 fringe optimization
a. Previously, memcpy was used in the N LT 16 fringe case in the
micro-kernels for storing the downscaled and non-downscaled outputs.
b. Now, masked stores are used to store the downscaled and
non-downscaled outputs of BF16 without the need for memcpy.
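A portable sketch of the masking idea behind the N LT 16 fringe stores. It emulates a 16-lane masked store in scalar C so that it runs anywhere; on AVX-512 hardware the same mask would typically be passed to a masked store intrinsic such as `_mm512_mask_storeu_ps` (float output) or `_mm256_mask_storeu_epi16` (16-bit output). The helper names `fringe_mask` and `masked_store_u16` are hypothetical, and the exact intrinsics used by the kernels in this change are not stated in this message.

```c
#include <stdint.h>

/* Build a lane mask with the low n_rem bits set, for 0 < n_rem < 16. */
static uint16_t fringe_mask(int n_rem)
{
    return (uint16_t)((1u << n_rem) - 1u);
}

/* Scalar emulation of a 16-lane masked store of 16-bit elements:
   only lanes whose mask bit is set are written, so memory beyond the
   fringe is never touched and no staging buffer or memcpy is needed. */
static void masked_store_u16(uint16_t *dst, const uint16_t lanes[16],
                             uint16_t mask)
{
    for (int lane = 0; lane < 16; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = lanes[lane];
    }
}
```

With n_rem = 5 the mask is 0x001F, so only the first five outputs are written and the remaining eleven destination elements keep their previous contents.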
5. Framework updates to avoid unnecessary pack buffer allocation
a. The default allocation of the temporary pack buffer is removed
and the pack buffer is now only allocated if k > KC.
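The allocation rule in item 5 can be sketched as below. This is an illustration only: `acquire_temp_buffer` and its parameters are hypothetical names, plain `malloc` stands in for whatever allocator the framework actually uses, and the buffer is sized as an m x n float array purely as an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return a temporary float buffer only when the k dimension spans more
   than one KC block (k > kc), i.e. when partial results must be carried
   across pc-loop iterations. Otherwise return NULL and allocate nothing.
   The caller owns the returned buffer and must free() it. */
static float *acquire_temp_buffer(size_t m, size_t n, size_t k, size_t kc)
{
    if (k <= kc)
        return NULL; /* single pc iteration: write to C directly */
    return malloc(m * n * sizeof(float));
}
```

A caller would treat a NULL return with k <= kc as the signal to take the direct-write path, and would need a separate check to tell that case apart from an allocation failure when k > kc.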
AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044