blis/addon at 5f5bc2498937d7ac5a64ff97fb48464e4dc4005a - blis

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-12 18:15:37 +00:00


eashdash 061a68ff0d BF16 Downscale and Performance fix for bf16 API

This change contains the following:

1. Downscale optimization fix
   a. Similar to the downscale optimizations made for s32 and s16 gemm,
      the following optimizations improve downscale performance for
      BF16 gemm.
   b. The store to temporary float buffer can be avoided when k < KC
      since intermediate accumulation will not be required for the
      pc loop (only 1 iteration). The downscaled values (bf16) are
      written directly to the output C matrix.
   c. Within the micro-kernel, when beta != 0, the bf16 data from the
      original C output matrix is loaded into a register, converted to
      float, and scaled by beta at register level. This removes the
      previous design's need to copy the bf16 values into the temporary
      float buffer inside the jc loop.
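
The register-level beta scaling in 1c can be illustrated with a scalar
sketch. This is not the micro-kernel code from this commit: the helper names
(bf16_to_f32, f32_to_bf16_trunc, beta_scale_and_store) are hypothetical, bf16
is assumed to be stored as the upper 16 bits of an IEEE-754 float, and the
narrowing step truncates for simplicity, whereas a real kernel would normally
round.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a bf16 value (held in a uint16_t) to float.
   Assumes bf16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t v)
{
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow float to bf16 by truncation.
   Simplification: production kernels typically round instead. */
static uint16_t f32_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* acc[] holds the float accumulation of A*B for one row of the tile.
   The old C value is read as bf16, widened, scaled by beta, added to
   the accumulator, and the result is written straight back as bf16,
   with no intermediate float copy of C. */
static void beta_scale_and_store(uint16_t *c, const float *acc,
                                 float beta, int n)
{
    for (int i = 0; i < n; ++i) {
        float out = acc[i];
        if (beta != 0.0f)
            out += beta * bf16_to_f32(c[i]);
        c[i] = f32_to_bf16_trunc(out);
    }
}
```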

2. Alpha scaling
   a. Applying alpha scaling (a multiply instruction) unconditionally
      caused a performance regression in the bf16 micro-kernels when
      the k dimension is small and alpha = 1.
   b. Alpha scaling is now done only when alpha != 1.
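
A minimal scalar sketch of the guard described in 2b, assuming the
accumulators are held as floats. The function name scale_by_alpha is
hypothetical and does not come from the BLIS sources.

```c
#include <stddef.h>

/* Multiply the accumulators by alpha only when it changes the result.
   With alpha == 1 the multiply is skipped entirely, which is the case
   the commit message identifies as costly for small k. */
static void scale_by_alpha(float *acc, size_t n, float alpha)
{
    if (alpha == 1.0f)
        return;
    for (size_t i = 0; i < n; ++i)
        acc[i] *= alpha;
}
```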

3. K Fringe optimization
   a. Previously, memcpy was used for the K fringe case to load
      elements from the A matrix in the microkernels.
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs without the need to use
      memcpy functions.

4. N LT-16 fringe optimization
   a. Previously, memcpy was used for the N LT 16 fringe case in the
      microkernels for storing the downscaled and non-downscaled output.
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs of BF16 without the need to use
      memcpy functions.
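
The idea behind the masked store for the N LT 16 fringe can be sketched in
portable C. The names fringe_mask and masked_store_f32 are hypothetical, and
the loop below only emulates a 16-lane masked store in scalar code; on
AVX-512 hardware the float case corresponds to an intrinsic such as
_mm512_mask_storeu_ps. The actual kernels in this commit may differ.

```c
#include <stdint.h>

/* Build a 16-bit lane mask with the low n_rem bits set.
   Intended for the fringe case 0 < n_rem < 16. */
static uint16_t fringe_mask(unsigned n_rem)
{
    return (uint16_t)((1u << n_rem) - 1u);
}

/* Scalar stand-in for a 16-lane masked store: lane i of src is
   written to dst only if bit i of mask is set, so memory past the
   fringe is never touched and no staging buffer or memcpy is needed. */
static void masked_store_f32(float *dst, uint16_t mask,
                             const float src[16])
{
    for (unsigned lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            dst[lane] = src[lane];
}
```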

5. Framework updates to avoid unnecessary pack buffer allocation
   a. The default allocation of the temporary pack buffer is removed
      and the pack buffer is now only allocated if k > KC.
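
A hedged sketch of the allocation rule in 5a. The function name
maybe_alloc_temp_buffer and its parameters are hypothetical, and malloc is
used here only as a stand-in for whatever allocator the framework actually
uses.

```c
#include <stdlib.h>

/* Allocate the temporary buffer only when the k dimension spans more
   than one KC block (k > kc). Returns NULL when no buffer is needed;
   in this simplified sketch an allocation failure also returns NULL,
   so callers that need to tell the two apart must check k > kc. */
static float *maybe_alloc_temp_buffer(size_t m, size_t n,
                                      size_t k, size_t kc)
{
    if (k <= kc)
        return NULL;
    return malloc(m * n * sizeof(float));
}
```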

AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044

2023-05-18 10:02:56 -04:00

aocl_gemm

BF16 Downscale and Performance fix for bf16 API

2023-05-18 10:02:56 -04:00

gemmd

Added support for addons.

2022-03-31 12:03:27 +05:30