blis/bench
mkadavil 3870792e62 Low precision gemm s32 downscale optimization.
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32s8 api (downscale api) performs poorly when k is small
(k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for the downscale api, a temporary buffer is
used to accumulate the intermediate s32 output, so that it can be used
in later iterations of the pc loop (k dim). The store to this buffer
can be avoided if k < KC. Here intermediate accumulation is not
required, since the pc loop runs only once and the output can be
downscaled and stored after that first iteration.
b. beta != 0: In this case the existing values of the original s8 C
output matrix need to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in a temporary buffer in
the pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer
is passed to the micro kernel and beta scaling is applied to it.
However, the mxNC block copy is costly and can be avoided if a new
condition is introduced for beta scaling in the micro kernel, whereby
the original s8 data is loaded directly into a register (instead of
from the temporary buffer), converted to s32 and beta scaled.
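
The two scenarios above can be illustrated with a scalar reference
sketch. This is not the actual AVX kernel code: the struct
post_op_attr_sketch is a hypothetical stand-in for lpgemm_post_op_attr,
and its field names, the function names and the rounding mode are all
assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lpgemm_post_op_attr (field names assumed). */
typedef struct {
    int8_t *c_s8;     /* original s8 C matrix, also the final output */
    int64_t rs_c_s8;  /* row stride of c_s8 */
    int is_first_k;   /* non-zero on the first pc (k) iteration */
    int is_last_k;    /* non-zero on the last pc (k) iteration */
    float scale;      /* downscale factor applied before the s8 store */
} post_op_attr_sketch;

/* Round half away from zero, then saturate to the s8 range (assumed). */
static int8_t downscale_to_s8(float v)
{
    int32_t r = (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    if (r > 127) r = 127;
    if (r < -128) r = -128;
    return (int8_t)r;
}

/* Scalar micro kernel sketch for one m x n tile and one k block.
 * a is m x k (u8), b is k x n (s8), both row-major.
 * c_s32 is the temporary s32 buffer; it is never touched when the k
 * block is both first and last (k < KC), so it may then be NULL.
 * The caller is assumed to pass beta = 1 on later k iterations. */
static void ukr_sketch(int m, int n, int k,
                       const uint8_t *a, const int8_t *b,
                       int32_t *c_s32, int64_t rs_c_s32,
                       int32_t beta, const post_op_attr_sketch *attr)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];

            if (beta != 0) {
                if (attr->is_first_k) {
                    /* Scenario b: load the original s8 value directly,
                     * widen to s32 and beta scale (no mxNC copy). */
                    acc += beta * (int32_t)attr->c_s8[i * attr->rs_c_s8 + j];
                } else {
                    /* Later k blocks read the s32 partial sums. */
                    acc += beta * c_s32[i * rs_c_s32 + j];
                }
            }

            if (attr->is_last_k) {
                /* Scenario a: downscale and store, skipping the
                 * temporary buffer store. */
                attr->c_s8[i * attr->rs_c_s8 + j] =
                    downscale_to_s8((float)acc * attr->scale);
            } else {
                c_s32[i * rs_c_s32 + j] = acc;
            }
        }
    }
}
```

When k < KC the caller would set both is_first_k and is_last_k, so the
temporary s32 buffer is neither read nor written in this sketch.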

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
2023-01-10 13:15:22 +05:30