mirror of
https://github.com/amd/blis.git
synced 2026-05-12 01:59:59 +00:00
-The post-operation attributes are moved to a new struct, lpgemm_post_op_attr, and an object of this struct is passed to the low-precision gemm kernels in place of multiple separate parameters.
-The performance of the u8s8s32s8 API (downscale API) is low when k is small (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for the downscale API, a temporary buffer is used to accumulate the intermediate s32 output so that it can be used in later iterations of the pc loop (k dimension). The store to this buffer can be avoided if k < KC. No intermediate accumulation is required in that case, since the output can be downscaled and stored right after the first iteration of the pc loop.
b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and scaled by beta. Currently the s8 values are converted to s32 and stored in a temporary buffer in the pc loop (5-loop algorithm) in blocks of m x NC. This temporary buffer is passed to the micro kernel, which applies beta scaling to it. However, the m x NC block copy is costly. It can be avoided by introducing a new beta-scaling condition in the micro kernel, under which the original s8 data is loaded directly into a register (instead of from the temporary buffer), converted to s32, and then scaled by beta.

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
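
The two scenarios above can be illustrated with a scalar sketch of a micro-kernel epilogue. This is only a model under stated assumptions, not the actual kernel code: the struct post_op_attr_sketch and its fields, the function tile_epilogue_sketch, the round-half-away-from-zero downscale, and the convention that the caller passes beta = 1 on later pc-loop iterations are all hypothetical. The real lpgemm_post_op_attr field list is not given in this message, and the real kernels are vectorized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lpgemm_post_op_attr: bundles what would
   otherwise be several separate micro-kernel parameters. */
typedef struct {
    int8_t *buf_downscale;  /* original s8 C matrix (also the final output) */
    size_t  rs_c_downscale; /* row stride of the s8 C matrix */
    bool    is_first_k;     /* true on the first pc-loop iteration */
    bool    is_last_k;      /* true on the last pc-loop iteration */
} post_op_attr_sketch;

/* Clamp an s32 value into the s8 range. */
static int8_t saturate_s8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Scalar model of the epilogue for one m x n tile.
   acc   : s32 products of A*B for the current k block, row-major, stride n.
   c_s32 : temporary s32 buffer; only touched when k spans more than one
           pc-loop iteration, so it may be NULL when k < KC.
   beta  : assumed to be the user's beta on the first iteration and 1 on
           later iterations (assumption of this sketch). */
static void tile_epilogue_sketch(const int32_t *acc,
                                 int32_t *c_s32, size_t rs_c_s32,
                                 size_t m, size_t n,
                                 int32_t beta, float scale,
                                 post_op_attr_sketch attr)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t v = acc[i * n + j];

            if (beta != 0) {
                int32_t prev;
                if (attr.is_first_k) {
                    /* Scenario b: read the original s8 value directly and
                       widen it, instead of copying an m x NC block into
                       the temporary buffer beforehand. */
                    prev = (int32_t)attr.buf_downscale[i * attr.rs_c_downscale + j];
                } else {
                    /* Later iterations continue from the s32 partial sums. */
                    prev = c_s32[i * rs_c_s32 + j];
                }
                v += beta * prev;
            }

            if (attr.is_last_k) {
                /* Scenario a: with k < KC the first iteration is also the
                   last, so downscale and store without the s32 buffer. */
                float scaled = (float)v * scale;
                int32_t r = (int32_t)(scaled >= 0.0f ? scaled + 0.5f
                                                     : scaled - 0.5f);
                attr.buf_downscale[i * attr.rs_c_downscale + j] = saturate_s8(r);
            } else {
                c_s32[i * rs_c_s32 + j] = v;
            }
        }
    }
}
```

When k < KC, both is_first_k and is_last_k are true, so the sketch neither reads nor writes the temporary s32 buffer. That is the saving described in both scenarios.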