Commit Graph

34 Commits

Author SHA1 Message Date
mkadavil
3d74b62e60 Lpgemm threading and micro-kernel optimizations.
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.

AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
2023-03-16 11:44:51 +05:30
eashdash
e36f699939 Implemented ERF Based GeLU Activation for LPEGMM and SGEMM
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
   error function - ERF
2. Added error function based GeLU activation post-ops for the
   S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this includes frame and micro-kernel level changes in
   addition to adding the marco based function definations of the
   ERF function in the math-utils and gelu headerfiles.

AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
2023-03-13 06:10:31 -04:00
mkadavil
8dff49837d Lpgemm source restructuring to support amdzen config.
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.

AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
2023-02-21 08:35:38 -05:00
bhaskarn
91a9968a5e Developed intrinsic based f32 kernels in lpgemm
Description:

1. Developed row variant intrinsic Kernels for float32/sgemm
   which are called from lpgemm api aocl_gemm_f32f32f32of32()

2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are
   developed using avx512.

3. 6x16m main kernel and respective n fringe and mn fringe are
   are developed based on avx2 and avx

4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic
   dispatch are planned next

5. When leading dims are greater than dims bench_lpgemm need
   to be updated to test it and this is planned next.

Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
2023-02-20 01:11:22 -05:00
mkadavil
63ee4c5e4c Remove memcpy usage in u8s8s32 lpgemm micro kernels.
-As of now, memcpy is used in u8s8s32 micro-kernel for copying in k
fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for
small k/n dimensions, memcpy invocation has high overhead.
-This issue is fixed by replacing memcpy with a MACRO based
implementation of copy routine, specifically optimized for the sizes
that will be encountered in fringe cases (k < 4, NR' < 16).

AMD-Internal: [CPUPL-3008]
Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc
2023-02-16 05:52:19 -05:00
eashdash
672544bc04 GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
   implement this post-op.
3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. TANH and EXPF math functions are efficiently implemented in
   macro-based fashion to exploit register level fusion of GeLU
   with GEMM operations for improved performance
5. LPGEMM bench is changed to pass GeLU post-op as input and
   support accuracy check to verify functional correctness

AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
2023-02-02 08:25:04 -05:00
mkadavil
4b5e24d0d9 Column major input support for f32 gemm (sgemm for lpgemm).
-The f32 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.

AMD-Internal: [CPUPL-2919]
Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3
2023-01-16 04:04:21 -05:00
mkadavil
3870792e62 Low precision gemm s32 downscale optimization.
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32s8 api (downscale api) performance is low when the k
value is less (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for downscale api, a temporary buffer is
used to accumulate intermediate s32 output, so that it can be used
in later iterations of pc loop (k dim). The usage of this buffer
(store) can be avoided if k < KC. Here intermediate accumulation
is not required, since the after the first iteration of the pc loop,
the output can be downscaled and stored.
b. beta != 0: In this case the existing values of the original s8 C
output matrix needs to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in temporary buffer in
pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer
is passed to the micro kernel and beta scaling is applied on this.
However the mxNC block copy is costly and can be avoided if a new
condition is introduced for beta scaling in the micro kernel, whereby
the original s8 data is loaded instead of from the temporary buffer
to a register, converted to s32 and beta scaling applied on it.

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
2023-01-10 13:15:22 +05:30
Harihara Sudhan S
11c42ce1d3 C matrix prefetch for BF16 GEMM
- Broke down the KR loop inside the compute kernel into
	  two pieces
	- Added C matrix prefetch between the two decomposed
          pieces of KR loop

AMD-Internal: [CPUPL-2693]
Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e
2022-11-21 04:57:19 -05:00
Harihara Sudhan S
2f6c26eff6 Dynamic block tuning for LPGEMM u8s8s16os16
- NC value is adjusted based on the n dimension of the B
	  matrix.
	- KC value is adjusted based on the k dimension input
	  matrices.
	- In order to ensure a handshake between the reorder
          function and compute function, the same dynamic block
	  tuning logic is called in both places.
	- Reorder followed by compute scenarios mean that tuning
	  KC value based on MC value and vice versa is not feasible.

AMD-Internal: [CPUPL-2638]
Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429
2022-11-21 01:39:35 -05:00
eashdash
63864d7dfb Added clipping while downscaling for u8s8s32os8 and u8s8s16os8.
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]

AMD-Internal: [LWPZENDNN-493]

Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
2022-10-11 07:28:06 -04:00
mkadavil
f4702debb9 Zen4 compilation flag updates to support low precision gemm.
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.

AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
2022-09-29 08:19:40 -04:00
eashdash
d21cd51fde Accumulation type for alpha, beta values and BF16 bench integration
1. Correcting the type of alpha, and beta values from C_type
   (output type) to accumulation type.
   For the downscaled LPGEMM APIs, C_type will be the downscaled
   type but the required type for alpha and beta values should
   be the accumulation type.
2. BF16 bench integration with the LPGEMM bench for both the BF16
   (bf16bf16f32of32 and bf16bf16f32obf16) APIs

AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
2022-09-23 05:00:49 -04:00
Harihara Sudhan S
a45827b3f9 u8s8s16os16 bug fix for downscale operation
- Removed some read code from the macros for downscale
	- Store permute correction
	- Simplified macros for edge cases and corrected intermediate
	  operation

AMD-Internal:[CPUPL-2171]

Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68
2022-09-22 05:02:17 -04:00
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
mkadavil
9bc59cc500 Low Precision GEMM framework fixes for downscaling.
- The temporary buffer allocated for C matrix when downscaling is
enabled is not filled properly. This results in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix it.
- Low precision gemm bench updated for testing/benchmarking downscaling.

AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
2022-09-13 07:42:29 -04:00
Harihara Sudhan S
5b6cc5d39d Bug fix in s16 downscale operation
- Store operations was done to c matrix and not to c buffer

AMD-Internal:[CPUPL-2171]

Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa
2022-09-13 06:01:48 -04:00
eashdash
e1349c0c71 LPGEMM BF16 MT panel based balancing
Introduced multi-thread panel based balancing for BF16 to improve the
overall MT performance.

AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
2022-09-07 03:20:50 -04:00
eashdash
32a9e735f1 BF16 Output downscaling functionality
- BF16 instructions output is accumulated at a higher precision of
FP32 which needs to be converted to a lower precison of bf16 post
the GEMM operations. This is required in AI workloads where both
input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.

Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
2022-08-30 13:46:09 -04:00
Harihara Sudhan S
5faab43e66 Downscaling as part of u8s8s16os16
- int16 c matrix intermediate values are converted to int32,
	  then the int32 values are converted to fp32. On these fp32
	  values scaling is done
	- The resultant value is down scaled to int8 and stored in a
	  separate buffer

AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
2022-08-30 13:41:36 -04:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
eashdash
e674fae758 Post-Ops for bf16bf16f32
Functionality - Post-ops is a set of operations performed elemnent
wise on the output matrix post GEMM operation. The support for
the same is added by fusing post-ops with GEMM operations.

- Post-ops Bias, Relu and Parametric Relu are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add check for bf16 ISA support

Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
2022-08-30 08:14:14 +00:00
mkadavil
a7d1cc7369 Multi-Threading support for BFloat16 gemm.
-OpenMP based multi-threading support added for BFloat16 gemm.
Both gemm and reorder api's are parallelized.
-Multi-threading support for u8s8s16 reorder api.
-Typecast issues fixed for bfloat16 gemm kernels.

AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
2022-08-30 02:54:19 -04:00
Harihara Sudhan S
326d8a557f Performance regression in u8s8s16os16
- Performance of u8s8s16os16 came down by 40% after the
	  introduction of post-ops
	- Analysis revealed that the target compiler assumed false
	  dependency and was generating sub-optimal code due to the
          post-ops structure
	- Inserted vzeroupper to hint the compiler that no ISA change
	  will occur

AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
2022-08-29 03:20:02 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
Harihara Sudhan S
d1eaf65a26 Post-Ops for u8s8s16os16
Functionality - Post-ops is an operation performed on every element
of the output matrix after GEMM operation is completed.

	- Post-ops relu and bias added to all the compute kernels
	  of u8s8s16os16
	- Post-ops are done on the value loaded into the register
	  to avoid reloading of C matrix elements
	- Minor bug fixes in openmp thread decorator of lpgemm
	- Added test cases to lpgemm bench input file

AMD-Internal: [CPUPL-2171]

Change-Id: If49f763fdfac19749f6665c172348691165d8631
2022-08-09 14:52:41 +05:30
Harihara Sudhan S
60de0a1856 Multithreading and support for unpacked B matrix in u8s8s16os16
Fucntionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call packing of B matrix is done as
part of the five loop algorithm. The state of B matrix (packed
or unpacked) is given as an user input.

	- Packing of B matrix is done as part of the five loop
	  compute.
	- Temprorary buffer for pack B is allocated in the five
	  loop algorithm
	- Multithreading for computation kernel
	- Configuration constants for u8s8s16os16 are part of the
	  lpgemm config

AMD-Internal: [CPUPL-2171]

Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
2022-08-05 19:28:37 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30
Harihara Sudhan S
e5d4fc2a70 Added low precision GEMM (u8s8s16os16)
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.

	- The compute kernel will perform computation only
	  if B matrix reordered to suit the usage of AVX2
	  instruction vpmaddubsw.
	- Kernel for packing the B matrix is provided.
	- LPGEMM bench code was modified to test the
	  performance and accuracy of the new variant.

AMD-Internal: [CPUPL-2171]

Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
2022-08-02 02:20:00 -04:00
mkadavil
f63e699c08 Fix for segmentation fault in low precision gemm.
- Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL
when compiled without open mp threading support. Subsequently the code
is executed as if it is single-threaded. However, when B matrix needs
to be packed, communicators are required (irrespective of single or
multi-threaded), and the code accesses lpgemm_thrinfo_t for the same
without NULL check. This results in seg fault.
For the fix, a non-open mp thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator before
invoking the 5 loop algorithm. This object will be used for packing.

- Makefile for compilation of aocl_gemm bench.

AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
2022-07-21 11:51:40 +05:30
mkadavil
6c112632a7 Low precision gemm integrated as aocl_gemm addon.
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.

AMD-Internal: [CPUPL-2102]

Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
2022-06-09 10:28:38 -04:00
Field G. Van Zee
7a0ba4194f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
2022-03-31 12:03:27 +05:30