Commit Graph

19 Commits

mkadavil
9bc59cc500 Low Precision GEMM framework fixes for downscaling.
- The temporary buffer allocated for the C matrix when downscaling is
enabled is not filled properly. This results in wrong GEMM accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix this.
- Low precision gemm bench updated for testing/benchmarking downscaling.

AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
2022-09-13 07:42:29 -04:00
Harihara Sudhan S
5b6cc5d39d Bug fix in s16 downscale operation
- Store operations were done to the C matrix instead of the C buffer

AMD-Internal:[CPUPL-2171]

Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa
2022-09-13 06:01:48 -04:00
eashdash
e1349c0c71 LPGEMM BF16 MT panel based balancing
Introduced multi-threaded panel-based balancing for BF16 to improve the
overall MT performance.
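A minimal sketch of one common panel-balancing scheme, assuming the goal is to split panels across threads so loads differ by at most one panel; the library's actual BF16 partitioning logic may differ:

```c
/* Compute the half-open panel range [start, end) owned by thread 'tid'
   when n_panels units of work are split across n_threads, with the
   remainder spread one panel at a time over the lowest thread ids. */
static void panel_range( int n_panels, int n_threads, int tid,
                         int* start, int* end )
{
    int base = n_panels / n_threads;   /* panels every thread gets    */
    int rem  = n_panels % n_threads;   /* leftover panels to spread   */
    *start = tid * base + ( tid < rem ? tid : rem );
    *end   = *start + base + ( tid < rem ? 1 : 0 );
}
```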

AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
2022-09-07 03:20:50 -04:00
eashdash
32a9e735f1 BF16 Output downscaling functionality
- The output of BF16 instructions is accumulated at the higher precision
of FP32, which needs to be converted to the lower-precision bf16 format
after the GEMM operation. This is required in AI workloads where both
input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.
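A scalar sketch of the FP32-to-BF16 conversion described above, assuming round-to-nearest-even; this is illustrative only and is not the vectorized kernel code:

```c
#include <stdint.h>
#include <string.h>

/* Convert one fp32 value to bf16 (stored as uint16_t) by keeping the
   top 16 bits of the IEEE-754 encoding, with round-to-nearest-even. */
static uint16_t float_to_bf16( float f )
{
    uint32_t bits;
    memcpy( &bits, &f, sizeof( bits ) );
    /* Rounding bias: 0x7FFF plus the lsb of the surviving mantissa,
       which implements ties-to-even. */
    uint32_t lsb = ( bits >> 16 ) & 1;
    bits += 0x7FFF + lsb;
    return ( uint16_t )( bits >> 16 );
}
```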

Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
2022-08-30 13:46:09 -04:00
Harihara Sudhan S
5faab43e66 Downscaling as part of u8s8s16os16
- int16 C matrix intermediate values are converted to int32, then the
  int32 values are converted to fp32, and scaling is done on these
  fp32 values.
- The resultant value is downscaled to int8 and stored in a separate
  buffer.
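A scalar model of the downscale chain described above, with a hypothetical per-output 'scale' factor; rounding and saturation behavior are assumptions, not the kernel's exact code:

```c
#include <stdint.h>

/* int16 accumulator -> int32 -> fp32, scale, then round and saturate
   to the int8 output range. */
static int8_t downscale_s16_to_s8( int16_t acc, float scale )
{
    int32_t wide   = ( int32_t )acc;          /* int16 -> int32       */
    float   scaled = ( float )wide * scale;   /* int32 -> fp32, scale */
    /* Round to nearest, then clamp to [-128, 127]. */
    int32_t r = ( int32_t )( scaled + ( scaled >= 0.0f ? 0.5f : -0.5f ) );
    if ( r >  127 ) r =  127;
    if ( r < -128 ) r = -128;
    return ( int8_t )r;
}
```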

AMD-Internal: [CPUPL-2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
2022-08-30 13:41:36 -04:00
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
eashdash
e674fae758 Post-Ops for bf16bf16f32
Functionality - Post-ops are a set of operations performed element-wise
on the output matrix after the GEMM operation. Support for them is
added by fusing the post-ops with the GEMM operations.

- Post-ops Bias, ReLU and Parametric ReLU are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add a check for bf16 ISA support

Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
2022-08-30 08:14:14 +00:00
mkadavil
a7d1cc7369 Multi-Threading support for BFloat16 gemm.
-OpenMP based multi-threading support added for BFloat16 gemm.
Both the gemm and reorder APIs are parallelized.
-Multi-threading support for the u8s8s16 reorder API.
-Typecast issues fixed in the bfloat16 gemm kernels.

AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
2022-08-30 02:54:19 -04:00
Harihara Sudhan S
326d8a557f Performance regression in u8s8s16os16
- Performance of u8s8s16os16 dropped by 40% after the introduction
  of post-ops
- Analysis revealed that the target compiler assumed a false
  dependency and was generating sub-optimal code due to the
  post-ops structure
- Inserted vzeroupper to hint to the compiler that no ISA change
  will occur
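A hedged sketch of the hint described above: an explicit vzeroupper at the post-ops boundary signals that no AVX-SSE transition penalty can follow, so the compiler stops guarding against the false dependency. The function name is illustrative, and preprocessor guards keep the sketch compilable on targets without AVX:

```c
#if defined( __x86_64__ ) || defined( __i386__ ) || defined( _M_X64 )
#include <immintrin.h>
#endif

/* Emit vzeroupper (when AVX is available) at the boundary between the
   GEMM accumulation and the post-ops code path. Returns 0 so the hint
   site is trivially checkable. */
static int post_ops_boundary_hint( void )
{
#if defined( __AVX__ )
    _mm256_zeroupper();
#endif
    return 0;
}
```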

AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
2022-08-29 03:20:02 -04:00
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is a generalization of leaky ReLU in which the
leakage coefficient is tunable. Support for it is added following
the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
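A scalar model of the Parametric ReLU post-op described above (the kernels apply it vectorized, per register; this sketch only shows the arithmetic):

```c
/* PReLU: negative outputs are multiplied by a tunable leakage
   coefficient 'alpha' instead of being zeroed. alpha = 0 recovers
   plain ReLU; alpha = 1 is the identity. */
static float prelu( float x, float alpha )
{
    return ( x >= 0.0f ) ? x : alpha * x;
}
```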

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to the
addon - BFloat16. The kernel takes bf16 inputs and performs the BF16
GEMM operation. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if the B matrix is
reordered in accordance with the usage of the AVX-512 BF16 instruction
dpbf16_ps
2. A kernel for packing the B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
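A sketch of the out-of-bounds fix described above: instead of loading 16 bias values directly (which over-reads when n remaining < 16), the valid elements are first copied into a zero-initialized 16-element staging buffer, and the full-width load reads from that buffer. The function name and signature are illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Stage n_rem (< 16) bias elements into a 16-wide buffer so that a
   subsequent full 16-lane vector load cannot read out of bounds. */
static void load_bias_fringe( const int32_t* bias, int n_rem,
                              int32_t staged[ 16 ] )
{
    memset( staged, 0, 16 * sizeof( int32_t ) );              /* zero pad */
    memcpy( staged, bias, ( size_t )n_rem * sizeof( int32_t ) );
    /* A full-width load from 'staged' is now safe. */
}
```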

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
Harihara Sudhan S
d1eaf65a26 Post-Ops for u8s8s16os16
Functionality - Post-ops are operations performed on every element
of the output matrix after the GEMM operation is completed.

	- Post-ops ReLU and bias added to all the compute kernels
	  of u8s8s16os16
	- Post-ops are done on the value already loaded into the
	  register to avoid reloading C matrix elements
	- Minor bug fixes in the OpenMP thread decorator of lpgemm
	- Added test cases to the lpgemm bench input file

AMD-Internal: [CPUPL-2171]

Change-Id: If49f763fdfac19749f6665c172348691165d8631
2022-08-09 14:52:41 +05:30
Harihara Sudhan S
60de0a1856 Multithreading and support for unpacked B matrix in u8s8s16os16
Functionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call, packing of the B matrix is done as
part of the five-loop algorithm. The state of the B matrix (packed
or unpacked) is given as a user input.

	- Packing of the B matrix is done as part of the five-loop
	  compute.
	- A temporary buffer for packed B is allocated in the
	  five-loop algorithm.
	- Multithreading for the computation kernel.
	- Configuration constants for u8s8s16os16 are part of the
	  lpgemm config.
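A minimal sketch of the on-the-fly packing path, assuming a row-major kc x nc panel of B is copied into a contiguous temporary buffer; the real kernel additionally interleaves rows into the vpmaddubsw-friendly layout, which is omitted here:

```c
#include <stdint.h>
#include <string.h>

/* Copy a kc x nc panel of B (leading dimension ldb) into a contiguous
   temporary pack buffer allocated inside the five-loop algorithm. */
static void pack_b_panel( const int8_t* b, int ldb,
                          int kc, int nc, int8_t* pack_buf )
{
    for ( int k = 0; k < kc; ++k )
    {
        memcpy( pack_buf + ( size_t )k * nc,
                b        + ( size_t )k * ldb,
                ( size_t )nc );
    }
}
```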

AMD-Internal: [CPUPL-2171]

Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
2022-08-05 19:28:37 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads in conjunction
with pre and post operations. Performing gemm and these ops together
at the micro-kernel level results in better overall performance due to
cache/register reuse of the output matrix. The provision for defining
the post-operations and invoking the micro-kernel with them from the
framework is added as part of this change. This includes new data
structures/functions to define the post-ops to be applied and an
extensible template with which new post-ops can easily be integrated.
As for the post-operations, ReLU and Bias Add for u8s8s32 are
implemented in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.
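A hypothetical shape for the extensible post-ops descriptor described above: each entry names an operation and carries an argument pointer, and the micro-kernel walks the chain after accumulation. The actual aocl_gemm type and field names differ; this only illustrates the pattern:

```c
#include <stdint.h>

typedef enum { POST_OP_BIAS, POST_OP_RELU } post_op_kind_t;

typedef struct
{
    post_op_kind_t kind;
    const void*    arg;   /* e.g. bias vector for POST_OP_BIAS */
} post_op_t;

/* Apply a post-op chain to one int32 output element at column j. */
static int32_t apply_post_ops( int32_t acc, const post_op_t* ops,
                               int n_ops, int j )
{
    for ( int i = 0; i < n_ops; ++i )
    {
        switch ( ops[ i ].kind )
        {
            case POST_OP_BIAS:
                acc += ( ( const int32_t* )ops[ i ].arg )[ j ];
                break;
            case POST_OP_RELU:
                if ( acc < 0 ) acc = 0;
                break;
        }
    }
    return acc;
}
```

The chain layout is what makes the template extensible: adding a new post-op means adding an enum value and one switch arm, without touching the GEMM accumulation code.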

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30
Harihara Sudhan S
e5d4fc2a70 Added low precision GEMM (u8s8s16os16)
Feature Addition : Added low precision GEMM to the addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs the GEMM operation. The intermediate accumulation and
output are in signed int16.

	- The compute kernel will perform computation only if the
	  B matrix is reordered to suit the usage of the AVX2
	  instruction vpmaddubsw.
	- A kernel for packing the B matrix is provided.
	- The LPGEMM bench code was modified to test the
	  performance and accuracy of the new variant.
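A scalar model of the vpmaddubsw step this kernel is built around: each unsigned 8-bit element of A is multiplied with the corresponding signed 8-bit element of B, and adjacent products are summed with saturation into int16. This mirrors the documented instruction semantics, not the kernel's vector code:

```c
#include <stdint.h>

/* One vpmaddubsw lane: (u8 * s8) + (u8 * s8), saturated to int16. */
static int16_t maddubs_pair( uint8_t a0, int8_t b0,
                             uint8_t a1, int8_t b1 )
{
    int32_t sum = ( int32_t )a0 * b0 + ( int32_t )a1 * b1;
    if ( sum >  32767 ) sum =  32767;   /* saturate to int16 range */
    if ( sum < -32768 ) sum = -32768;
    return ( int16_t )sum;
}
```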

AMD-Internal: [CPUPL-2171]

Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
2022-08-02 02:20:00 -04:00
mkadavil
f63e699c08 Fix for segmentation fault in low precision gemm.
- Low precision gemm sets the thread meta data (lpgemm_thrinfo_t) to
NULL when compiled without OpenMP threading support. Subsequently the
code is executed as if it were single-threaded. However, when the B
matrix needs to be packed, communicators are required (irrespective of
single or multi-threading), and the code accesses lpgemm_thrinfo_t for
this without a NULL check. This results in a seg fault.
For the fix, a non-OpenMP thread decorator layer is added, which
creates a placeholder lpgemm_thrinfo_t object with a communicator
before invoking the five-loop algorithm. This object is used for
packing.

- Makefile for compilation of aocl_gemm bench.

AMD-Internal: [CPUPL-2304]
Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc
2022-07-21 11:51:40 +05:30
mkadavil
6c112632a7 Low precision gemm integrated as aocl_gemm addon.
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.

AMD-Internal: [CPUPL-2102]

Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
2022-06-09 10:28:38 -04:00
Field G. Van Zee
7a0ba4194f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
2022-03-31 12:03:27 +05:30