Commit Graph

620 Commits

Author SHA1 Message Date
Vignesh Balasubramanian
ef545b928e Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel
- The call to bli_saxpyf_zen_int_6( ... ) is explicitly
  present in the bli_gemv_unf_var2_amd.c file, as part of the
  bli_sgemv_unf_var2( ... ) function. This was changed to
  bli_saxpyf_zen_int_5( ... ) (thereby changing the fuse factor
  from 6 to 5), in accordance with the function pointer present
  in the zen3 and zen4 context files.

- Changed the accumulator type from float to double inside the
  fringe loop, for both unit strides (vectorized path) and non-unit
  strides (scalar code).
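
A minimal sketch of why a wider accumulator helps in a fringe loop. The function names are hypothetical and this is not the BLIS kernel; it only assumes the loop sums float products.

```c
#include <stddef.h>

/* Hypothetical helper, not the BLIS kernel: sums a[i] * x[i] in double
   precision and rounds to float only once, at the end. */
float fringe_sum_double_acc(const float *a, const float *x, size_t n)
{
    double acc = 0.0;                        /* wider accumulator */
    for (size_t i = 0; i < n; ++i)
        acc += (double)a[i] * (double)x[i];
    return (float)acc;                       /* single rounding to float */
}

/* Same loop with a float accumulator, for comparison. */
float fringe_sum_float_acc(const float *a, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * x[i];                  /* rounds after every step */
    return acc;
}
```

With the inputs {1e8, 1, -1e8} multiplied by ones, the float accumulator loses the 1 while the double accumulator keeps it.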

AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
2023-11-03 01:37:43 -04:00
mkadavil
d1844678f4 LPGEMM <u|s>8s8s16ou8 fixes for incorrect zero point addition.
-The zero point data type is different based on the downscale data
type. For int8_t downscale type, zero point type is int8_t whereas for
uint8_t downscale type, it is uint8_t. During downscale post-op, the
micro-kernels upscale the zero point from its data type (int8_t or
uint8_t) to that of the accumulation data type and then perform the
zero point addition. The accumulated output is then stored as downscaled
type in a later storage phase. For the <u|s>8s8s16 micro-kernels, the
upscaling to int16_t (accumulation type) is always performed assuming
the zero point is int8_t using the _mm256_cvtepi8_epi16 instruction.
However this will result in incorrect upscaled zero point values if the
downscale type is uint8_t and the associated zero point type is also
uint8_t. This issue is corrected by switching between the correct
upscale instruction based on the zero point type.
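
A scalar sketch of the difference between the two upscaling paths. The helper name and the flag are hypothetical; in vector code the zero-extending AVX2 counterpart of _mm256_cvtepi8_epi16 is _mm256_cvtepu8_epi16, but whether the fix uses exactly that intrinsic is an assumption.

```c
#include <stdint.h>

/* Hypothetical scalar helper: widen one zero-point byte to the int16_t
   accumulation type. zp_is_unsigned selects zero extension (uint8_t zero
   point) instead of sign extension (int8_t zero point). */
int16_t upscale_zero_point(uint8_t raw, int zp_is_unsigned)
{
    if (zp_is_unsigned)
        return (int16_t)raw;          /* zero-extend: 0..255 */
    /* Reinterpret the byte as signed (two's complement assumed),
       then sign-extend: -128..127. */
    return (int16_t)(int8_t)raw;
}
```

For the byte value 200, sign extension yields -56 while zero extension yields 200, which is the mismatch described above.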

AMD-Internal: [SWLCSG-2500]
Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076
2023-11-02 01:30:48 -04:00
Nallani Bhaskar
b3391ef5da Updated ERF threshold and packa changes in bf16
Description:
    1. Updated ERF function threshold from 3.91920590400 to 3.553
       to match the reference erf float implementation, which
       reduced errors at the borders, and also clipped the output
       to 1.0
    2. Updated packa function call with pack function ptr in bf16
       api to avoid compilation issues for non avx512bf16 archs

    3. Updated lpgemm bench

    [AMD-Internal: SWLCSG-2423 ]

Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
2023-10-29 23:55:46 +05:30
Shubham Sharma
d45d1d68c6 Reset ZMM Registers before exiting, in L3 APIs
- Registers ZMM16 to ZMM31 are zeroed after L3 API calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.

AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
2023-10-27 00:51:04 -04:00
Harsh Dave
7bcb701b79 Fixed functionality failure for dgemm tiny kernel.
- For k > KC, the C matrix was being scaled by beta on each
iteration of the K loop. It should be scaled only once. Fixed the
scaling of the C matrix by beta in the K loop.

- Corrected A and B matrix buffer offsets, for cases where k > KC.
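
A minimal reference sketch of the intended behaviour, assuming row-major storage, k > 0 and a block size kc > 0. The function name is hypothetical and this is not the tiny kernel itself.

```c
#include <stddef.h>

/* Hypothetical reference loop: C = beta*C + A*B with the k dimension
   split into blocks of at most kc. A is m x k, B is k x n, C is m x n,
   all row-major. Beta is applied on the first k block only. */
void gemm_kblocked(size_t m, size_t n, size_t k, size_t kc,
                   const double *a, const double *b,
                   double beta, double *c)
{
    for (size_t p0 = 0; p0 < k; p0 += kc) {
        size_t kb = (k - p0 < kc) ? (k - p0) : kc;
        double beta_blk = (p0 == 0) ? beta : 1.0;   /* scale C once */
        const double *a_blk = a + p0;       /* advance A by p0 columns */
        const double *b_blk = b + p0 * n;   /* advance B by p0 rows */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                double acc = 0.0;
                for (size_t p = 0; p < kb; ++p)
                    acc += a_blk[i * k + p] * b_blk[p * n + j];
                c[i * n + j] = beta_blk * c[i * n + j] + acc;
            }
    }
}
```

Applying beta in every block would scale the partial sums of earlier blocks again, which is the failure described in the first item.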

AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
2023-10-26 15:11:59 +05:30
Edward Smyth
f5505be9f3 Merge commit 'e366665c' into amd-main
* commit 'e366665c':
  Fixed stale API calls to membrk API in gemmlike.
  Fixed bli_init.c compile-time error on OSX clang.
  Fixed configure breakage on OSX clang.
  Fixed one-time use property of bli_init() (#525).
  CREDITS file update.
  Added Graviton2 Neoverse N1 performance results.
  Remove unnecessary windows/zen2 directory.
  Add vzeroupper to Haswell microkernels. (#524)
  Fix Win64 AVX512 bug.
  Add comment about make checkblas on Windows
  CREDITS file update.
  Test installation in Travis CI
  Add symlink to blis.pc.in for out-of-tree builds
  Revert "Always run `make check`."
  Always run `make check`.
  Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. If the string contains both zen and zen2, and zen needs to be replaced with another string, then zen2 would also be incorrectly replaced.
  Update POWER10.md
  Rework POWER10 sandbox
  Skip clearing temp microtile in gemmlike sandbox.
  Fix asm warning
  Sandbox header edits trigger full library rebuild.
  Add vhsubpd/vhsubpd.
  Fixed bugs in cpackm kernels, gemmlike code.
  Armv8A Rename Regs for Safe Darwin Compile
  Armv8A Rename Regs for Clang Compile: FP32 Part
  Armv8A Rename Regs for Clang Compile: FP64 Part
  Asm Flag Mingling for Darwin_Aarch64
  Added a new 'gemmlike' sandbox.
  Updated Fugaku (a64fx) performance results.
  Add explicit compiler check for Windows.
  Remove `rm-dupls` function in common.mk.
  Travis CI Revert Unnecessary Extras from 91d3636
  Adjust TravisCI
  Travis Support Arm SVE
  Added 512b SVE-based a64fx subconfig + SVE kernels.
  Replace bli_dlamch with something less archaic (#498)
  Allow clang for ThunderX2 config

AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
2023-10-18 09:09:54 -04:00
Vignesh Balasubramanian
81161066e5 Multithreading the DNRM2 and DZNRM2 API
- Updated the bli_dnormfv_unb_var1( ... ) and
  bli_znormfv_unb_var1( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Added the logic to distribute the job among the threads such
  that only one thread has to deal with fringe case(if required).
  The remaining threads will execute only the AVX-2 code section
  of the computational kernel.

- Added reduction logic post parallel region, to handle overflow
  and/or underflow conditions as per the mandate. The reduction
  for both the APIs involve calling the vectorized kernel of
  dnormfv operation.

- Added changes to the kernel to have the scaling factors and
  thresholds prebroadcasted onto the registers, instead of
  broadcasting every time on a need basis.

- Non-unit stride cases are packed to be redirected to the
  vectorized implementation. In case the packing fails, the
  input is handled by the fringe case loop in the kernel.

- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
  and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
  cases of size = 2, and of size = 1 or non-unit strides, respectively.
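
One way to distribute the work so that only one thread sees a fringe, sketched under the assumption that the vectorized section consumes `vec` elements at a time. The function name, the parameter `vec` and the choice of the last thread are hypothetical.

```c
#include <stddef.h>

/* Hypothetical partitioning helper: split n elements among nt threads so
   that every thread receives a multiple of vec elements, and only the
   last thread additionally receives the n % vec leftover elements.
   Assumes nt > 0, vec > 0 and tid < nt. */
void thread_range(size_t n, size_t nt, size_t vec, size_t tid,
                  size_t *start, size_t *len)
{
    size_t blocks = n / vec;              /* full vector-width blocks */
    size_t per    = blocks / nt;          /* blocks every thread gets */
    size_t extra  = blocks % nt;          /* first 'extra' threads get one more */
    size_t mine   = per + (tid < extra ? 1 : 0);
    size_t first  = tid * per + (tid < extra ? tid : extra);

    *start = first * vec;
    *len   = mine * vec;
    if (tid == nt - 1)
        *len += n % vec;                  /* fringe goes to one thread only */
}
```

For n = 103, nt = 4 and vec = 4, the first three threads get 28, 24 and 24 elements, and the last one gets 24 plus the 3 fringe elements.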

AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
2023-10-16 07:26:27 -04:00
Harsh Dave
7a4f84fbac Optimized dgemm for tiny input sizes.
- This commit focuses on enhancing the performance of dgemm
for matrices with very small dimensions.

- blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. The SUP framework code path
requires creating and initializing blis objects, accessing all the
needed meta-information from objects, and querying contexts, which
adds a performance penalty when computing on matrices with very
small dimensions.

- To avoid this performance penalty, blis_dgemm_tiny function implements
lightweight support code so that it can re-use dgemm SUP kernels in such a way
that they operate directly on the input buffers. It avoids the framework overhead of
creating and initializing blis objects, context initialization, and accessing other
large framework data structures.

- blis_dgemm_tiny function checks that a threshold condition is met before
picking the kernel. For zen, zen2 and zen3 architectures, the tiny kernel is invoked
for any shape as long as m < 8 and k <= 1500, or m < 1000 and n <= 24 and k <= 1500.
For zen4, the tiny kernel is invoked as long as m, n and k are all less than 1500.

- blis_dgemm_tiny function supports single threaded computation as of now.
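
The threshold check can be read as the following predicate. This is a sketch: the enum, the function name and the grouping of the "and"/"or" terms are assumptions based on the wording above, not the actual dispatch code.

```c
#include <stdbool.h>

/* Hypothetical architecture tag, for illustration only. */
typedef enum { ARCH_ZEN, ARCH_ZEN2, ARCH_ZEN3, ARCH_ZEN4 } arch_t;

/* Returns true when the tiny dgemm path would be taken for an
   m x n x k problem, following the thresholds quoted in the message. */
bool use_tiny_dgemm(arch_t arch, long m, long n, long k)
{
    if (arch == ARCH_ZEN4)
        return m < 1500 && n < 1500 && k < 1500;

    /* zen, zen2, zen3 */
    return (m < 8 && k <= 1500) ||
           (m < 1000 && n <= 24 && k <= 1500);
}
```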

AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
2023-10-16 05:52:49 -04:00
Harsh Dave
edbbfd9a86 Optimized AVX512 DGEMM SUP edge kernels
- Edge kernels handle the corner cases, and especially where
there is only a small amount of computation to be done,
executing FMA efficiently becomes crucial.

- In the previous implementation, edge kernels used the same limited
number of vector registers to hold FMA results, which indirectly creates
a dependency on the previous FMA completing before the CPU can issue a new FMA.

- This commit addresses this issue by using the other vector registers
that are available to hold FMA results.

- That way we hold FMA results in two sets of vector registers, so that
subsequent FMAs won't have to wait for the previous FMA to complete.

- At the end of the unrolled K loop these two sets of vector registers are
added together to store the correct result in the intended vector registers.

- The following kernels are modified:
bli_dgemmsup_rv_zen4_asm_24x4m,
bli_dgemmsup_rv_zen4_asm_24x3m,
bli_dgemmsup_rv_zen4_asm_24x2m,
bli_dgemmsup_rv_zen4_asm_24x1m,
bli_dgemmsup_rv_zen4_asm_24x1,
bli_dgemmsup_rv_zen4_asm_16x1,
bli_dgemmsup_rv_zen4_asm_8x1,
bli_dgemmsup_rv_zen4_asm_24x2,
bli_dgemmsup_rv_zen4_asm_16x2,
bli_dgemmsup_rv_zen4_asm_8x2,
bli_dgemmsup_rv_zen4_asm_24x3,
bli_dgemmsup_rv_zen4_asm_16x3,
bli_dgemmsup_rv_zen4_asm_8x3,
bli_dgemmsup_rv_zen4_asm_16x4,
bli_dgemmsup_rv_zen4_asm_8x4,
bli_dgemmsup_rv_zen4_asm_16x5,
bli_dgemmsup_rv_zen4_asm_8x5,
bli_dgemmsup_rv_zen4_asm_16x6,
bli_dgemmsup_rv_zen4_asm_8x6,
bli_dgemmsup_rv_zen4_asm_8x7,
bli_dgemmsup_rv_zen4_asm_8x8

AMD-Internal: [CPUPL-3574]
Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44
2023-10-16 04:03:56 -04:00
Shubham Sharma
9a2a4151ac Added improved ZTRSM AVX2 kernels
- Added 2x6 ZGEMM row-preferred kernel.
  - Kernel supports prefetch_a, prefetch_b,
    prefetch_a_next and prefetch_b_next.
  - Multiple Ways to prefetch c are supported.
  - prefetch_a and prefetch_c are enabled by
    default.
  - K loop is divided into multiple subloops for
    better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
  and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
  use of these kernels for TRSM in zen3 and
  zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.

AMD-Internal: [CPUPL-3781]

Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
2023-10-13 07:43:33 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Meghana Vankadari
eb5ab3f762 LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
  column-major input format.
- Updated Early-return check conditions to account for trans
  parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
2023-10-12 23:36:18 +05:30
mkadavil
ea0324ab95 Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8>
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.

AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
2023-10-12 09:19:56 -04:00
Vignesh Balasubramanian
a6a67fea2d ZAXPBYV optimizations for handling unit and non-unit strides
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
  logic. The kernel performs two different sets of compute based
  on the value of alpha, for both unit and non-unit strides. There
  are no constraints on beta scaling of the 'y' vector.

- Updated the logic to support 'x' conjugate in the computation.
  The kernel supports conjugate/no conjugate operation through the
  usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
  intrinsics.

- Updated the early return condition in the kernel to adhere to
  the standard compliance.

- Updated the scalar computation with vector computation(using 128
  bit registers), in case of dealing with a single element(fringe case)
  in unit-stride or vectors with non-unit strides. A single dcomplex
  element occupies 128 bits in memory, thereby providing scope for
  this optimization.

- Added accuracy and extreme value testing with sufficient sizes
  and initializations, to test the required main and fringe cases
  of the computation.

AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
2023-10-12 06:31:08 -04:00
bhaskarn
5fd24c27a7 Updated expf max min precision to fix nan issue in Tanh
Description:
The expf_max and expf_min have more precision than
the computation, which leads to crossing the clipping threshold at
the edge cases and causes NaNs in the tanh output.

Updated the thresholds to lower precision to clip the
edge cases and avoid NaNs in the tanh output.

AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
2023-10-12 01:04:59 -04:00
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00
mkadavil
c3b97559c1 Zero Point support for <u|s>8s8s<32|16>os8 LPGEMM APIs
-Downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.
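
A scalar sketch of the downscale formula above for an s8 output. The function name, the round-half-away-from-zero rounding and the saturation to the int8_t range are assumptions for illustration; the micro-kernels perform this with vector instructions.

```c
#include <stdint.h>

/* Hypothetical scalar helper: x' = (x / scale_factor) + zero_point,
   saturated to the int8_t range. Assumes scale_factor != 0. */
int8_t downscale_s8(int32_t acc, float scale_factor, int8_t zero_point)
{
    float v = (float)acc / scale_factor + (float)zero_point;

    /* Clamp first so the cast below never overflows. */
    if (v > 127.0f)  v = 127.0f;
    if (v < -128.0f) v = -128.0f;

    /* Round half away from zero (an assumption of this sketch). */
    return (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}
```

For example, an accumulated value of 100 with scale_factor 4 and zero_point 3 maps to 28.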

AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
2023-10-10 12:05:47 +05:30
Harsh Dave
df80f40ccd Fixed incorrect ymm registers usage in FMA operation.
- Incorrect ymm registers were used in the dgemm SUP edge kernel
    while computing the FMA operation.

- The incorrect vector registers led to incorrect results.

- Corrected vector register usage for the FMA operation.

AMD-Internal: [CPUPL-3964]

Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df
2023-10-02 03:20:44 -04:00
Arnav Sharma
f0416cff08 SGEMM SUP Panel Stride Bug Fix
- The AVX512 SGEMM SUP rv m and n kernels did not accommodate the
  use of panel strides in case of packed matrices, thus resulting in
  incorrect matrix strides when packing was explicitly enabled using
  BLIS_PACK_A=1, BLIS_PACK_B=1 or both.
- The kernels are updated to use panel strides for traversing both A
  and B matrix buffers accurately.

[AMD-Internal]: CPUPL-3673
Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb
2023-09-27 03:02:24 -04:00
Harsh Dave
e437469a99 Optimized AVX2 DGEMM SUP edge kernels
- Edge kernels handle the corner cases, and especially where
there is only a small amount of computation to be done,
executing FMA efficiently becomes crucial.

- In the previous implementation, edge kernels used the same limited
number of vector registers to hold FMA results, which indirectly creates
a dependency on the previous FMA completing before the CPU can issue a new FMA.

- This commit addresses this issue by using the other vector registers
that are available to hold FMA results.

- That way we hold FMA results in two sets of vector registers, so that
subsequent FMAs won't have to wait for the previous FMA to complete.

- At the end of the unrolled K loop these two sets of vector registers are
added together to store the correct result in the intended vector registers.
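
The two-accumulator idea, sketched in scalar C on a dot product rather than in the assembly kernels. The function name is hypothetical; a compiler may or may not contract the multiply-adds into FMA instructions.

```c
#include <stddef.h>

/* Hypothetical illustration: two independent accumulators break the
   dependency chain, so the multiply-add for element i + 1 does not have
   to wait for the one for element i to finish. */
double dot_two_accumulators(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {           /* K loop unrolled by two */
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                            /* leftover element, if any */
        acc0 += x[i] * y[i];

    return acc0 + acc1;                   /* combine the two sets at the end */
}
```

Because the summation order changes, the result can differ from a single-accumulator loop in the last bits for general floating-point inputs.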

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
2023-09-22 03:43:47 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Harihara Sudhan S
278ca71706 Fixes for GEMV Functionality Issues
- Added call to dsetv in dscalv. When DSCALV is invoked by
  DGEMV the SCAL function is expected to SET the vector to
  zero when alpha is 0. This change is done to ensure BLAS
  compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
  1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.
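
A minimal sketch of the "set instead of scale" behaviour described in the first item, assuming a positive increment. The function name is hypothetical and this is not the BLIS dscalv kernel.

```c
#include <stddef.h>

/* Hypothetical helper: scale x by alpha, but when alpha is 0 write zeros
   explicitly. A plain multiply would turn Inf or NaN entries into NaN
   (Inf * 0 = NaN), whereas setting overwrites them. Assumes incx > 0. */
void scal_or_set(size_t n, double alpha, double *x, size_t incx)
{
    if (alpha == 0.0) {
        for (size_t i = 0; i < n; ++i)
            x[i * incx] = 0.0;            /* SET path */
        return;
    }
    for (size_t i = 0; i < n; ++i)
        x[i * incx] *= alpha;             /* regular SCAL path */
}
```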

AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
2023-08-14 13:54:48 +05:30
Harihara Sudhan S
03fa660792 Optimized xGEMV for non-unit stride X vector
- In GEMV variant 1, the input matrix A is in row major. X vector
  has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
  operation inside the kernel has been ensured by packing the input X
  vector to a temporary buffer with unit stride. Currently, the
  packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
  In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
  is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
  to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.
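
A sketch of the packing step for the DGEMV case, where X is scaled by alpha while being copied to a unit-stride buffer. The function name is hypothetical, a positive increment is assumed, and plain malloc stands in for whatever allocator the library uses.

```c
#include <stdlib.h>

/* Hypothetical helper: copy n elements of x (stride incx > 0) into a
   newly allocated unit-stride buffer, scaling by alpha on the way.
   Returns NULL if the allocation fails; the caller releases the buffer
   with free() once the GEMV operation is complete. */
double *pack_scaled_unit_stride(size_t n, double alpha,
                                const double *x, size_t incx)
{
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        buf[i] = alpha * x[i * incx];     /* gather and scale */
    return buf;
}
```

For the CGEMV and ZGEMV cases mentioned above, the same copy would be done with alpha passed as 1.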

AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
2023-08-08 01:01:22 -04:00
Edward Smyth
c445f192d5 BLIS: Missing clobbers (batch 6)
More missing clobbers in skx and zen4 kernels, missed in
previous commits.

AMD-Internal: [CPUPL-3521]
Change-Id: I838240f0539af4bf977a10d20302a40c34710858
2023-08-07 10:52:23 -04:00
Harihara Sudhan S
3be43d264f Optimized xGEMV for non-unit stride Y vector
- In variant 2 of GEMV, A matrix is in column major. Y vector has
  to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
  GEMV operation inside the kernel has been ensured by packing the
  input Y vector to a temporary buffer with unit stride. As part of
  the packing Y is scaled by beta to reduce the number of times Y
  vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
  buffer are copied to the original buffer and the temporary one is
  released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
  to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
2023-08-07 08:12:44 -04:00
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the edge kernels to use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions to avoid the overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons :

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit(with regards to BLAS mandate) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub intrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Eleni Vlachopoulou
9c613c4c03 Windows CMake bugfix in object libraries for shared library option
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory.
The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries.

AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
2023-05-24 17:30:16 +05:30
Edward Smyth
dea5fe4d12 BLIS: Missing clobbers (batch 5)
Add missing clobbers for AVX512 mask registers k0-k7
in zen4 kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I5f28c725d7af1466df4db4cdfa2d456bbc6ab36d
2023-05-23 15:40:29 -04:00
Edward Smyth
a3adfb68cf BLIS: Missing clobbers (batch 4)
Add missing clobbers in haswell (sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc
2023-05-23 14:57:08 -04:00
Edward Smyth
03965a4f07 BLIS: Missing clobbers (batch 3)
Add missing clobbers in haswell (non-sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I68f6ad0c01557fcde73b1775d250d48b5162c521
2023-05-23 14:37:31 -04:00
Edward Smyth
e960141fe2 BLIS: Missing clobbers (batch 2)
Add missing clobbers in other zen4 kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I5cceb44fe100e03269cfe21d8c4c0d2171b921c3
2023-05-23 13:12:20 -04:00
Edward Smyth
ea2eea5097 BLIS: Missing clobbers (batch 1)
Add missing clobbers in first batch of assembly kernels:
- zen3 bli_gemmsup*
- bli_zgemm_zen4_asm_12x4
- bli_gemmsup_rv_haswell_asm_sMx6

AMD-Internal: [CPUPL-3456]
Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1
2023-05-23 11:51:18 -04:00
Mangala V
5f5bc24989 Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API
Prevented calling avx2 based bli_zgemm_ref_k1_nn code on
non-supported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().

Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.

AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
2023-05-21 23:13:46 +05:30
eashdash
2c4f032e0f Fix for lack of BF16 instruction when compiled with GCC-11
GCC-11 and below support AVX512-BF16.
However, they don't support all the bf16 instructions required.

For bf16 downscale APIs, when beta scaling is done, C output
elements must be upscaled from BF16 type to Float type for
beta scaling operation.

For this upscaling operation of bf16 to float,
_mm512_cvtpbh_ps is used.

This however is not supported by GCC-11 and below
(but is supported on GCC 12 onwards)

Lack of this instruction support in gcc11, and below leads to
compilation issues with this instruction (_mm512_cvtpbh_ps)
not being recognized.

To fix this, we use a set of instructions:
1. register containing bf16 type
   __m256bh a1
2. Convert bf16 to float with shift left ops
   __m512 float_a1 = (__m512)
   (_mm512_sllv_epi32
   (_mm512_cvtepi16_epi32 ((__m256i) a1), _mm512_set1_epi32 (16)));
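
The same conversion for a single element, as a scalar sketch with a hypothetical function name. It relies on bf16 being the upper 16 bits of an IEEE-754 binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper: widen the 16 bf16 bits to 32 bits, shift
   them into the upper half, and reinterpret the result as a float. */
float bf16_bits_to_float(uint16_t bits)
{
    uint32_t wide = (uint32_t)bits << 16;   /* low 16 mantissa bits are 0 */
    float f;
    memcpy(&f, &wide, sizeof f);            /* bit-exact reinterpretation */
    return f;
}
```

For example, the bf16 pattern 0x3F80 converts to 1.0f and 0xC000 converts to -2.0f.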

AMD-Internal: [CPUPL-3454]
Change-Id: Ie4a9f04881c59ced088608633774b27f22b4ab8e
2023-05-19 10:15:08 +00:00
eashdash
061a68ff0d BF16 Downscale and Performance fix for bf16 API
This change contains the following:

1. Downscale optimization fix
   a. Similar to downscale optimizations made for s32 and s16 gemm,
      the following optimizations are done to improve the downscale
      performance for BF16 gemm
   b. The store to temporary float buffer can be avoided when k < KC
      since intermediate accumulation will not be required for the
      pc loop (only 1 iteration). The downscaled values (bf16) are
      written directly to the output C matrix.
   c. Within the micro-kernel when beta != 0, the bf16 data from the
      original C output matrix is loaded to a register, converted to
      float and beta scaling is applied on it at register level.
      This eliminates the requirement of previous design of copying the
      bf16 value to the temporary float buffer inside jc loop.

2. Alpha scaling
   a. Alpha scaling (multiply instruction) by default was resulting in
      performance regression when k dimension is small and alpha=1 in
      bf16 micro-kernels.
   b. Alpha scaling is now only done when alpha != 1.

3. K Fringe optimization
   a. Previously memcpy was used for K fringe case to load elements
      from A matrix in the microkernels
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs without the need to use
      memcpy functions

4. N LT-16 fringe optimization
   a. Previously memcpy was used for N LT 16 fringe case in the
      microkernels for storing the downscaled and non-downscaled output.
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs of BF16 without the need to use
      memcpy functions

5. Framework updates to avoid unnecessary pack buffer allocation
   a. The default allocation of the temporary pack buffer is removed
      and the pack buffer is now only allocated if k > KC.

AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044
2023-05-18 10:02:56 -04:00
Shubham Sharma
26e120ea25 Fixed diagonal packing for C/Z TRSM small
- In C/Z TRSM small, packing in case of unit diagonal
  is not handled properly.
- Diagonal elements are still being read even in case of
  unit diagonal.
- This causes "Conditional jump or move depends on
  uninitialised value" error during valgrind tests.
- To fix this, diagonal elements should not be read
  in case of unit diagonal.

AMD-Internal: [CPUPL-3406]
Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e
2023-05-18 07:57:21 -04:00
Eleni Vlachopoulou
1a7f60ff5b Update CMake system to use object libraries for haswell, skx and zen4.
- AVX2 and AVX512 flags are set up locally for each object library that requires them.
- Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally.
- To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library.

AMD-Internal: [CPUPL-3241]
Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73
2023-05-12 10:04:16 -04:00
Harsh Dave
30b931ae60 Fixed compilation error due to inconsistent compiler behavior towards AVX512 zero masking instruction syntax
- The code used the whitespace variant of the AVX512 masked instruction. Some compilers
accept the whitespace variant and some don't, so to be safe, we removed the whitespace.

- Whitespace variant of masked instruction "vmovupd    (%rax,%r8,1),%zmm8{%k2} {z}" is replaced with
  this instruction "vmovupd    (%rax,%r8,1),%zmm8{%k2}{z}" to resolve the compilation failure issue.

- Thanks to Shubham Sharma<shubham.sharma3@amd.com> for identifying issue.

AMD-Internal: [CPUPL-1963]

Change-Id: I290589132e8cce25cab0d1e4c195a7dd0a014937
2023-05-12 06:16:15 -04:00
mkadavil
b167e47091 LPGEMM frame and micro-kernel updates to fix gcc9.4 compilation issue.
-Micro-kernel: Some AVX512 intrinsics (e.g. _mm512_loadu_epi32) were
introduced in later versions of gcc (>10) in addition to the already
existing masked intrinsics (e.g. _mm512_mask_loadu_epi32). To
support compilation using gcc 9.4, either the masked intrinsic or another
gcc 9.4 compatible intrinsic (e.g. _mm512_loadu_si512) needs to be used
in LPGEMM Zen4 micro-kernels.
-Frame: The BF16 LPGEMM APIs (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32)
need to be disabled if the aocl_gemm (LPGEMM) addon is compiled using gcc
9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels
for BF16 LPGEMM are excluded from compilation based on the GNUC macro.

AMD-Internal: [CPUPL-3396]
Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7
2023-05-11 18:00:18 +05:30
Mangala V
7739a3fbfe Bug fix for 4xk AVX512 packing kernel
A few tests failed on Windows because some registers were missing
from the clobber list.

Added the following registers to the clobber list:
In function bli_zpackm_zen4_asm_12xk : ZMM12-ZMM15
In function bli_zpackm_zen4_asm_4xk : ZMM4-ZMM7

AMD-Internal: [CPUPL-3253]

Change-Id: I3e42130bf1a3b48717c4b437179ae3f116e5cf1d
2023-05-05 04:15:25 +05:30
Eleni Vlachopoulou
bf26b8ffbc Removing /arch:AVX2 flag from high-level CMake
- Previously, this flag was set as a default in the high-level CMakeLists.txt, which means that it was used to build everything (all files and all subdirectories), including ref_kernels and testsuite. In addition, all files were added as target sources for this project and compiled with the same flags.
- Now, we create object files from the sources in the kernels/ directory and add the AVX2 flag to those object files explicitly. Only those files get this flag, so it is no longer used to compile ref_kernels, etc.
- This is a quick solution to enable runs on non-AVX2 machines.

AMD-Internal: [CPUPL-3241]
Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650
2023-04-28 09:22:13 -04:00
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
0f0277e104 Code cleanup: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.

AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
2023-04-21 08:41:16 -04:00
eashdash
a72fff2be9 Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
   Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
   sum of every column of B matrix to implement the s8s8s16 operation
   using the s16 FMA instructions.
6. Mat Mul kernel files compute the GEMM output using s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (s16 FMA
   works with A matrix type uint8 only) by adding an extra 128 to
   every A matrix element.
7. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of the extra 128 to every
   A matrix element.
8. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are supported on s8s8s16os16/os8 also.
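
The offset-and-compensate step described in the list above can be checked with a scalar sketch. This is only an illustration of the arithmetic identity, not the kernel code: the function names are hypothetical, and it accumulates in int32_t, so it does not model the s16 accumulation or any saturation of the real s16 FMA path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: plain signed 8-bit dot product of one row of A with one
 * column of B. */
int32_t dot_s8s8_ref(const int8_t *a, const int8_t *b, size_t k)
{
    assert(k == 0 || (a != NULL && b != NULL));
    int32_t acc = 0;
    for (size_t p = 0; p < k; ++p)
        acc += (int32_t)a[p] * (int32_t)b[p];
    return acc;
}

/* Same result via the offset trick: shift each A element into the
 * uint8 range by adding 128, accumulate u8 * s8 products, then
 * subtract 128 * (sum of the B column). */
int32_t dot_s8s8_offset(const int8_t *a, const int8_t *b, size_t k)
{
    assert(k == 0 || (a != NULL && b != NULL));
    int32_t acc = 0;
    int32_t b_sum = 0; /* the commit computes this sum while packing B */
    for (size_t p = 0; p < k; ++p) {
        uint8_t a_u8 = (uint8_t)((int32_t)a[p] + 128); /* 0..255 */
        acc += (int32_t)a_u8 * (int32_t)b[p];
        b_sum += (int32_t)b[p];
    }
    /* sum((a + 128) * b) - 128 * sum(b) == sum(a * b) */
    return acc - 128 * b_sum;
}
```

Both functions return the same value for any int8 inputs, which is why the column sums of B only need to be computed once, at pack time, and applied after the accumulation.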

AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
2023-04-21 05:30:38 -04:00
mkadavil
3572baa9d3 aocl_softmax_f32 api's for softmax computation as part of lpgemm.
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.

AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
2023-04-21 05:26:08 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
mkadavil
99d10c3f88 Low precision gemm u8s8s16 downscale optimization.
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to temporary s16 buffer can be avoided when k < KC since
intermediate accumulation will not be required for the pc loop (only 1
iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.

AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
2023-04-19 06:40:06 -04:00