Commit Graph

802 Commits

Author SHA1 Message Date
Vignesh Balasubramanian
06d776b025 AVX512 ZGEMM SUP Inner product kernels
- Implemented a set of column preferential dot-product based
  ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
  These kernels perform matrix multiplication as a sequence
  of inner products(i.e, dot-products).

- These standalone kernels are expected to strictly handle
  the CRC storage scheme for C, A and B matrices. RRC is also
  supported through operation transpose, at the framework
  level.

- Added unit-tests to test all the kernels(main and fringe),
  as well as the redirection between these kernels.

AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
2024-11-06 04:18:57 -05:00
Nallani Bhaskar
3e6f98f046 Bug fixes f32 to bf16 reorder API
Description:

_mm512_cvtne2ps_pbh(a, b) instruction takes
b when j<16 but the code was developed in
with assuming reverse order.

Fixed some indentation issues

Changed the file name and made it uniform

Change-Id: I7b45b4c35931d8febde7b7b5d9604ea953046f97
2024-11-05 12:58:15 +00:00
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 and process when machine supports
bf16 FMA instruction _mm512_dpbf16_ps but the model
is still in float

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to it's original format( row-major or col-major ).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row Vs col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
Deepak Negi
6ddde390aa Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix (odd strides) from
 column-major input format. This also supports the transB scenario if
 input B matrix is row major.

Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853
2024-10-22 03:01:39 -04:00
varshav2
dabfdf484a Add Scale post-op for F32 API
- Implemented the Scale post-op for the F32 API for all kernels
 - f32_scale = (f32 * scale_factor) + offset
 - Added the bench inputs

Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
2024-10-21 06:08:31 -04:00
Shubham Sharma.
2b02024404 Fixed bug in bli_zdotv_zen4_asm_avx512 kernel.
-  Data-type of n, and conj is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(n)),
   the "n" should be 64-bit otherwise incorrect values gets loaded.

Fix: We typecast these variables to int64_t before loading into registers.
Thanks to mangala.v@amd.com for finding this bug.

Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
2024-09-24 02:33:44 -04:00
Shubham Sharma
1833ee70cd Fixed bug in bli_dgemm_avx512 8x24 native kernel.
-  Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
   the "m" should be 64-bit otherwise incorrect values gets loaded.

Fix: We typecast these variables to int64_t before loading into registers.

AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
2024-09-23 04:54:04 -04:00
Deepak Negi
16653ed208 Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.

Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
2024-09-20 01:11:21 +05:30
varshav2
605517964b Add Transpose Kernel for A matrix in F32F32f32Of32
- Implemented the AVX512 packA kernel for col major inputs in F32 API
 - Removed the work arounds for n = 1,  mtag_a = PACK case, where the execution was
   being directed to GEMM instead of GEMV.

Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
2024-09-19 06:37:11 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
Vignesh Balasubramanian
ed682a429d Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels
- Added explicit typecast to the pointers that are passed
  to the _mm_prefetch( ... ) intrinsic, to avoid compiler
  warnings.

AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
2024-09-12 11:37:06 +05:30
Meghana Vankadari
5120f98e12 Developed all WoQ kernels for bf16s4f32o<f32|bf16>
Description:

1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
   weights into bf16 performed in the kernel itself to reduce bandwidth.

2. These kernels are performing better compared to bf16 weights when m
   is small and n is large.

3. Established a threshold to do quantization support at packing of
   B (KCXNC) level or WoQ kernel level.

Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
2024-09-10 12:00:10 +00:00
varshav2
298a165718 Add TransA and TransB support for F32F32F32oF32
- Added support for TransA and transB in f32f32of32 APIs
 - Modified the GEMV case(m == 1) to support PACKB feature
 - Redirecting the operations to GEMM instead of GEMV in case of n == 1
   conditions, with storage scheme r/transA and c/transB to avoid the
   packing errors which would lead to failures in computation.

Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
2024-09-09 05:19:49 +05:30
Meghana Vankadari
687abe4c96 Bug fix in WOQ kernel for m=4 case.
- Updated pre_op_off computation for nr0 < NR cases.
- Fixed warnings in bench file.

Change-Id: Iae30fa84b6b47ebd94ab05d2139056aee24546d7
2024-09-05 05:00:30 +00:00
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
Deepak Negi
6dcf500703 Element wise operations API for float(f32) input matrix in LPGEMM.
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24

AMD Internal: [SWLCSG-2947]

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
2024-08-27 03:28:52 -04:00
Chandrashekara K R
2ff0125f11 CMake: Enabled ADDON(aocl_gemm) feature for Windows.
1. Updated datatype from __int64_t to int64_t. Since
   __int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows

Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
2024-08-23 07:05:28 -04:00
Vignesh Balasubramanian
93631410a3 Bugfix : Fixed memory accesses in AVX512 SGEMMSUP RD kernels
- Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling
  m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction
  that strictly needed YMMs.

- Further updated the existing micro-kernel test cases to simulate
  these issues and validate the fix.

AMD-Internal: [CPUPL-5353]
Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1
2024-08-19 17:18:31 +05:30
Vignesh Balasubramanian
fd47d0aebb Fixing compiler warnings when exporting internal BLIS kernels
- Added the attribute to export symbols, in the header file that
  contains the L1 kernel declarations. This attribute was previously
  added as part of the kernel definitions.

AMD-Internal: [CPUPL-4415]
Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710
2024-08-12 12:30:51 +05:30
Meghana Vankadari
5514c7a75f Added LPGEMV(n=1) kernels for s8s8s32os32|s8 and s8s8s16os16|s8 APIs
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n==1.

AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
2024-08-09 06:17:52 -04:00
Edward Smyth
7fff7b4026 Code cleanup: Miscellaneous fixes
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
  in frame/3/bli_l3_sup.c, currently assumes that non-x86
  platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
  builds.
- Add guards around omp pragma to avoid possible gcc
  compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
  kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.

AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
2024-08-06 06:56:01 -04:00
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Nallani Bhaskar
e712673ab7 Peformance fixes for gcc compiler in fringe kernels
Description:
1. GCC avoiding loading b into registers in m fringe
   kenrels of int8 kernels. Instead gcc generating
   fma with memory as an operand for B input.

2. This is causing performance regression for larger n
   where each fma needs to load the input from memory
   again and again.

3. This is observed with gcc but not with clang.

4. Inserted dummy shuffle instructions for b data to
   further explicitly tell compiler that b needs to be in
   registers.

   AMD-Internal: SWLCSG-2948

Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6
2024-08-05 14:31:22 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
mkadavil
9f5fec7713 Matrix MUL op support in element wise operations API for bfloat16.
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.

AMD Internal: [SWLCSG-2947, SWLCSG-2953]

Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
2024-08-05 08:29:42 +05:30
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
Nallani Bhaskar
4c2f436cce Peformance fixes for gcc compiler in fringe kernels
Description:
1. GCC avoiding loading b into registers in m fringe
   kenrels of int8 kernels. Instead gcc generating
   fma with memory as an operand for B input.

2. This is causing performance regression for larger n
   where each fma needs to load the input from memory
   again and again.

3. This is observed with gcc but not with clang.

4. Inserted dummy shuffle instructions for b data to
   further explicitly tell compiler that b needs to be in
   registers.

5. Moved packb_s4_to_bf16 under JIT macro to resovle
   compilation  issue with gcc version < 11.2

AMD-Internal: SWLCSG-2948

Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a
2024-08-05 08:13:06 -04:00
mkadavil
f040ba617f Element wise operations API for bfloat16 input matrix in LPGEMM.
-This API supports applying element wise operations (eg: post-ops) on a
bfloat16 input matrix to get an output matrix of the same(bfloat16) or
upscaled data type (float).
-Benchmarking/testing framework for the same is added.

AMD Internal: SWLCSG-2947

Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2
2024-08-05 07:17:08 -04:00
Arnav Sharma
0a5c057475 DGEMV Optimizations for Tiny Sizes
- Added reference kernel for dgemv that handles computation for tiny
  sizes (m < 8 && n < 8).

- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
  row/column storage schemes as well as transpose and no transpose
  cases.

- Added additional unit-tests for functional verification.

AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
2024-08-05 12:19:42 +05:30
Vignesh Balasubramanian
9843bd0317 Tuning the decision logic to choose SUP vs Native for ZGEMM
- Added an additional decision logic to choose between SUP and
  Native paths for zen4 and zen5 micro-architectures, based on
  the input dimensions. This logic has been added to the
  architecture-specific thresholds functions, that are registered
  in the context.

- The decision logic will overrule the discrete thresholds present
  in the zen4 and zen5 contexts.

AMD-Internal: [CPUPL-5547]
Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03
2024-08-03 19:08:07 +05:30
Shubham Sharma
0d95fcf20c Revert "DGEMM Native AVX512 updates"
This reverts commit f378fc57b5.

Reason for revert: Causing Failure

AMD-Internal: [CPUPL-5262]
Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f
2024-08-02 07:32:28 -04:00
Moripalli Chitra
8b486e8d14 Added new decision logic to choose between 6x8 dgemm kernel vs 24x8 kernel. The decision is based on the values of "m, n and k".
Change-Id: I307ff002797ccef5bd61106b808cecb069b91fd6
2024-08-02 14:18:58 +05:30
Ruchika Ashtankar
92fbd04238 DGEMM SUP Optimizations for Turin
- Introduced a new 24x8 column preferred DGEMM sup kernel for zen5.
- A prefetch logic is modified compared to zen4 24x8 sup kernels.
- Earlier, next panel of A is prefetched into L2 cache,
  which is now modified to prefetching the second next column
  of the current panel of A into L1 cache.
- B and C prefetches are enabled and unchanged.
- Tuned MC, KC and NC block sizes for new kernel.

AMD-Internal: [CPUPL-5262]
Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d
2024-08-02 04:00:51 -04:00
Ruchika Ashtankar
5760e06100 Threshold tuning for DGEMM SUP for zen5
- New Decision threshold constants are added to decide between
double precision sup vs native dgemm code-path for zen5 processors.
- The decision is based on the values of m, n and k.

AMD-Internal: [CPUPL-5262]
Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040
2024-08-02 11:34:32 +05:30
Vignesh Balasubramanian
4ec2bad744 Updating reduction step of AVX512 DNRM2 API
- Updated the final reduction of partial sums to use scalar accumulation
  entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic.
  This will in turn change the associativity and the rounding-off
  pattern in the reduction step.

- Defined a union data-type to do the same, by having a 512-bit
  register and a double-precision array as its members.

- Updated the declaration and usage of the register variable according
  to the union definition, for uniformity.

AMD-Internal: [CPUPL-5472]
Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679
2024-08-01 17:09:17 +05:30
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch - for m, n non-multiple of MR and NR
  respectively we are calling bli_dgemm_ker_var2. Now we have
  implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
Vignesh Balasubramanian
f23b8e636b AVX2 and AVX512 optimizations for DAXPYV
- Removed some of the unrolling factors that affected the
  performance of AVX2 DAXPYV kernel. In addition to improving
  the current performance on sizes compatible to single-threaded
  runs, this will now perform better for tiny sizes as well
  since the overhead to reach the computation is less.

- Updated the vector partitioning logic, by using
  bli_thread_range_sub( ... ), which ensures that there is no
  false sharing among multiple threads.

- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
  or zen4 and zen5 micro-architectures.

AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
2024-07-31 09:24:36 -04:00
Hari Govind S
e2e95a09b0 Fixing missing registers in end_asm for copyv APIs
-  Added the missing registers in end_asm for scopy,
   dcopy and zcopy APIs.

-  Removed unnecessary registers from end_asm for scopy
   and dcopy APIs.

-  Corrected mistakes in the comments.

Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
2024-07-30 15:09:52 +05:30
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
49949f488f Implemented on-the-go pack kernel for s4->bf16
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
  new LPGEMM APIs are added where datatypes are A: bf16,
  B: int4, C: f32/bf16. To support this, B matrix will
  be reordered with type still being int4. New pack
  kernels that packs the reordered B matrix after
  converting the data from int4 to bf16 and applying
  zero-point and scale are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
2024-07-25 04:13:05 +05:30
mkadavil
7114376519 New kernels for int4 B matrix reordering following BF16 kernel schema.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
2024-07-25 01:10:13 -04:00
Hari Govind S
eacad443e3 Optimization for DCOPY and SCOPY API
-  Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
   kernel.

-  Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
   and "bli_scopyv_zen4_asm_avx512" kernels.

-  Replaced existing load balancing algorithm for dcopy API with
   "bli_thread_range_sub" algorithm.

-  Included AOCL-dynamic values for optimial number of threads
   for zen5 architecture.

AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
2024-07-24 08:23:07 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
mkadavil
42e539b878 Quantization (scale + zero point) updates/fixes for BF16 LPGEMM api.
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.

AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018
2024-07-23 04:39:14 +05:30
Shubham Sharma
16c56e0101 Added 24x8 triangular kernels for DGEMMT SUP
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
   24x8 triangular AVX512 DGEMMT SUP kernels are added.
 - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
   pattern repeats every 24x24 block of C. To cover this 24x24 block,
   3 kernels are needed for one variant of DGEMMT. A total of 6
   kernels are needed to cover both upper and lower variants.
 - In order to maximize code reuse, the 24x8 kernels are broken
   into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
   diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
   full GEMM part is computed by 24x8 DGEMM SUP kernel.
 - Changes are made in framework to enable the use of these kernels.

AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30