Commit Graph

3454 Commits

Author SHA1 Message Date
Shubham Sharma.
45d82a1ebf Threshold tuning for DTRSM on zen5
- Added new decision logic to choose between
  native TRSM vs unpacked small TRSM for
  double precision.
- The changes are made for zen5 processor.

AMD-Internal: [CPUPL-5534]
Change-Id: I5204f6df111edec27d006daeb1c2b535a67b3e46
2024-08-01 11:27:28 -04:00
Vignesh Balasubramanian
4ec2bad744 Updating reduction step of AVX512 DNRM2 API
- Updated the final reduction of partial sums to use scalar accumulation
  entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic.
  This will in turn change the associativity and the rounding-off
  pattern in the reduction step.

- Defined a union data-type to do the same, by having a 512-bit
  register and a double-precision array as its members.

- Updated the declaration and usage of the register variable according
  to the union definition, for uniformity.

AMD-Internal: [CPUPL-5472]
Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679
2024-08-01 17:09:17 +05:30
Edward Smyth
75f21182bd GTestSuite: IIT and ERS test improvements
Various improvements:
- Where appropriate, test both:
  - with nullptr for suitable arguments that should never
    be touched.
  - with all arguments correct except the one we want to
    test, to check we are not returning early because
    another argument is a nullptr.
- Test incorrect values for order argument in CBLAS calls.
- Test early exits with limited data changes, e.g. set
  C to 0 or scale C in GEMM when alpha = 0.
- Bugfix in gemmt test when alpha is 0 and beta is 1.
- Use reference library gemmt for comparison when library
  is not netlib BLAS.

AMD-Internal: [CPUPL-4500]
Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b
2024-07-31 15:36:01 -04:00
Shubham Sharma.
f378fc57b5 DGEMM Native AVX512 updates
- In the initial patch - for m, n non-multiple of MR and NR
  respectively we are calling bli_dgemm_ker_var2. Now we have
  implemented macro-kernel for these fringe cases as well.
- Replaced RBP register with R11 in the macro-kernel.
- Retuned MC, KC and NC with these new changes.
  This will result in better performance for matrix sizes
  like m=4000 or greater when running on single thread.


AMD-Internal: [CPUPL-5262]
Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a
2024-07-31 12:23:34 -04:00
Eleni Vlachopoulou
20d6a9a9f3 CMake: Add installation of .pc files.
AMD-Internal: [CPUPL-4938]
Change-Id: Iaf1ad702e61d8a81ee9ae6496ff3ba0dda21eceb
2024-07-31 11:54:57 -04:00
Vignesh Balasubramanian
f23b8e636b AVX2 and AVX512 optimizations for DAXPYV
- Removed some of the unrolling factors that affected the
  performance of AVX2 DAXPYV kernel. In addition to improving
  the current performance on sizes compatible to single-threaded
  runs, this will now perform better for tiny sizes as well
  since the overhead to reach the computation is less.

- Updated the vector partitioning logic, by using
  bli_thread_range_sub( ... ), which ensures that there is no
  false sharing among multiple threads.

- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
  or zen4 and zen5 micro-architectures.

AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
2024-07-31 09:24:36 -04:00
Hari Govind S
e2e95a09b0 Fixing missing registers in end_asm for copyv APIs
-  Added the missing registers in end_asm for scopy,
   dcopy and zcopy APIs.

-  Removed unnecessary registers from end_asm for scopy
   and dcopy APIs.

-  Corrected mistakes in the comments.

Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498
2024-07-30 15:09:52 +05:30
Meghana Vankadari
d5b4d3aa5e Fixing control flow in aocl_gemm_bf16s4f32of32|bf16
- Fixed framework of bf16s4f32of32 API to correct
  pointer updations.
- Modified pre_op structure to exclude pre-op-offset.
  Now offset is passed as a separate parameter to the
  scale-pack functions.
- Fixed work-distribution among threads in MT scenario.
- Added Blocksizes and kernel-pointers and verified
  functionality for the new API.

AMD-Internal: [SWLCSG-2943]
Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d
2024-07-29 05:12:09 -04:00
Edward Smyth
b90e12dfa4 GTestSuite: copyright notice
Standardize format of copyright notice.

AMD-Internal: [CPUPL-4500]
Change-Id: I6bde64c15ff639492dd0de95423c660112a37e2c
2024-07-26 15:34:41 -04:00
Edward Smyth
ea286cf6f6 GTestSuite: whitespace at end of lines
Unnecessary whitespace (spaces, tabs) at the end of lines
has been removed.

AMD-Internal: [CPUPL-4500]
Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22
2024-07-26 12:12:56 -04:00
Edward Smyth
4183efa722 GTestSuite: No newline at end of file
Add missing newline at the end of these files.

AMD-Internal: [CPUPL-4500]
Change-Id: I835cc73de0008b66ae3cf77fbb3daa1c8fcaaa7f
2024-07-26 11:42:57 -04:00
Edward Smyth
46fe3f3dcb GTestSuite: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency.

AMD-Internal: [CPUPL-4500]
Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8
2024-07-26 11:09:06 -04:00
Edward Smyth
8848ecb103 Improvements to CBLAS xerbla functionality
Currently the CBLAS xerbla always prints and always stops
on error. This commits adds similar functionality to the
regular BLAS xerbla to match the changes in 6d0444497f,
namely:

- Option to stop in xerbla on error. This is controlled by
  setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
  is controlled by setting the environment variable
  BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
  assuming xerbla was not set to stop on error. Example call is

     info = bli_info_get_info_value();

The default behaviour remains to print but has been changed to
not stop on error, i.e. the equivalent to

     export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0

AMD-Internal: [CPUPL-5361]
Change-Id: Icd6125fd60da139e3ec0969e52337a1ed515f0a2
2024-07-26 10:36:37 -04:00
Vignesh Balasubramanian
68c54297bd Fixing compiler warnings when configuring BLIS without OpenMP
- Adjusted the macro-guards for variables specific to
  multithreading, when BLIS is configured with OpenMP.

- This included calling the single-threaded kernel directly
  if increment is 0 as well, since this would remove an
  unnecessary dependency on one of the variables used only
  when we enable OpenMP.

- Further updated the condition to pack the vector, to
  avoid it when increment is 0. In this case, we directly
  call the kernel.

AMD-Internal: [CPUPL-5480]
Change-Id: I31a9c6e3ffc3c4f9d5b03ed8745919ad65c99c79
2024-07-25 10:29:33 -04:00
mkadavil
ec8c39541e Test/benchmark framework updates to test WOQ workflow.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
have been developed where data types are A:bf16, B:int4 and C:f32/bf16.
The testing and benchmarking framework for the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Icdc1d60819a23dd9f41382499d1a3c055c5edc17
2024-07-25 06:44:37 +05:30
Nallani Bhaskar
c6dd7c1b4b Added new API in aocl_gemm to support A bf16 data type and B s4 data type
Description:

1. Added a new API aocl_gemm_bf16s4f32of32 to support
   for WoQ (Weight-only-Quantization) in LLM's

2. The API supports only reordered B matrix of data
   size signed 4 bits (S4).

3. Substracting zero point and multiplying with scale
   on B matrix is performed in packing B.

4. zero point and scale data should be passed by user
   through pre-ops data structure.

5. The API is still in experimental state and NOT tested.

   AMD-Internal: SWLCSG-2943

Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee
2024-07-25 11:59:03 +00:00
Meghana Vankadari
49949f488f Implemented on-the-go pack kernel for s4->bf16
Details:
- To enable Weight-only-Quantization(WOQ) workflow,
  new LPGEMM APIs are added where datatypes are A: bf16,
  B: int4, C: f32/bf16. To support this, B matrix will
  be reordered with type still being int4. New pack
  kernels that packs the reordered B matrix after
  converting the data from int4 to bf16 and applying
  zero-point and scale are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03
2024-07-25 04:13:05 +05:30
mkadavil
7114376519 New kernels for int4 B matrix reordering following BF16 kernel schema.
-To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs
are required where data types are A:bf16, B:int4 and C:f32/bf16. It
is expected that the BF16 kernels will be reused within this API and
subsequently the B matrix needs to be reordered following the BF16
kernel schema, but with the reordered matrix type still being int4. To
address this, new BF16 reorder kernels enabling the same are added.

AMD-Internal: [SWLCSG-2943]
Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738
2024-07-25 01:10:13 -04:00
Hari Govind S
eacad443e3 Optimization for DCOPY and SCOPY API
-  Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
   kernel.

-  Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
   and "bli_scopyv_zen4_asm_avx512" kernels.

-  Replaced existing load balancing algorithm for dcopy API with
   "bli_thread_range_sub" algorithm.

-  Included AOCL-dynamic values for optimial number of threads
   for zen5 architecture.

AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
2024-07-24 08:23:07 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
Shubham Sharma
15ef6532e9 BugFix in DGEMMT SUP AVX512 code path
- Logic to calculate the kernel index in AVX512
  DGEMMT SUP framework is incorrect.
- The granularity for workload distribution along N
  dimension is NR(8), whereas current logic to pick
  diagonal kernel assumes the granularity to be MR (24).
- To Fix this, the logic to determine the kernel index is
  changed, instead of relying solely on n_offset, the kernel
  index is derived depending on distance from the diagonal.
- If distance from diagonal is greater than
  LCM of (MR and NR) - NR, that that means the current micro
  panel is not a diagonal micro panel.
- If the micro panel is a diagonal micro panel, then the
  distance from diagonal is equal to the M dimension for
  initial full GEMM region or empty region of diagonal
  kernel. This info can be used to determine the kernel index.

AMD-Internal: [CPUPL-5440]
Change-Id: I640d3a1b43e63b24bc9f0ed4a67cced45f6fa3b3
2024-07-24 06:36:34 +00:00
mkadavil
42e539b878 Quantization (scale + zero point) updates/fixes for BF16 LPGEMM api.
-_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc
(<gcc 12.2) and subsequently throws a compilation error. This is fixed
by replacing this intrinsic with a macro that achieves the bf16 to f32
conversion via shift operations.
-Bug fixes in the vector scale factor load in fringe kernels.

AMD-Internal: [SWLCSG-2945]
Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018
2024-07-23 04:39:14 +05:30
Shubham Sharma
16c56e0101 Added 24x8 triangular kernels for DGEMMT SUP
- In order to reuse 24x8 AVX512 DGEMM SUP kernels,
   24x8 triangular AVX512 DGEMMT SUP kernels are added.
 - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal
   pattern repeats every 24x24 block of C. To cover this 24x24 block,
   3 kernels are needed for one variant of DGEMMT. A total of 6
   kernels are needed to cover both upper and lower variants.
 - In order to maximize code reuse, the 24x8 kernels are broken
   into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8
   diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8
   full GEMM part is computed by 24x8 DGEMM SUP kernel.
 - Changes are made in framework to enable the use of these kernels.

AMD-Internal: [CPUPL-5338]
Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a
2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian
b48e864e82 AVX512 optimizations for DAXPBYV API
- Implemented AVX512 computational kernel for DAXPBYV
  with optimal unrolling. Further implemented the other
  missing kernels that would be required to decompose
  the computation in special cases, namely the AVX512
  DADDV and DSCAL2V kernels.

- Updated the zen4 and zen5 contexts to ensure any query
  to acquire the kernel pointer for DAXPBYV returns the
  address of the new kernel.

- Added micro-kernel units tests to GTestsuite to check
  for functionality and out-of-bounds reads and writes.

AMD-Internal: [CPUPL-5406][CPUPL-5421]
Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6
2024-07-22 11:32:19 +05:30
Shubham Sharma
75df1ef218 Removed -fno-tree-loop-vectorize from kernel flags
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
  instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
  vectorize the code which results in usages of
  vector registers, if the auto vectorized function
  is using intrinsics then the total numbers of vector
  registers used by intrinsic and auto vectorized
  code becomes more than the registers
  available in machine which causes read and writes
  to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
  do not use any intrinsic code do not get auto
  vectorized.
- To get optimal performance for both blis and lpgemm,
  this flag is enabled for lpgemm kernels only.

Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
2024-07-19 00:49:52 -04:00
Vignesh Balasubramanian
cec9fdcc6e Framework enhancements for ?AXPBYV APIs
- Implemented a new front-end for the BLAS/CBLAS calls
  to ?AXPBYV(BLAS-extension API), that is intended to
  be compiled only on Zen micro-architectures(as per the
  existing build system).

- This new front-end makes the framework lightweight for
  BLAS/CBLAS calls to ?AXPBYV, by directly querying the
  architecture ID and deploying the associated computational
  kernel.

- Further updated the rerouting to other L1 kernels based
  on alpha and beta value. This was initially present in
  the Typed-API interface. It has been moved inside the
  respective kernels, and only necessary rerouting is done
  to specific L1 kernels to avoid redundant checks.

AMD-Internal: [CPUPL-5406]
Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e
2024-07-18 10:06:31 -04:00
mkadavil
d37c91dffa Quantization (scale + zero point) support for BF16 LPGEMM api.
-Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point)
instead of just type conversion in aocl_gemm_bf16bf16f32obf16.
-Support for multiple scale/sum/matrix_add/bias post-ops in a single
LPGEMM api call.
-Post-ops mask related fixes in lpgemv kernels .
-Additional scale post-ops sanity checks.

AMD-Internal: [SWLCSG-2945]
Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb
2024-07-18 05:32:51 -04:00
Arnav Sharma
d5e29e3c7b CSCALV Framework Bugfix
- Fixed bug for non-zen architecture where CSCALV framework incorrectly
  fetches the dcomplex (ZSCALV) kernel pointer.

AMD-Internal: [CPUPL-5299]
Change-Id: I1d16588aa9dffd8b9dca69860026e377fa74d547
2024-07-17 00:27:47 +05:30
Hari Govind S
38824244d5 Implementation of AXPYF Kernels for DTRSV
-  Implemented two new axpyf kernels for fused factors 8 and 12
   by manually unrolling the loops. Used to achieve better performance
   in var2 case.

AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
2024-07-16 06:23:01 -04:00
Arnav Sharma
4aa66f108e Added CSCALV AVX512 Kernel
- Added CSCALV kernel utilizing the AVX512 ISA.

- Added function pointers for the same to zen4 and zen5 contexts.

- Updated the BLAS interface to invoke respective CSCALV kernels based
  on the architecture.

- Added UKR tests for bli_cscalv_zen_int_avx512( ... ).

AMD-Internal: [CPUPL-5299]
Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3
2024-07-15 07:17:43 -04:00
srikanth pogula
1d7f6d414f Bench APPs - change in Print statement for more params
>Made changes in the print statements in bench files
         to print all the params of the individual APIs

        > Ex : removing tab & adding Func param
         "Dt\t n\t incx\t incy\t gflops\n" --> "Func Dt n incx incy gflops\n"

        > Ex : adding func, incx, incy params
         "dt_ch, n, alpha_r, alpha_i, beta_r, beta_i, gflops" --> "tmp, dt_ch, n, alpha_r, alpha_i, incx,
                                                                       beta_r, beta_i, incy, gflops"

Change-Id: Ib5d151d7472d3f88c13a85a615a447dfa5e6b528
2024-07-11 02:04:19 -04:00
Nallani Bhaskar
d5133e4363 Fixed linking issue with bli_print_msg function
Description:

In recent changes bli_print_msg is used in lpgemm
test application file bench_lpgemm.c for printing error
message. bli_print_msg is a blis library function which
is not exported for the usage of applications, because
of which linking failed when blis shared library is used
to build.

Updated bli_print_msg with printf in the bench_lpgemm.c

AMD Internal: CPUPL-5326

Change-Id: I021849baa6881bd997013e42013db1c5c711627f
2024-07-09 23:18:53 +05:30
Shubham Sharma.
a7744361e4 DGEMM optimizations for Turin Classic
- Introduced new 8x24 macro kernels.
   - 4 new kernels are added for beta 0, beta 1, beta -1
      and beta N.
   - IR and JR loop moved to ASM region.
   - Kernels support row major storage scheme.
   - Prefetch of current micro panel of C is enabled.
   - Kernel supports negative offsets for A and B matrices.
 - Moved alpha scaling from DGEMM kernel to B pack kernel.
 - Tuned blocksizes for new kernel.
 - Added support for alpha scaling in 24xk pack kernel.
 - Reverted back to old b_next computation
   in gemm_ker_var2.
 - BugFix in 8x24 DGEMM kernel for beta 1,
   comparsion for jmp conditions was done using integer
   instructions, which caused beta 1 path to never be taken.
   Fixed this by changing the comparsion to double.

AMD-Internal: [CPUPL-5262]
Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f
2024-07-09 07:53:27 -04:00
vignbala
236d092656 AVX512 optimizations for ZGEMM to handle k = 1 cases
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
  be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
  for handling the GEMM computation with inputs having k = 1,
  with the transpose values being N(for column-major) and T(for
  row-major).

- Updated the zgemm_blis_impl( ... ) layer to query the architecture
  ID and invoke the AVX2 or AVX512 kernel accordingly.

- Added API level tests for accuracy and code-coverage, as well as
  micro-kernel tests for verifying functionality and out-of-bounds
  memory accesses.

AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
2024-07-09 07:07:24 -04:00
Hari Govind S
627bf0b1ba Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API
-  Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
   'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
   zen5  config.

-  Implemented multithreading support and AOCL-dynamic
   for ZAXPY API.

-  Utilized 'bli_thread_range_sub' function to achieve
   better work distribution and avoid false sharing.

AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
2024-07-09 01:29:53 -04:00
Varaganti, Kiran
2ac24d1f9c Avoided Extra copy of "c" matrix
Initailized c_save instead of 'c" and then removed copying c to c_save.
Because at the start every n_repeats iteration we are copying back c_save to c.
Therefore if we initialize c_save, we can avoid extra copy of "c" to c_save before calling
GEMM. For very large sizes matrix initialization takes considerable amount of time. This can
be reduced now.

Change-Id: I2c6ffe169e991607314897cb0c1fbfc0d74ef179
2024-07-09 00:54:03 -04:00
Edward Smyth
2ee46a3a3a Merge commit 'cfa3db3f' into amd-main
* commit 'cfa3db3f':
  Fixed bug in mixed-dt gemm introduced in e9da642.
  Removed support for 3m, 4m induced methods.
  Updated do_sde.sh to get SDE from GitHub.
  Disable SDE testing of old AMD microarchitectures.
  Fixed substitution bug in configure.
  Allow use of 1m with mixing of row/col-pref ukrs.

AMD-Internal: [CPUPL-2698]
Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f
2024-07-08 06:09:11 -04:00
Eleni Vlachopoulou
db2e353362 CMake: Adding presets for zen5 configuration with clang & gcc compiler.
Change-Id: Ieacc5eeaf8e9f4e1c77e2ff5c6fb455f7ff93393
2024-07-04 05:22:28 -04:00
Meghana Vankadari
4e6fa17c08 Bug fix in LPGEMV for INT8 APIs
Details:
- Corrected the usage of vpdpbusd instruction in
  GEMV implementation for INT8 APIs.
- Modified bench to fill matrices with values
  ranging between -5 and +5 whenever the datatype is
  a signed integer.

Change-Id: I457462b888b667d8a34c53de762e9b4aee784ecc
2024-06-27 04:22:04 +05:30
mkadavil
a26c85333a Int4 B matrix reordering support fixes in LPGEMM.
Reordering B matrix of datatype int4 is done as per the pack schema
requirements of u8s8s32 kernel. However for fringe cases, the
matrix pointer increments need to be halved to account for the half
byte size of int4 elements.

AMD-Internal: [SWLCSG-2390]
Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb
2024-06-26 05:39:45 +05:30
Edward Smyth
8de8dc2961 Merge commit '81e10346' into amd-main
* commit '81e10346':
  Alloc at least 1 elem in pool_t block_ptrs. (#560)
  Fix insufficient pool-growing logic in bli_pool.c. (#559)
  Arm SVE C/ZGEMM Fix FMOV 0 Mistake
  SH Kernel Unused Eigher
  Arm SVE C/ZGEMM Support *beta==0
  Arm SVE Config armsve Use ZGEMM/CGEMM
  Arm SVE: Update Perf. Graph
  Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
  Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
  A64FX Config Use ZGEMM/CGEMM
  Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
  Arm SVE Add SGEMM 2Vx10 Unindexed
  Arm SVE ZGEMM Support Gather Load / Scatt. St.
  Arm SVE Add ZGEMM 2Vx10 Unindexed
  Arm SVE Add ZGEMM 2Vx7 Unindexed
  Arm SVE Add ZGEMM 2Vx8 Unindexed
  Update Travis CI badge
  Armv8 Trash New Bulk Kernels
  Enable testing 1m in `make check`.
  Config ArmSVE Unregister 12xk. Move 12xk to Old
  Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
  Register firestorm into arm64 Metaconfig
  Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
  Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
  Add test for Apple M1 (firestorm)
  Firestorm CPUID Dispatcher
  Armv8 GEMMSUP Edge Cases Require Signed Ints
  Make error checking level a thread-local variable.
  Fix data race in testsuite.
  Update .appveyor.yml
  Firestorm Block Size Fixes
  Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
  Move unused ARM SVE kernels to "old" directory.
  Add an option to control whether or not to use @rpath.
  Fix $ORIGIN usage on linux.
  Arm micro-architecture dispatch (#344)
  Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
  Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
  Armv8 Fix 6x8 Row-Maj Ukr
  Apply patch from @xrq-phys.
  Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
  bli_error: more cleanup on the error strings array
  Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
  Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
  Fix config_name in bli_arch.c
  Arm Whole GEMMSUP Call Route is Asm/Int Optimized
  Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
  Header Typo
  Arm: DGEMMSUP ??r(rv) Invoke Edge Size
  Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
  Arm: Implement GEMMSUP Fallback Method
  Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
  Added Apple Firestorm (A14/M1) Subconfig
  Arm64 8x4 Kernel Use Less Regs
  Armv8-A Supplimentary GEMMSUP Sizes for RD
  Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
  Armv8-A Adjust Types for PACKM Kernels
  Armv8-A GEMMSUP-RD 6x8m
  Armv8-A GEMMSUP-RD 6x8n
  Armv8-A s/d Packing Kernels Fix Typo
  Armv8-A Introduced s/d Packing Kernels
  Armv8-A DGEMMSUP 6x8m Kernel
  Armv8-A DGEMMSUP Adjustments
  Armv8-A Add More DGEMMSUP
  Armv8-A Add GEMMSUP 4x8n Kernel
  Armv8-A Add Part of GEMMSUP 8x4m Kernel
  Armv8A DGEMM 4x4 Kernel WIP. Slow
  Armv8-A Add 8x4 Kernel WIP

AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
2024-06-25 05:48:46 -04:00
Edward Smyth
43d36b9f66 AOCL_ENABLE_INSTRUCTIONS improvements 2
Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is
unnecessary and incorrectly caused AVX512 code to be run
on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2
or equivalent options was selected.

Replace with code to select kernel in a similar way to other
dgemm code paths and other APIs. Note that at present AVX2 code
is used the smallest matrix sizes on all zen platforms.

AMD-Internal: [CPUPL-5078]
Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85
2024-06-25 04:57:38 -04:00
Vignesh Balasubramanian
02da190560 AVX512 optimizations for DNRM2
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
  computational kernel for DNRM2 API.

- Updated the header to include this kernel signature, as well
  as the framework layer to use this function in case of ZEN4
  and ZEN5 configurations.

- Updated the tipping points for ideal thread setting in DNRM2
  for ZEN5 micro-architecture. These thresholds are specific
  to the library's linkage to LLVM's OpenMP or GNU's OpenMp.

- Further abstracted the AOCL-DYNAMIC logic to separate functions
  for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).

- Further updated the ?NRM2 framework to accommodate the necessary
  changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
  kernel, when needed.

- Added micro-kernel and memory tests for this kernel in GTestsuite,
  to validate accuracy and out-of-bounds read and write.

AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
2024-06-24 08:50:36 -04:00
mkadavil
a5c4a8c7e0 Int4 B matrix reordering support in LPGEMM.
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.

AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
2024-06-24 07:55:34 -04:00
Meghana Vankadari
c1e063e65c Fix for offset issue while reading constants from JIT code
Details:
- For a variable x, Using address of x in an instruction throws
  exception if the difference between &x and access position is
  larger than 2 GiB. To solve this issue all variables are stored
  within the JIT code section and are accessed using relative addressing.

- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.

AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
2024-06-24 07:14:15 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Arnav Sharma
aa3adb8d69 Updated DOTXF and AXPYF Kernels
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
  when b_n > fuse_factor.

- The fused kernels are expected to invoke respective Level-1 kernels
  iteratively when b_n > fuse_factor.

AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
2024-06-20 04:41:41 -04:00
Mangala V
e9124ffca7 BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case
BUG:
When alpha real and imaginary is zero
Output is computed as C= Beta * C + A * B instead of C = Beta * C

FIX:
Updated kernel to scale A * B product with alpha in case of alpha=0

Existing framework design:
- When alpha real and imaginary value is zero, framework handles to skip
kernel call to avoid alpha * A * B operation
- SCALM is invoked to perform Beta * C

- Accuracy issue was not observed as alpha=0 was handled in framework
- If we call kernel directly with alpha=0, results would be wrong
- Issue was figured out during microkernel testing using gtestsuite

AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
2024-06-20 02:58:43 -04:00
Mangala V
90fe795c46 Gtestsuite: Enabled memory test for ZGEMM for k=0
AMD_Internal: [CPUPL-4657]

Change-Id: Ic5f4d24184f05e0f57634845b4fb3312b3a416f6
2024-06-20 02:51:47 -04:00
Shubham Sharma
1d6dd726cd Fixed Prefetch in Turin DGEMM kernel
- Fixed the prefetch of next micro panel
  of B matrix in 8x24 DGEMM kernel.

Change-Id: Id84bb2841abb86bda780062d67266377fda12038
2024-06-20 10:31:08 +05:30