Commit Graph

348 Commits

Author SHA1 Message Date
varshav2
ef04388a44 Added AVX2 support for BF16 kernels: Row major
- Currently the BF16 kernels uses the AVX512 VNNI instructions.
   In order to support AVX2 kernels, the BF16 input has to be converted
   to F32 and then the F32 kernels has to be executed.
 - Added un-pack function for the B-Matrix, which does the unpacking of
   the Re-ordered BF16 B-Matrix and converts it to Float.
 - Added a kernel, to convert the matrix data from Bf16 to F32 for the
   give input.
 - Added a new path to the BF16 5LOOP to work with the BF16 data, where
   the packed/unpacked A matrix is converted from BF16 to F32. The
   packed B matrix is converted from BF16 to F32 and the re-ordered B
   matrix is unre-ordered and converted to F32 before feeding to the
   F32 micro kernels.
 - Removed AVX512 condition checks in BF16 code path.
 - Added the Re-order reference code path to support BF16 AVX2.
 - Currently the F32 AVX-2 kernels supports only F32 BIAS support.
   Added BF16 support for BIAS post-op in F32 AVX2 kernels.
 - Bug fix in the test input generation script.

AMD Internal : [SWLCSG - 3281]

Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
2025-02-10 08:18:52 -05:00
Mithun Mohan
bffa92ec93 Deprecate S16 LPGEMM APIs.
-The following S16 APIs are removed:
1. aocl_gemm_u8s8s16os16
2. aocl_gemm_u8s8s16os8
3. aocl_gemm_u8s8s16ou8
4. aocl_gemm_s8s8s16os16
5. aocl_gemm_s8s8s16os8
along with the associated reorder APIs and corresponding
framework elements.

AMD-Internal: [CPUPL-6412]

Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b
2025-02-07 11:43:28 +00:00
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers

- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle reminder
  iterations. As only a single iteration is used when
  reminder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated getestsuite ukr tests for 'bli_ddotv_zen_int10'
  kernel.

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Deepak Negi
4ade159800 Buffer scale support for matrix add and matrix mul post-ops in f32 API.
Description:
1. Support has been added to scale buffer values using both scalar and
   vector scale factors before matrix add or matrix mul post-ops.

AMD-Internal: CPUPL-6340

Change-Id: Ie023d5963689897509ef3d5784c3592791e57125
2025-01-30 07:09:18 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize there kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, is it extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
c149b5a98b Optimisation of DCOPY kernel in zen3
-  Using 'if' condition instead of 'for'loop to handle fringe
   cases. 'for' loop is redundant for handling reminder iterations
   as only a single iteration is used when reminder is less than
   primary unroll factor.

AMD-Internal: [CPUPL-5594]
Change-Id: I8cebc037742ee47961869e22e2471e550fcd99e9
2025-01-24 05:55:59 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Vignesh Balasubramanian
a80436ab21 Standardizing the EVT compliance of {S/D}AMAXV API
- Updated the existing AVX2 {S/D}AMAXV kernels to comply
  to the standard when having exception values. This makes
  it exhibit the same behaviour as it AVX512 variants.
  Provided additional optimizations with loop unrolling.

- Removed redundant early return checks inside the kernels,
  since they have been abstracted to a higher layer.

- Updated the unit-tests(micro-kernel) and exception value
  tests for appropriate code-coverage. Also re-enabled the
  exception value tests.

AMD-Internal: [CPUPL-4745]
Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9
2025-01-13 06:56:31 -05:00
Edward Smyth
97ede96ed4 Correct duplicate object file names
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.

AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
2025-01-10 06:03:36 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters including dims,post-ops,stor-schemes etc.,
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Meghana Vankadari
051c9ac7a2 Bug fixes in F32 and INT8 APIs
Details:
- Fixed few bugs in downscale post-op for f32 datatype.
- Fixed a bug in setting strides of packB buffer in
  int8 APIs.

Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c
2025-01-09 04:07:26 -05:00
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

1. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

2. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

3. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

4. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.

For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Deepak Negi
60a8c71a1a Sigmoid and Tanh post-operation support for int8 API's.
Description:

Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_<s8|u8>s8<s32|s16>o<s8|u8|s32> API's

Sigmoid(x) = 1/1+e^(-x)
Tanh(x) = (1-e^(-2x))/(1+e^(2x))

Updated bench_lpgemm to recognize sigmod, tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: I9df3aab02222f728ff9d1f292c7bc549f30176f0
2024-11-15 05:36:31 -05:00
Deepak Negi
146f3b2eb2 Sigmoid and Tanh post-operation support for f32 API.
Description:

Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_f32f32f32of32 API's

Sigmoid(x) = 1/1+e^(-x)
Tanh(x) = (1-e^(-2x))/(1+e^(2x))

Updated bench_lpgemm to recognize sigmod, tanh
as options for post-ops from bench_input and verified.

AMD-Internal: [SWLCSG-3178]

Change-Id: Iac0a907f6dea1d9cb82d9fd8716bfdbf1c33921d
2024-11-15 04:20:20 -04:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
varshav2
dabfdf484a Add Scale post-op for F32 API
- Implemented the Scale post-op for the F32 API for all kernels
 - f32_scale = (f32 * scale_factor) + offset
 - Added the bench inputs

Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
2024-10-21 06:08:31 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
Vignesh Balasubramanian
ed682a429d Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels
- Added explicit typecast to the pointers that are passed
  to the _mm_prefetch( ... ) intrinsic, to avoid compiler
  warnings.

AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
2024-09-12 11:37:06 +05:30
Chandrashekara K R
2ff0125f11 CMake: Enabled ADDON(aocl_gemm) feature for Windows.
1. Updated datatype from __int64_t to int64_t. Since
   __int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows

Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
2024-08-23 07:05:28 -04:00
Vignesh Balasubramanian
fd47d0aebb Fixing compiler warnings when exporting internal BLIS kernels
- Added the attribute to export symbols, in the header file that
  contains the L1 kernel declarations. This attribute was previously
  added as part of the kernel definitions.

AMD-Internal: [CPUPL-4415]
Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710
2024-08-12 12:30:51 +05:30
Meghana Vankadari
5514c7a75f Added LPGEMV(n=1) kernels for s8s8s32os32|s8 and s8s8s16os16|s8 APIs
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n==1.

AMD-Internal: [SWLCSG-2354]
Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761
2024-08-09 06:17:52 -04:00
Edward Smyth
7fff7b4026 Code cleanup: Miscellaneous fixes
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
  in frame/3/bli_l3_sup.c, currently assumes that non-x86
  platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
  builds.
- Add guards around omp pragma to avoid possible gcc
  compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
  kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.

AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
2024-08-06 06:56:01 -04:00
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
mkadavil
9f5fec7713 Matrix MUL op support in element wise operations API for bfloat16.
-Matrix MUL op support added in main as well as fringe bfloat16 element
wise operations kernels.
-Benchmarking/testing framework for the same is added.
-Fixed issues in setting up post-ops node index.

AMD Internal: [SWLCSG-2947, SWLCSG-2953]

Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8
2024-08-05 08:29:42 +05:30
Deepak Negi
80bf6249f0 Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs.
This post-operation computes C = (beta*C + alpha*A*B) * D, where D
is a matrix with dimensions and data type the same as that of C matrix.

AMD-Internal: [SWLCSG-2953]

Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd
2024-08-05 08:25:32 -04:00
Arnav Sharma
0a5c057475 DGEMV Optimizations for Tiny Sizes
- Added reference kernel for dgemv that handles computation for tiny
  sizes (m < 8 && n < 8).

- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
  row/column storage schemes as well as transpose and no transpose
  cases.

- Added additional unit-tests for functional verification.

AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
2024-08-05 12:19:42 +05:30
Moripalli Chitra
8b486e8d14 Added new decision logic to choose between 6x8 dgemm kernel vs 24x8 kernel. The decision is based on the values of "m, n and k".
Change-Id: I307ff002797ccef5bd61106b808cecb069b91fd6
2024-08-02 14:18:58 +05:30
Vignesh Balasubramanian
f23b8e636b AVX2 and AVX512 optimizations for DAXPYV
- Removed some of the unrolling factors that affected the
  performance of AVX2 DAXPYV kernel. In addition to improving
  the current performance on sizes compatible to single-threaded
  runs, this will now perform better for tiny sizes as well
  since the overhead to reach the computation is less.

- Updated the vector partitioning logic, by using
  bli_thread_range_sub( ... ), which ensures that there is no
  false sharing among multiple threads.

- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
  or zen4 and zen5 micro-architectures.

AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
2024-07-31 09:24:36 -04:00
Vignesh Balasubramanian
cec9fdcc6e Framework enhancements for ?AXPBYV APIs
- Implemented a new front-end for the BLAS/CBLAS calls
  to ?AXPBYV(BLAS-extension API), that is intended to
  be compiled only on Zen micro-architectures(as per the
  existing build system).

- This new front-end makes the framework lightweight for
  BLAS/CBLAS calls to ?AXPBYV, by directly querying the
  architecture ID and deploying the associated computational
  kernel.

- Further updated the rerouting to other L1 kernels based
  on alpha and beta value. This was initially present in
  the Typed-API interface. It has been moved inside the
  respective kernels, and only necessary rerouting is done
  to specific L1 kernels to avoid redundant checks.

AMD-Internal: [CPUPL-5406]
Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e
2024-07-18 10:06:31 -04:00
vignbala
236d092656 AVX512 optimizations for ZGEMM to handle k = 1 cases
- Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to
  be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built
  for handling the GEMM computation with inputs having k = 1,
  with the transpose values being N(for column-major) and T(for
  row-major).

- Updated the zgemm_blis_impl( ... ) layer to query the architecture
  ID and invoke the AVX2 or AVX512 kernel accordingly.

- Added API level tests for accuracy and code-coverage, as well as
  micro-kernel tests for verifying functionality and out-of-bounds
  memory accesses.

AMD-Internal: [CPUPL-5249]
Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84
2024-07-09 07:07:24 -04:00
Edward Smyth
43d36b9f66 AOCL_ENABLE_INSTRUCTIONS improvements 2
Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is
unnecessary and incorrectly caused AVX512 code to be run
on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2
or equivalent options was selected.

Replace with code to select kernel in a similar way to other
dgemm code paths and other APIs. Note that at present AVX2 code
is used the smallest matrix sizes on all zen platforms.

AMD-Internal: [CPUPL-5078]
Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85
2024-06-25 04:57:38 -04:00
Vignesh Balasubramanian
6165001658 Bugfix and optimizations for ?AXPBYV API
- Updated the existing code-path for ?AXPBYV to
  reroute the inputs to the appropriate L1 kernel,
  based on the alpha and beta value. This is done
  in order to utilize sensible optimizations with
  regards to the compute and memory operations.

- Updated the typed API interface for ?AXPBYV to include
  an early exit condition(when n is 0, or when alpha is
  0 and beta is 1). Further updated this layer to query
  the right kernel from context, based on the input values
  of alpha and beta.

- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
  ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
  case handling in ?AXPBYV.

- Moved the early return with negative increments from ?SCAL2V
  kernels to its typed API interface.

- Updated the zen, zen2 and zen3 context to include function
  pointers for all these vector kernels.

- Updated the existing ?AXPBYV vector kernels to handle only
  the required computation. Additional cleanup was done to
  these kernels.

- Added accuracy and memory tests for AVX2 kernels of ?SETV
  ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs

- Updated the existing thresholds in ?AXPBYV tests for complex
  types. This is due to the fact that every complex multiplication
  involves two mul ops and one add op. Further added test-cases
  for API level accuracy check, that includes special cases of
  alpha and beta.

- Decomposed the reference call to ?AXPBYV with several other
  L1 BLAS APIs(in case of the reference not supporting its own
  ?AXPBYV API). The decomposition is done to match the exact
  operations that is done in BLIS based on alpha and/or beta
  values. This ensures that we test for our own compliance.

AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
2024-06-20 16:22:07 +05:30
Arnav Sharma
aa3adb8d69 Updated DOTXF and AXPYF Kernels
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
  when b_n > fuse_factor.

- The fused kernels are expected to invoke respective Level-1 kernels
  iteratively when b_n > fuse_factor.

AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
2024-06-20 04:41:41 -04:00
Mangala V
e9124ffca7 BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case
BUG:
When alpha real and imaginary is zero
Output is computed as C= Beta * C + A * B instead of C = Beta * C

FIX:
Updated kernel to scale A * B product with alpha in case of alpha=0

Existing framework design:
- When alpha real and imaginary value is zero, framework handles to skip
kernel call to avoid alpha * A * B operation
- SCALM is invoked to perform Beta * C

- Accuracy issue was not observed as alpha=0 was handled in framework
- If we call kernel directly with alpha=0, results would be wrong
- Issue was figured out during microkernel testing using gtestsuite

AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
2024-06-20 02:58:43 -04:00
Meghana Vankadari
c9254bd9e9 Implemented LPGEMV(n=1) for AVX2-INT8 variants
- When n=1, reorder of B matrix is avoided to efficiently
  process data. A dot-product based kernel is implemented to
  perform gemv when n=1.

AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
2024-06-18 12:09:18 +05:30
Edward Smyth
1f60b7c366 Export some BLIS internal symbols 2
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.

AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
2024-05-15 06:59:56 -04:00
Edward Smyth
62c886feee Export some BLIS internal symbols
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.

AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
2024-05-08 12:51:32 -04:00
mkadavil
118e955a22 SWISH post-op support for all LPGEMM APIs.
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1.

AMD-Internal: [SWLCSG-2387]
Change-Id: I55f50c74a8583a515f7ea58fa0878ccbcdd6cc26
2024-05-06 06:05:11 -04:00
Shubham Sharma
b9e21e8701 Added ZTRSM AVX512 small code path
- Kernel dimensions are 4x4.
  - Two kernels are implemented, Right Upper and
    Right lower.
  - In case of Left variants of TRSM, transpose is
    induced so that Right variant kernels can be used.
  - No packing is performed in these kernels.
  - Changes are made in the threshold to pick ZTRSM small
    code path.
  - BLIS_INLINE is removed from signature of
    "TRSMSMALL_KER_PROT".
  - These kernels do not support "ENABLE_TRSM_PREINVERSION".
  - Newly added kernels do not support conjugate
    transpose.
  - Added multithreading to ZTRSM small code path.

AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
2024-05-03 05:10:41 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Shubham Sharma
14bab0eb17 Fixed out of bounds read in CTRSM small kernel
- In 2x1 fringe case in [RUN/RLT] kernel, 3 scomplex
  precision numbers are being read instead of 1 scomplex.

- Fixed the code to read only one scomplex.

AMD-Internal: [CPUPL-4403]
Change-Id: If3ac03ed864618382d3a382a8cdff7ff8a94eb7d
2024-04-16 02:42:34 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Nallani Bhaskar
5070343318 Fixed load intrinsic in aocl-gemm addon f32 api
Description:

1. Replaced aligned load intrinsics _mm512_load_ps
   with unaligned load intrinsics _mm512_loadu_ps.
2. There is no guarantee that the memory address
   can be aligned everywhere. The changes are under
   beta multiplication. Copy paste error.

Change-Id: I978231b556e17ad7e66c5028ed1cd904c653e0a8
2024-03-20 06:24:32 -04:00
Bhaskar Nallani
2ce47e6f5e Implemented optimal AVX512-variant of f32 LPGEMV
1. The 5 LOOP LPGEMM path is in-efficient when A or B is a vector
   (i.e, m == 1 or n == 1).

2. An efficient implementation of lpgemv_rowvar_f32 is developed
   considering the b matrix reorder in case of m=1 and post-ops fusion.

3. When m = 1 the algorithm divide the GEMM workload in n dimension
   intelligently at a granularity of NR. Each thread work on A:1xk
   B:kx(>=NR) and produce C=1x(>NR).  K is unrolled by 4 along with
   remainder loop.

4. When n = 1 the algorithm divide the GEMM workload in m dimension
   intelligently at a granularity of MR. Each thread work on A:(>=MR)xk
   B:kx1 and produce C = (>=MR)x1. When n=1 reordering of B is avoided
   to efficiently process in n one kernel.

5. Fixed few warnings while loading 2 f32 bias elements using
   _mm_load_sd using float pointer. Typecasted to (const double *)

AMD-Internal: [SWLCSG-2391, SWLCSG-2353]
Change-Id: If1d0b8d59e0278f5f16b499de1d629e63da5b599
2024-03-04 23:53:23 +05:30
mkadavil
d00e84ced3 Matrix Add post-operation support for float(bf16|f32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.

AMD-Internal: [SWLCSG-2424]
Change-Id: I9464d1f514e3b04275fe93441489b4503a08937a
2024-02-23 02:02:33 -05:00
mkadavil
01b7f8c945 Matrix Add post-operation support for integer(s16|s32) LPGEMM APIs.
-This post-operation computes C = (beta*C + alpha*A*B) + D, where D is
a matrix with dimensions and data type the same as that of C matrix.
-For clang compilers (including aocc), -march=znver1 is not enabled for
zen kernels. Have updated CKVECFLAGS to capture the same.

AMD-Internal: [SWLCSG-2424]
Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09
2024-02-12 23:51:36 +05:30