Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers

- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle remainder
  iterations, since only a single iteration is needed when
  the remainder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly (fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated gtestsuite ukr tests for the 'bli_ddotv_zen_int10'
  kernel.
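The remainder-handling change described above can be illustrated with a scalar sketch (not the actual 'bli_ddotv_zen_int10' kernel; the unroll factor of 8 here is a stand-in): because the remainder is strictly less than the unroll factor, each descending-width check runs at most once, so 'if' conditions suffice where 'for' loops are redundant.

```c
#include <stddef.h>

/* Scalar sketch: main loop unrolled by 8; the remainder (at most 7)
   is handled by if-checks of width 4, 2 and 1, each taken at most once. */
static double ddot_sketch(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8)                 /* main unrolled loop */
        for (size_t j = 0; j < 8; ++j)
            acc += x[i + j] * y[i + j];
    if (i + 4 <= n) { for (size_t j = 0; j < 4; ++j) acc += x[i + j] * y[i + j]; i += 4; }
    if (i + 2 <= n) { acc += x[i] * y[i] + x[i + 1] * y[i + 1]; i += 2; }
    if (i < n)      { acc += x[i] * y[i]; }
    return acc;
}
```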

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Edward Smyth
bec9406996 Export some BLIS internal symbols (3)
libFLAME calls DAMAX kernel directly. Now that AVX512 version
has been enabled in BLIS cntx, export this symbol.

AMD-Internal: [CPUPL-5895]
Change-Id: I4c74150578f49eb643b0f68c6cc32ee2bb23bec2
2025-02-03 06:30:14 -05:00
Shubham Sharma
7695561f4e Tuned DGEMM blocksizes for ZEN5
- In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500
  were not tuned properly for cases where an application uses more than
  one instance of BLIS in parallel.
- Improved DGEMM performance for sizes where M, N >> K by retuning blocksizes.
  Such sizes are used by applications like HPL.

AMD-Internal: [SWLCSG-3338]
Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418
2025-02-03 05:40:07 -05:00
Shubham Sharma
50306c4854 Tuned ZEN4 DGEMM blocksizes
- Blocksizes for sizes where M >> K, N >> K and K < 500 were tuned by running
   BLIS bench on only one MPI rank. Blocksizes tuned this way do not perform
   well for all configurations.
- Retuned the blocksizes so that performance is good for such skinny sizes.

AMD-Internal: [CPUPL-6362]
Change-Id: I89c61889df2443ef6bf0e87bf89263768b5c00c1
2025-02-03 03:05:50 -05:00
Hari Govind S
67322416d3 Added support to benchmark ASUMV APIs
- Implemented the feature to benchmark ?ASUMV APIs
  for the supported datatypes. The feature allows
  benchmarking the BLAS, CBLAS or native BLIS API, based
  on the macro definition.

- Added a sample input file with examples for benchmarking
  ASUMV for all supported datatypes.

AMD-Internal: [CPUPL-5984]
Change-Id: Iff512166545687d12504babda1bd52d71a3a5755
AOCL-Feb2025-b1
2025-01-31 06:04:16 -05:00
Vignesh Balasubramanian
0e71d28c01 Additional bug-fix for AOCL-BLAS bench
- Corrected the format specifier setting (as a macro) to not
  include additional spaces, since the spaces would cause incorrect
  parsing of input files (in case they have exactly the expected
  number of parameters and not more).

- Updated the inputgemm.txt file to contain some inputs that
  have the exact parameters, to validate this fix.

AMD-Internal: [CPUPL-6365]
Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49
2025-01-30 08:28:14 -05:00
Deepak Negi
4ade159800 Buffer scale support for matrix add and matrix mul post-ops in f32 API.
Description:
1. Support has been added to scale buffer values using both scalar and
   vector scale factors before matrix add or matrix mul post-ops.

AMD-Internal: CPUPL-6340

Change-Id: Ie023d5963689897509ef3d5784c3592791e57125
2025-01-30 07:09:18 -05:00
Shubham Sharma
26bd265cfd Optimized DTRSV for tiny sizes
- Replaced switch case with if-else; the lookup table for the switch
  case is placed at the end of the .text section, which causes a huge jump.
- Reduced number of branches for tiny sizes.
- The cpuid query is slow, therefore added a new if statement which avoids
  the cpuid query for tiny sizes (< 200).
- Redirected tiny sizes to AVX2 kernel.
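The dispatch idea in this commit can be sketched as follows. All names here are hypothetical (they are not the actual BLIS symbols): an if-guarded fast path sends tiny sizes straight to an AVX2 kernel, skipping both the switch lookup table and the slow cpuid-based architecture query.

```c
#include <stddef.h>

typedef void (*dtrsv_fn)(size_t n);

/* Placeholder for the AVX2 tiny-size kernel (illustrative only). */
static void dtrsv_avx2_tiny(size_t n) { (void)n; }

/* Tiny sizes (< 200) bypass the cpuid-backed dispatch entirely;
   larger sizes fall through to the full architecture query. */
static dtrsv_fn select_dtrsv(size_t n, dtrsv_fn (*arch_query)(void))
{
    if (n < 200)              /* fast path: no cpuid query */
        return dtrsv_avx2_tiny;
    return arch_query();      /* full arch-based dispatch  */
}
```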

AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
2025-01-30 01:23:09 -05:00
harsh dave
c5e842e8d3 Revert changes to force dgemm inputs under threshold to arch-specific kernels in single-threaded mode
- This patch reverts the previous changes that removed the enforcement
  of dgemm inputs under a certain threshold to be processed by kernels
  selected based on architecture ID and handled in single-threaded mode.

- This change now forces such small inputs to be computed in the tiny
  path. Previously, without this check, these inputs were routed to the
  SUP path, causing a performance regression due to framework
  overhead.

AMD-Internal: [CPUPL-5927]
Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb
AOCL-LPGEMM-012925
2025-01-29 04:22:16 -05:00
Nallani Bhaskar
805bd10353 Updated all post-ops in u8s8s32 API to operate in float precision
Description:

1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
   on float data, by converting the s32 accumulator registers to float
   at the end of the k loop.

2. Added the u8s8s32ou8 API, which uses the u8s8s32os32 kernels but stores
   the output in u8.
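A scalar stand-in for the accumulator change (the real kernels convert AVX512 s32 registers; the 'alpha'/'bias' post-ops here are illustrative, not the API's actual post-op list): the k-loop accumulates in int32, one conversion to float follows, and every post-op then runs in f32 precision.

```c
#include <stdint.h>

/* s32 accumulator -> f32 once, at the end of the k loop;
   all subsequent post-ops operate on float data. */
static float s32_accum_postops_f32(int32_t acc, float alpha, float bias)
{
    float accf = (float)acc;     /* one-time s32 -> f32 conversion */
    accf = accf * alpha + bias;  /* post-ops in f32 precision      */
    return accf;
}
```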

AMD-Internal - SWLCSG-3366

Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
2025-01-29 04:21:17 -05:00
Vignesh Balasubramanian
445327f255 Bugfix for AOCL-BLAS bench application
- Bug : When configuring our library with the native
        BLIS integer size being 32, the bench application
        would crash or read an invalid value when parsing
        the input file. This is because of a mismatch
        in the format specifier, which we hard-set in the
        Makefile.

- Fix : Defined a header that sets the format specifiers
        as macros with the right types, based on how we
        configure and build the library. This header is
        expected to be included in every benchmarking
        source file.
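The fix can be sketched like this (macro and type names are illustrative, not the actual BLIS bench header): the printf/scanf specifier for the native integer size is selected at compile time instead of being hard-set in the Makefile, and the specifier strings carry no surrounding spaces.

```c
#include <stdio.h>

/* BENCH_INT64 is a hypothetical configuration switch standing in for
   however the build distinguishes 32- vs 64-bit native integers. */
#ifdef BENCH_INT64
  typedef long long bench_int_t;
  #define BENCH_INT_FMT "%lld"   /* no extra spaces in the specifier */
#else
  typedef int bench_int_t;
  #define BENCH_INT_FMT "%d"
#endif
```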

AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
2025-01-29 03:25:57 -05:00
Shubham Sharma
9fd2aebd25 Tuned DTRSM thresholds for ZEN5
- Tuned AOCL_dynamic thresholds for DTRSM for ZEN5.
- Tuned thresholds for better selection of code path.

AMD-Internal: [CPUPL-5408]
Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be
2025-01-28 07:33:10 -05:00
Edward Smyth
d9cabce0ba GTestSuite: Catch BLIS version executable errors
In case the executable to obtain the BLIS library version fails,
catch and report common errors to help with debugging.

Also correct the test for bli_info_get_info() support to mark
that it is not available in any AOCL version <= 4.1.

AMD-Internal: [CPUPL-4500]
Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8
2025-01-28 16:54:05 +05:30
Vignesh Balasubramanian
327142395b Cleanup for readability and uniformity of Tiny-ZGEMM
- Guarded the inclusion of thresholds (configuration
  headers) using macros, to maintain uniformity in
  the design principles.

- Updated the threshold macro names for every
  micro-architecture.

AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
2025-01-27 15:48:31 +05:30
Deepak Negi
db407fd202 Added F32 bias type support, F32, BF16 output type support in int8 APIs
Description:

1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs,
   where the inputs are uint8/int8 and the processing is done using
   VNNI, but the output is stored in f32 and bf16 formats. All the int8
   kernels are reused and updated with the new output data types.

2. Added F32 data type support in bias.

3. Updated the bench and bench input file to support validation.

AMD-Internal: SWLCSG-3335

Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
2025-01-26 11:38:30 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM that utilizes these kernels. This
  is currently instantiated for the 'Z' datatype (double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories (header files).

  - List of AVX2/AVX512 SUP kernels (lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design only requires the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architectures. Thus, it is extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface functionality (API-level
  tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
c149b5a98b Optimisation of DCOPY kernel in zen3
-  Using an 'if' condition instead of a 'for' loop to handle fringe
   cases. The 'for' loop is redundant for handling remainder iterations,
   as only a single iteration is used when the remainder is less than
   the primary unroll factor.

AMD-Internal: [CPUPL-5594]
Change-Id: I8cebc037742ee47961869e22e2471e550fcd99e9
2025-01-24 05:55:59 -05:00
Hari Govind S
106a2b1fe1 Gtestsuite: UKR test for GEMV kernels
-  Added support for gemv kernels unit test in gtestsuite.
-  Added micro-kernel tests and memory tests for DGEMV
   transpose case kernels.

AMD-Internal: [CPUPL-5835]
Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201
2025-01-24 05:09:33 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.
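The structure of the transpose-case kernel described above can be sketched in scalar C (illustrative only; the real kernels use AVX512/AVX2 registers, process 8 columns at a time and 32 elements per chunk, while the chunk widths below are stand-ins): each column of A is dotted with x in chunks held in accumulators, the accumulators are reduced, fringe elements are folded in, and the result is added into y.

```c
#include <stddef.h>

/* y += A^T * x for a column-major A (m x n, leading dimension lda). */
static void dgemv_t_sketch(size_t m, size_t n,
                           const double *a, size_t lda,
                           const double *x, double *y)
{
    for (size_t j = 0; j < n; ++j)          /* one dot product per column */
    {
        double acc[4] = { 0, 0, 0, 0 };     /* stand-in accumulator regs  */
        size_t i = 0;
        for (; i + 4 <= m; i += 4)          /* main chunked loop          */
            for (size_t k = 0; k < 4; ++k)
                acc[k] += a[j * lda + i + k] * x[i + k];
        double dot = acc[0] + acc[1] + acc[2] + acc[3];  /* reduction     */
        for (; i < m; ++i)                  /* fringe elements            */
            dot += a[j * lda + i] * x[i];
        y[j] += dot;
    }
}
```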

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Mithun Mohan
39289858b7 Packed A matrix stride update to account for fringe cases.
-When A matrix is packed, it is packed in blocks of MRxKC, to form a
whole packed MCxKC block. If the m value is not a multiple of MR, then
the m % MR block is packed in a different manner as opposed to the MR
blocks. Consequently, the strides of the packed MR block and the m % MR
blocks differ, and they need to be updated accordingly when calling the
GEMV kernels with a packed A matrix.
-Fixes to address compiler warnings.

AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
2025-01-22 05:42:30 -05:00
Arnav Sharma
66461b8df3 Improved Multi-threaded Performance of DSCALV
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
  architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
  incurred when the single-threaded path is optimally performant.

AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
2025-01-22 03:55:38 -05:00
Meghana Vankadari
69ca5dbcd6 Fixed compilation errors for gcc versions < 11.2
Details:
- Disabled the intrinsics code of the f32obf16 pack function
  for gcc < 11.2, as the instructions used in the kernels
  are not supported by these compiler versions.
- Added an early-return check for WOQ APIs when compiling with
  gcc < 11.2.
- Fixed code to check whether JIT kernels are generated inside
  the batch_gemm API for the bf16 datatype.

AMD Internal: [CPUPL-6327]

Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
2025-01-21 07:13:31 -05:00
Edward Smyth
3c2dedb13c Restore or add update from env in bli_thread_get APIs
We want bli_thread_get_num_threads() and bli_thread_get_*_nt()
to report the threading values modified to reflect what will
be in effect given OpenMP nesting and active levels. This was
lost in commit 0c6d006225 for
bli_thread_get_num_threads() and wasn't previously implemented
in bli_thread_get_*_nt().

AMD-Internal: [CPUPL-6168]
Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd
2025-01-20 08:58:22 -05:00
Deepak Negi
994098dd35 Fix for a bias data type in u8s8s32/s8s8s32 gemv APIs.
Description:
Added _mm512_cvtps_epi32 for bf16 to s32 conversion in gemv APIs.

AMD-Internal: SWLCSG-3302

Change-Id: I7e3e6da8f50d1f7177629cb68ac21e3bbce40bee
AOCL-Jan2025-b2
2025-01-16 01:43:38 -05:00
Deepak Negi
182a6373b5 Added support to specify bias data type in u8s8s32/s8s8s32 API's
Description:
1. The bias type was supported only based on output data type.
2. The option is added in the pre-ops structure to select the bias data
   type(s8/s32/bf16) irrespective of the storage data type in
   u8s8s32/s8s8s32 API's.

AMD-Internal: SWLCSG-3302

Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
2025-01-15 05:56:26 -05:00
Arnav Sharma
2f2741f4ab Fixed Bug in test_scalv
- Since bli_obj_length(&a) is being used to get the length of A vector,
  we need to initialize A as a vector of M x 1 dimension as the
  bli_obj_length(...) will return the M dimension of the object.
- The vector was being initialized with a dimension of 1 x N resulting
  in bli_obj_length(&a) always returning 1 as the length of the vector.

AMD-Internal: [CPUPL-6297]
Change-Id: Id0e79752f9b81c1573deda3dd32ef0fef10df50c
2025-01-14 23:53:53 -05:00
Vignesh Balasubramanian
a80436ab21 Standardizing the EVT compliance of {S/D}AMAXV API
- Updated the existing AVX2 {S/D}AMAXV kernels to comply
  with the standard when handling exception values. This makes
  them exhibit the same behaviour as their AVX512 variants.
  Provided additional optimizations with loop unrolling.

- Removed redundant early return checks inside the kernels,
  since they have been abstracted to a higher layer.

- Updated the unit-tests (micro-kernel) and exception value
  tests for appropriate code-coverage. Also re-enabled the
  exception value tests.

AMD-Internal: [CPUPL-4745]
Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9
2025-01-13 06:56:31 -05:00
harsh dave
7510e27007 DGEMM Optimizations
Refined thresholds to decide between native and sup DGEMM code-paths for both zen4 and zen5 processors.

AMD-Internal: [CPUPL-6300]
Change-Id: Ib32a256dba99a0a92b7ecaa7684443a66c459566
2025-01-13 01:09:39 -05:00
Edward Smyth
0ae5a0492f GTestSuite: fix to ukr tests for dgemm avx512 8x24 kernels
- Restore test for old bli_dgemm_zen4_asm_8x24 kernel, so that
  we can test this if linking with older AOCL versions.
- Move K_bli_dgemm_avx512_asm_8x24 definition from AOCL_42 list
  to AOCL_50 list.

AMD-Internal: [CPUPL-4500]
Change-Id: Id522f4bc5b89e86f77c4e1d26c75e261736ab450
2025-01-10 12:33:15 -05:00
Vignesh Balasubramanian
8e660215c3 Introduced fast-path in DAXPYV API
- Added a conditional check to invoke the vectorized
  DAXPYV kernels directly (fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

AMD-Internal: [CPUPL-4878]
Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb
2025-01-10 09:19:38 -05:00
Vignesh Balasubramanian
345204d69b Additional updates to the thresholds for ZGEMM small path
- Further updated the thresholds for entry to the ZGEMM small
  path (AVX2) when the execution is multithreaded. The newer
  thresholds account for skinnier inputs, which are better suited
  to the single-threaded small path than to the multithreaded
  SUP path.

AMD-Internal: [CPUPL-6040][CPUPL-5930]
Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6
2025-01-10 08:29:31 -05:00
Edward Smyth
97ede96ed4 Correct duplicate object file names
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.

AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
2025-01-10 06:03:36 -05:00
Mithun Mohan
7a25505f5c Simulation of spread like pattern in worker thread to core binding in LPGEMM.
-In multi-threaded cases, if a packed/close-pattern thread-to-core
binding is used (e.g.: OMP_PROC_BIND=close and OMP_PLACES=core|threads),
LPGEMM (OMP framework) launches threads such that threads with adjacent
ids are bound to nearby (even adjacent) cores. Depending on the
processor architecture, multiple threads with adjacent ids can be bound
to cores sharing the same last-level cache. However, it was observed
that when these threads (with adjacent ids) access the B reorder buffer,
the last-level cache access was suboptimal. This can be attributed to
the per-thread reorder buffer block accesses and how they map to the
last-level cache.
-In these cases, m is small (<= 4 * MR) and the n value is such that the
number of NR blocks (n/NR) is less than the available threads nt
(e.g. < 0.5 * nt). In such cases, the ids of the threads can be modified
so that the number of threads with adjacent ids bound to the same
last-level cache is reduced. This resembles the spread pattern used in
thread-to-core binding. It reduces the load on the last-level cache due
to reorder buffer accesses and improves performance in these cases. A
heuristic method is used to detect whether the thread-to-core binding
follows the close pattern before applying the thread id modifications.
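One way to picture the spread-like id remapping is the round-robin sketch below. It is illustrative only (not the actual LPGEMM heuristic), and 'threads_per_llc' is an assumed topology parameter: consecutive original ids are scattered across last-level-cache groups so that fewer adjacent ids share an LLC domain.

```c
/* Remap thread id tid (0 <= tid < nt) so that consecutive ids land in
   different LLC groups; the remapping is a permutation of [0, nt) when
   nt is a multiple of threads_per_llc. */
static int spread_tid(int tid, int nt, int threads_per_llc)
{
    int groups = nt / threads_per_llc;   /* number of LLC domains */
    if (groups <= 1)
        return tid;                      /* nothing to spread     */
    /* Group index cycles with tid; position within the group
       advances once per full round over all groups. */
    return (tid % groups) * threads_per_llc + tid / groups;
}
```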

AMD-Internal: [SWLCSG-3185]
Change-Id: Ie3c87d56e0f7b59161a381f382cf4e2d5d02a591
2025-01-10 06:02:06 -05:00
Edward Smyth
567039a7fe Fortran interfaces for bli_thread_get APIs
Create and export Fortran interfaces for bli_thread_get_num_threads()
and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs.

bli_thread_get_is_parallel() is intended for internal BLIS usage, so
not adding a Fortran interfaces for it at this time.

AMD-Internal: [CPUPL-6168]
Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1
2025-01-10 05:07:33 -05:00
Kiran Varaganti
eb8efd742d Added guard for including omp.h
aoclos.c:20 is #include <omp.h>, but this needs to be guarded by e.g. #ifdef BLIS_ENABLE_OPENMP,
otherwise it leads to compilation failure if the environment does not have OpenMP available.
(Affected platform: clang 20 on Ubuntu 24.04 LTS.)
Issue reported in https://github.com/amd/blis/issues/25, thanks to maychiew1988.
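The guard described in the commit looks like this (the fallback helper below is purely illustrative): omp.h is included only when BLIS is built with OpenMP, so toolchains without OpenMP still compile.

```c
#ifdef BLIS_ENABLE_OPENMP
#include <omp.h>
#endif

/* Illustrative use of the guard: report one thread when OpenMP is absent. */
static int example_max_threads(void)
{
#ifdef BLIS_ENABLE_OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```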

Change-Id: I4ea0b85f8194345f0534e17229acb4827193dfe6
2025-01-10 04:54:34 -05:00
Meghana Vankadari
852cdc6a9a Implemented batch_matmul for f32 & int8 datatypes
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.

AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
2025-01-10 04:10:53 -05:00
Mithun Mohan
ef4286a97e Multi-data type buffer and scale support for matrix add|mul post-ops in s32 API.
-As it stands the buffer type in matrix add|mul post-ops is expected to
be the same as that of the output C matrix type. This limitation is now
removed and user can specify the buffer type by setting the stor_type
attribute in add|mul post-op struct. As of now int8, int32, bfloat16 and
float types are supported for the buffer in s32 micro-kernels. The same
support is also added for bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in bias post-ops.

AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
2025-01-10 02:11:12 -05:00
Meghana Vankadari
051c9ac7a2 Bug fixes in F32 and INT8 APIs
Details:
- Fixed few bugs in downscale post-op for f32 datatype.
- Fixed a bug in setting strides of packB buffer in
  int8 APIs.

Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c
2025-01-09 04:07:26 -05:00
varshav
7b9d29f9b3 Adding post-ops for JIT kernels
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels.
- Masked the bf16s4 kernel call while JIT kernels are enabled to avoid a compile-time error.
- Added optional support for B-prefetch in the JIT kernels.
- Resolved the visibility issues in the global variable jit_krnels_generated.
- Modified the array generation for scale and zp values in the bench.

Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
2025-01-08 12:55:27 +00:00
Mithun Mohan
4a95f44d39 Buffer scale support for matrix add and matrix mul post-ops in bf16 API.
-Currently the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add/mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.

AMD-Internal: [SWLCSG-3181]
Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d
2025-01-08 10:35:50 +00:00
harsh dave
aeda581539 Fix: Rename bli_tiny_gemm.c to bli_tiny_gemm_amd.c
When compiling with the generic config (or any non-zen build),
the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny()
is only used within an AMD-specific file, bli_tiny_gemm.c has been renamed
to bli_tiny_gemm_amd.c to reflect its specific usage.

Thanks to Smyth, Edward<edward.smyth@amd.com> for identifying and helping to fix the issue.

Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307
2025-01-07 23:29:32 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug : The current {S/D}AMAXV AVX512 kernels produced
  incorrect results with multiple absolute maximums.
  They returned the last index when having multiple occurrences,
  instead of the first one.

- Implemented a bug-fix to handle this issue in these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element from the search for its index. This way,
  we use lower-latency instructions to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.
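The fixed first-occurrence semantics can be shown with a scalar reference (illustrative only; the AVX512 kernels achieve this with vector compares and index masks): a strict '>' comparison keeps the first index of the absolute maximum, whereas '>=' would wrongly return the last.

```c
#include <math.h>
#include <stddef.h>

/* Return the index of the FIRST element with maximum absolute value. */
static size_t amaxv_sketch(size_t n, const double *x)
{
    size_t idx = 0;
    double maxabs = fabs(x[0]);
    for (size_t i = 1; i < n; ++i)
        if (fabs(x[i]) > maxabs)   /* strict '>' keeps the first occurrence */
        {
            maxabs = fabs(x[i]);
            idx = i;
        }
    return idx;
}
```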

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian
f548f42607 Fixing compiler warnings on ZGEMM
- Scoped some of the variables used in zgemm_blis_impl()
  when determining the thresholds to small path. These
  variables will be used only when the architecture is
  ZEN5 or ZEN4.

AMD-Internal: [CPUPL-5895]
Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13
2025-01-07 10:59:43 +05:30
Meghana Vankadari
c9f0240679 Returning early for col-major inputs in u8s8s32os32|s8 APIs
Details:
- For u8s8s32os32|s8 APIs, A & B matrices are of different
  datatypes. Hence col-major inputs cannot be supported by
  swapping the matrices internally. Added a check to return
  early in such cases.

Change-Id: I99fbebe811c3d05310f30f7fc978f5084b5a51ba
AOCL-Jan2025-b1
2025-01-05 23:46:06 +05:30
harsh dave
ea4212c550 Increase buffer size to prevent segmentation fault
The threshold for the tiny path was large, but the buffer size was
not enough to store the complete packed matrix, leading to
segmentation faults.

This commit fixes the buffer size as per the threshold of the tiny GEMM
path. With the corrected buffer size, the matrix is packed correctly.

AMD-Internal: [CPUPL-6201]
Change-Id: I0292a07f6146e7f1ccd8c1010b4c41c218fd9b47
2025-01-03 06:39:50 -05:00
Mithun Mohan
8d8a8e2f19 Light-weight logging framework for LPGEMM.
-A light-weight mechanism/framework to log input details and a
stringified version of the post-ops structure is added to LPGEMM.
Additionally, the runtime of the API is also logged.
The logging framework logs to a file with a filename following the format
aocl_gemm_log_<PID>_<TID>.txt.
-To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to
be defined when compiling BLIS (with aocl_gemm addon enabled) by passing
CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally
AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during
LPGEMM runtime.

AMD-Internal: [SWLCSG-3280]
Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb
2025-01-03 11:28:57 +00:00
Nallani Bhaskar
6cb1acf3c3 Fixed out-of-memory read access in bf16 reorder reference
Description:

Loop count was taken as 16 instead of n0_partial_rem in packb_nrlt16_bf16bf16f32of32_col_major_ref function.

Updated comments on reference reorder functionality.

AMD Internal: SWLCSG-3279

Change-Id: Idfc3b92906bc2b24651c7923e395fe10db56166b
2025-01-03 04:09:08 -05:00
Meghana Vankadari
bfc512d3e1 Implemented batch_gemm for bf16bf16f32of32|bf16
Details:
- The batch matmul performs a series of matmuls, processing
  more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
  to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
  different parameters, including dims, post-ops, storage schemes, etc.
- This operation is optimized for problems where all the
  GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
  problems equally irrespective of their dimensions which
  leads to better performance for batches with identical GEMMs
  but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.

AMD-Internal: [SWLCSG-2944]
Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155
2025-01-03 03:28:32 -05:00
Shubham Sharma
8f99d8a5bb Fixed warnings and compilation issues with GCC in TRSM
- The current implementation uses macros to expand the code at
  compile time, but this causes some false warnings in GCC 12 and 14.
- Added a switch case in the TRSM right variants for n_remainder.
- This ensures that n_rem is a compile-time constant; therefore
  warnings related to array subscripts out of bounds are fixed.
- The mtune=znver3 flag was causing a compilation issue in GCC 9.1,
  therefore this flag is removed.
- Renamed the file bli_trsm_small to bli_trsm_small_zen5 in order
  to avoid the possibility of missing symbols.

AMD-Internal: [CPUPL-6199]
Change-Id: Ib8e90196ce0a41d38c2b29226df5ab6c2d8ba996
2024-12-18 06:22:05 -05:00
Edward Smyth
4ce708c316 Move some BLAS extension APIs to extra subdirectories
In preparation for merging next group of changes from upstream BLIS,
move some BLAS extension APIs to new extra subdirectories in
frame/compat and frame/compat/cblas/src. Other extension APIs will
be moved in later commits.

Some tidying up to better match upstream BLIS code has also been done.

AMD-Internal: [CPUPL-2698]
Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb
2024-12-17 04:54:39 -05:00