Commit Graph

3540 Commits

Author SHA1 Message Date
Nallani Bhaskar
9735391e1d Implemented f32tobf16 reorder function
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.

The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.

The objective of the API is to convert a model/matrix
of type f32 to bf16 so that it can be processed when the
machine supports the bf16 FMA instruction _mm512_dpbf16_ps
but the model is still in float.

Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
2024-11-04 04:32:01 +00:00
Mithun Mohan
097cda9f9e Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API.
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.

AMD Internal: [CPUPL-5959]

Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
2024-10-30 08:52:22 +00:00
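The runtime selection described in commit 097cda9f9e can be pictured as a small lookup on the AOCL_ENABLE_INSTRUCTIONS environment variable. The following is only a sketch: the enum, the function names, and the accepted strings ("avx2", "avx512") are assumptions made for illustration and are not taken from the lpgemm sources.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical ISA tags; the real lpgemm context types are not shown in the log. */
typedef enum { ISA_NATIVE, ISA_AVX2, ISA_AVX512 } isa_choice_t;

/* Map a requested instruction-set string to a choice. NULL or unknown
   strings fall back to whatever hardware detection would normally pick. */
isa_choice_t isa_from_string(const char *s)
{
    if (s == NULL) return ISA_NATIVE;
    if (strcmp(s, "avx2") == 0) return ISA_AVX2;      /* assumed spelling */
    if (strcmp(s, "avx512") == 0) return ISA_AVX512;  /* assumed spelling */
    return ISA_NATIVE;
}

/* Read the override once, for example during context initialization. */
isa_choice_t isa_from_env(void)
{
    return isa_from_string(getenv("AOCL_ENABLE_INSTRUCTIONS"));
}
```

Falling back to the native choice on an unrecognized value keeps the default behaviour unchanged when the variable is unset.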
Edward Smyth
9ce2696fc9 GTestSuite: Fix builds testing against MKL
Correction to CMakeLists.txt to fix problem building executables
when testing against MKL.

AMD-Internal: [CPUPL-5928]
Change-Id: Ie427fff0afb48be6ce6d940b1db2c9d1c7a40e5b
2024-10-29 06:32:27 -04:00
Edward Smyth
cffb501e00 GTestSuite: ILP64 build fix
Cast literal 0 to match integer size in std::max tests.

AMD-Internal: [CPUPL-4500]
Change-Id: I330aafd8669884c5e1900b95742b5d1e4ce8ddfa
2024-10-29 06:10:49 -04:00
Chandrashekara K R
62da6e6163 Added "NOTICES" file.
Change-Id: I3022942f46983a7b80c2b1ca39ce6b013e303768
2024-10-25 14:41:49 +05:30
Meghana Vankadari
b04b8f22c9 Introduced un-reorder API for bf16bf16f32of32
Details:
- Added a new API called unreorder that converts a matrix from
  reordered format to its original format (row-major or col-major).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
  API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
  row vs. col storage of matrix and transposes for bf16 datatype.

Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
2024-10-23 07:49:24 -04:00
Deepak Negi
6ddde390aa Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix (odd strides) from
 column-major input format. This also supports the transB scenario if
 input B matrix is row major.

Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853
2024-10-22 03:01:39 -04:00
varshav2
dabfdf484a Add Scale post-op for F32 API
- Implemented the Scale post-op for the F32 API for all kernels
 - f32_scale = (f32 * scale_factor) + offset
 - Added the bench inputs

Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
2024-10-21 06:08:31 -04:00
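The formula quoted in commit dabfdf484a, f32_scale = (f32 * scale_factor) + offset, can be written as a scalar loop. This is a minimal sketch assuming a single scale factor and offset for the whole buffer; the function name is hypothetical, and the real kernels are vectorized and may take per-column values.

```c
#include <stddef.h>

/* Apply out[i] = in[i] * scale_factor + offset to n floats (scalar form).
   Hypothetical helper, not the lpgemm post-op interface. */
void f32_scale_postop(const float *in, float *out, size_t n,
                      float scale_factor, float offset)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale_factor + offset;
    }
}
```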
Chandrashekara K R
cfbe202868 Updated version string from 4.2.1 to 5.0.1
Change-Id: I4cbd8d9ae7e35fa235a6707fe7ddbd157eb63b98
(cherry picked from commit 7f1824b8ee)
2024-10-17 03:07:17 -04:00
Chandrashekara K R
8cf0f92559 Updated AOCL-BLAS 5.0 EULA.
Change-Id: Ibc85b2df01f5d118087b2459f890e14a20dd680a
2024-10-09 11:56:05 +05:30
Eleni Vlachopoulou
d6a411d6b6 GTestSuite: Reorganizing some tests
- Breaking tests to smaller executables.
- Removing some redundant tests.

AMD-Internal: [CPUPL-4500]
Change-Id: I6288c3fcf5194ccb5de3485ca1ad95a20414208c
2024-10-02 11:48:18 -04:00
Shubham Sharma.
2b02024404 Fixed bug in bli_zdotv_zen4_asm_avx512 kernel.
-  Data-type of n, and conj is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(n)),
   the "n" should be 64-bit otherwise incorrect values get loaded.

Fix: We typecast these variables to int64_t before loading into registers.
Thanks to mangala.v@amd.com for finding this bug.

Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
2024-09-24 02:33:44 -04:00
Shubham Sharma
1833ee70cd Fixed bug in bli_dgemm_avx512 8x24 native kernel.
-  Data-type of m, n, k,ldc is dim_t which will be int32_t for LP64 case.
-  When loading 64-bit registers using "mov" instructions, mov(rax, var(m)),
   the "m" should be 64-bit otherwise incorrect values get loaded.

Fix: We typecast these variables to int64_t before loading into registers.

AMD-Internal: [CPUPL-5819]
Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0
2024-09-23 04:54:04 -04:00
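The problem fixed in commits 2b02024404 and 1833ee70cd is that a 64-bit load from the address of a 32-bit variable also picks up whatever sits next to it. The sketch below imitates that in portable C with memcpy instead of inline assembly; the struct and function names are hypothetical and exist only to make the effect visible.

```c
#include <stdint.h>
#include <string.h>

/* Two adjacent 32-bit values, standing in for a 32-bit dim_t and its neighbour. */
typedef struct { int32_t n; int32_t neighbour; } args32_t;

/* Imitates a 64-bit "mov" from the address of the 32-bit field n:
   all eight bytes are read, so the neighbour leaks into the result. */
int64_t load64_from_int32_slot(const args32_t *a)
{
    int64_t raw;
    memcpy(&raw, a, sizeof raw);
    return raw;
}

/* The fix described in the commits: widen to int64_t first, then load. */
int64_t widen_then_load(const args32_t *a)
{
    int64_t n64 = (int64_t)a->n;
    return n64;
}
```

With n = 5 and a non-zero neighbour, only the widened path returns 5.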
Deepak Negi
16653ed208 Added support for column major B matrix in BF16S4F32F32 reorder API.
-Added new pack kernels that packs/reorders B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.

Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
2024-09-20 01:11:21 +05:30
Eleni Vlachopoulou
72536e56ba GTestSuite: Reducing gemm tests.
Since there is thorough kernel testing, we reduce the number of "Black Box" test cases so that CI is faster.

AMD-Internal: [CPUPL-4500]
Change-Id: Ie57eeccff8103c0051eb1904162d6447da0ef102
2024-09-19 12:17:20 -04:00
Edward Smyth
6330ac6a52 GTestSuite: Misc changes
- Correct matsize and NumericalComparison functions for
  tests with first matrix dimension <= 0.
- BLAS1:
  - Fix for BLAS vs CBLAS differences in amaxv IIT_ERS tests.
  - Threshold adjustments in ddotxf and zaxpy.
  - Break axpyv and scalv into separate executables for
    each data type.
- BLAS2:
  - Threshold adjustments in symv and hemv.
  - Break ger into separate executables for each data type.
- UKR:
  - Break gemm and trsm ukr test into separate executables
    for each data type.
  - Threshold adjustments in daxpyf
  - Disable {z,c}trsm ukr tests when BLIS_INT_ELEMENT_TYPE
    is used, as matrix generator is not currently suitable
    for this.

AMD-Internal: [CPUPL-4500]
Change-Id: I1d9e7acc11025f1478b8b511c14def5517ef0ae6
2024-09-19 10:17:36 -04:00
Vignesh Balasubramanian
4da1ad2cd9 Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs
- Added the appropriate CBLAS wrappers for CROTG, CSROT,
  ZROTG and ZDROT APIs. These would internally call their
  ?_blis_impl() layer.

AMD-Internal: [CPUPL-5813]
Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97
2024-09-19 08:49:46 -04:00
varshav2
605517964b Add Transpose Kernel for A matrix in F32F32f32Of32
- Implemented the AVX512 packA kernel for col major inputs in F32 API
 - Removed the workarounds for the n = 1, mtag_a = PACK case, where the execution was
   being directed to GEMM instead of GEMV.

Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
2024-09-19 06:37:11 -04:00
Shubham Sharma
b3b56ae3bb BugFix: Fixed GEMM mixed precision failure in ZEN5
- Optimized DGEMM macro kernel does not
  support mixed precision.
- This kernel was being used for solving
  some of the mixed precision problems.
- Currently only ( bli_obj_elem(A) == 8 ) is used for checking
  if the problem being solved is mixed precision.
- bli_obj_elem(A) will be equal to 8 for both double precision
  data type and mixed precision case single-complex.
- Added extra checks (bli_obj_is_real( a )) to make sure that
  A and B are real and DGEMM macro kernel is being used only
  for DDDGEMM.

AMD-Internal: [CPUPL-5804]
Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9
2024-09-19 04:54:53 -04:00
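Commit b3b56ae3bb notes that an element size of 8 bytes does not distinguish double from single-complex. A small sketch of that ambiguity and of the added "is real" check follows; the enum and helper names are hypothetical stand-ins, not the BLIS object API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical datatype tags for illustration only. */
typedef enum { DT_FLOAT, DT_DOUBLE, DT_SCOMPLEX, DT_DCOMPLEX } dtype_t;

size_t elem_size(dtype_t dt)
{
    switch (dt) {
    case DT_FLOAT:    return sizeof(float);
    case DT_DOUBLE:   return sizeof(double);
    case DT_SCOMPLEX: return 2 * sizeof(float);   /* 8 bytes, same as double */
    case DT_DCOMPLEX: return 2 * sizeof(double);
    }
    return 0;
}

bool is_real(dtype_t dt)
{
    return dt == DT_FLOAT || dt == DT_DOUBLE;
}

/* The old test: size alone, which single-complex also passes. */
bool size_only_check(dtype_t a)
{
    return elem_size(a) == 8;
}

/* The stricter test: both operands must be 8-byte real types. */
bool can_use_dgemm_kernel(dtype_t a, dtype_t b)
{
    return elem_size(a) == 8 && is_real(a) &&
           elem_size(b) == 8 && is_real(b);
}
```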
Arnav Sharma
df0ce5a799 Bugfix: Fix for gemmsup_r Reference Kernels
- The existing row-preferred reference kernels for GEMM SUP path were
  not taking into consideration the packing state of matrices A or B.
  Thus, whenever either or both A and B matrices were packed the
  kernel was unable to iterate appropriately through the matrices
  thereby calculating incorrect values resulting in failures.

- Although the SUP path is disabled by default for the generic
  configuration, the Pack and Compute Extension APIs use these kernels,
  so this issue resulted in their failures as well.

- With this patch, the loops being used in these kernels have been fixed
  to iterate over steps of MR and NR while also accounting for the
  fringe cases. Within the updated loops, temporary pointers used to
  point to the correct block/panel of the matrices are incremented with
  panel strides of respective matrices.

AMD-Internal: [CPUPL-5674]
Change-Id: Ic3939877c79ebb9ccf9e53b1d1672cea4b8c5959
2024-09-18 13:52:28 -04:00
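The loop structure described in commit df0ce5a799, stepping by MR and NR with fringe handling, can be sketched as below. This is an illustration under stated assumptions: the callback type and function names are hypothetical, and the sketch does not model packed panels or panel strides, only the block partitioning.

```c
#include <stddef.h>

/* Hypothetical callback: receives the block origin (i, j) and its size (mb x nb). */
typedef void (*block_fn)(size_t i, size_t j, size_t mb, size_t nb, void *ctx);

/* Visit an m x n problem in MR x NR blocks. Edge blocks are clipped to the
   remaining rows/columns (the "fringe" cases). Returns the number of blocks. */
size_t for_each_block(size_t m, size_t n, size_t mr, size_t nr,
                      block_fn fn, void *ctx)
{
    size_t count = 0;
    if (mr == 0 || nr == 0) return 0;
    for (size_t j = 0; j < n; j += nr) {
        size_t nb = (n - j < nr) ? (n - j) : nr;
        for (size_t i = 0; i < m; i += mr) {
            size_t mb = (m - i < mr) ? (m - i) : mr;
            if (fn != NULL) fn(i, j, mb, nb, ctx);
            ++count;
        }
    }
    return count;
}

/* Example callback: accumulate the covered area into a size_t counter. */
void sum_area(size_t i, size_t j, size_t mb, size_t nb, void *ctx)
{
    (void)i;
    (void)j;
    *(size_t *)ctx += mb * nb;
}
```

Summing the block areas is a cheap way to confirm that every element is visited exactly once, including the fringe rows and columns.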
Eleni Vlachopoulou
c7a5d04d4d GTestSuite: Disabling failing tests.
Those can be run if --gtest_also_run_disabled_tests is used.
Bugs will be addressed and resolved in the future.

AMD-Internal: [CPUPL-4500]
Change-Id: I7a5443606ea8ef20f18ff8beec14bece5f6ee661
2024-09-18 13:12:35 +01:00
Edward Smyth
54f8fb951e GTestSuite: BLAS2 test case selection
Various changes to BLAS2 test cases:
- GEMV: Reduce number of tests to make runtime more reasonable.
- TRSV:
  - Standardize tests across different data types, including
    adding memory testing for all variants.
  - Improve scaling when making matrix A diagonally dominant and
    avoid singular matrix when BLIS_INT_ELEMENT_TYPE is used.
- TRMV: Copy TRSV generic tests.
- Expand set of tests for HEMV, HER, HER2, SYMV, SYR, SYR2 and
  make lda contribution to test names consistent with other
  routines.
- Various adjustments to thresholds added.

Update gtestsuite documentation to describe using GTEST_FILTER
environment variable to select tests to run or exclude. This
works particularly well when using ctest, as we do not enumerate
all the tests at this level and so need to pass the selection
down to the individual executables.

AMD-Internal: [CPUPL-4500]
Change-Id: Ifcb6410455b7f91e58b555f94b9fd7920d7ad9d9
2024-09-17 09:35:29 -04:00
Edward Smyth
61c6f1ad78 GTestSuite: Fix alpha and beta input argument tests
Check if alpha and beta are null before testing values. This
avoids possible seg faults if alpha or beta have not been
defined in IIT tests.

AMD-Internal: [CPUPL-4500]
Change-Id: Ibbf2d6a8fb38d9a95033f3fec3d06c3441e98689
2024-09-17 09:00:09 -04:00
Chandrashekara K R
e4eed817aa Added logic to use right format specifier to read integer value.
Updated logic to use "%ld" and "%lld" format specifiers to read
64-bit integers from input files using the fscanf function on Linux and
Windows respectively when the user sets INT_SIZE='auto' on a 64-bit
machine or INT_SIZE='64'. Otherwise "%d" is used on both Windows and
Linux for benchmarking blis and LPGEMM.

Change-Id: I4762c4c1b3fcd09cf66d0cc9572d38766be6be60
2024-09-17 04:48:59 -04:00
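Commit e4eed817aa chooses between "%ld" and "%lld" per platform. A different, standard-library way to read a 64-bit integer without a per-platform switch is the SCNd64 macro from <inttypes.h>. The sketch below shows that alternative with sscanf; it is not what the commit does, and the function name is hypothetical.

```c
#include <inttypes.h>
#include <stdio.h>

/* Parse one signed 64-bit integer from text.
   SCNd64 expands to the correct conversion for int64_t on the target
   platform. Returns 1 on success, 0 otherwise. */
int parse_i64(const char *text, int64_t *out)
{
    return sscanf(text, "%" SCNd64, out) == 1;
}
```

The same format string works with fscanf when reading from a bench input file.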
Edward Smyth
8d4881c4fd GTestSuite: add option to test blis_impl layer
Add BLAS_TEST_IMPL option for TEST_INTERFACE to test the
wrapper layer underneath BLAS and CBLAS interfaces. This is
particularly useful if building a BLIS library with these
interfaces disabled, e.g.

   ./configure --disable-blas amdzen
or
   cmake . -DENABLE_BLAS=OFF -DBLIS_CONFIG_FAMILY=amdzen

The ?_blis_impl wrappers should have the same arguments as the
BLAS interfaces, thus we define TEST_BLAS_LIKE as an additional
definition for convenience when selecting tests and options in
the C++ files.

AMD-Internal: [CPUPL-5650]
Change-Id: I0275a387563f3efc2b40029950c8569956f2df7b
2024-09-16 09:53:56 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiply all vector elements by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
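The behavioural difference that commit a07e041b1f refers to, multiplying by alpha=0 versus setting the vector to zero, shows up as soon as the vector contains a NaN. The sketch below uses hypothetical function names and unit stride; it is not the BLIS SCALV or SETV kernel code.

```c
#include <math.h>
#include <stddef.h>

/* BLAS-style behaviour: always multiply, so 0 * NaN stays NaN. */
void scal_multiply(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; ++i) x[i] *= alpha;
}

/* SETV-style shortcut: alpha == 0 overwrites the vector with zeros,
   which discards any NaN or Inf that was present. */
void scal_setv_on_zero(size_t n, double alpha, double *x)
{
    if (alpha == 0.0) {
        for (size_t i = 0; i < n; ++i) x[i] = 0.0;
        return;
    }
    for (size_t i = 0; i < n; ++i) x[i] *= alpha;
}

/* Returns 1 if any element is NaN, 0 otherwise. */
int has_nan(size_t n, const double *x)
{
    for (size_t i = 0; i < n; ++i) {
        if (isnan(x[i])) return 1;
    }
    return 0;
}
```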
Chandrashekara K R
91d4337b8b Updated format specifier for fscanf to read double values.
Updated format specifier to read signed double("%lld") and unsigned
double("%llu") from file using fscanf from both windows and Linux.

AMD-Internal: [CPUPL-5787]
Change-Id: Ibef50b0df708f474e22f703240e264eff1de3994
2024-09-13 14:57:28 +05:30
Vignesh Balasubramanian
ed682a429d Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels
- Added explicit typecast to the pointers that are passed
  to the _mm_prefetch( ... ) intrinsic, to avoid compiler
  warnings.

AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea
2024-09-12 11:37:06 +05:30
varshav2
7c78b9991f Bug Fixes in the F32F32 m == 1 transpose scenario
- added the missing stride updates in B reorder case in GEMV
 - added the missing stride updates for the case of transA with B
   reordered case.

Change-Id: Ic89781dfa7c0d9380ea523796958f795828a1ade
2024-09-11 02:08:50 -04:00
Mithun Mohan
453c9f0084 Fixes for bfloat16 accumulation rounding errors in bench.
For the bf16bf16of32bf16 lpgemm api, inside the micro-kernels in order
to convert the accumulated float values to bfloat16 before storing,
the _mm512_cvtneps_pbh intrinsic (vcvtneps2bf16) is used. This
intrinsic rounds the value based on a rounding bias logic. This commit
replicates the same rounding logic inside the bf16 bench accuracy check
function to get a proper one-to-one comparison of output values.

AMD Internal: [SWLCSG-2948]

Change-Id: I135ac39ac8484769b6c0fe5b3e351dd22d7ca1d8
2024-09-11 01:39:11 -04:00
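The "rounding bias" idea mentioned in commit 453c9f0084 is commonly implemented in software as round-to-nearest-even on the upper 16 bits of the float. The following is a sketch of that common technique, not the code from the bench. The function name is hypothetical, and the handling of NaN and denormal inputs here is an assumption that may differ from what the hardware instruction does.

```c
#include <stdint.h>
#include <string.h>

/* Convert a float to bfloat16 bits with round-to-nearest-even.
   Adds a bias of 0x7FFF plus the lowest kept bit, then truncates. */
uint16_t f32_to_bf16_rne(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* Assumption: keep NaN as a quiet NaN instead of letting the bias carry. */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + bias) >> 16);
}
```

An exact tie rounds toward the value whose last kept bit is zero, which is why plain truncation in a reference check would disagree with the kernel output on some inputs.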
Meghana Vankadari
5120f98e12 Developed all WoQ kernels for bf16s4f32o<f32|bf16>
Description:

1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
   weights into bf16 is performed in the kernel itself to reduce bandwidth.

2. These kernels are performing better compared to bf16 weights when m
   is small and n is large.

3. Established a threshold to do quantization support at packing of
   B (KCXNC) level or WoQ kernel level.

Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289
2024-09-10 12:00:10 +00:00
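The in-kernel s4 scaling mentioned in commit 5120f98e12 starts from signed 4-bit weights packed two per byte. A scalar sketch of unpacking and scaling one byte is given below. The nibble order (low nibble first), the single scale factor, and the float output are all assumptions for illustration; the real kernels are vectorized and produce bf16.

```c
#include <stdint.h>

/* Sign-extend a 4-bit two's-complement value (range -8..7) to int8_t. */
int8_t s4_sign_extend(uint8_t nibble)
{
    nibble &= 0x0F;
    return (int8_t)((nibble ^ 0x08) - 0x08);
}

/* Unpack two s4 weights from one byte and apply a scale factor.
   Assumed layout: out[0] comes from the low nibble, out[1] from the high one. */
void s4_pair_dequant(uint8_t packed, float scale, float out[2])
{
    out[0] = (float)s4_sign_extend((uint8_t)(packed & 0x0F)) * scale;
    out[1] = (float)s4_sign_extend((uint8_t)(packed >> 4)) * scale;
}
```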
varshav2
298a165718 Add TransA and TransB support for F32F32F32oF32
- Added support for TransA and transB in f32f32of32 APIs
 - Modified the GEMV case(m == 1) to support PACKB feature
 - Redirecting the operations to GEMM instead of GEMV in case of n == 1
   conditions, with storage scheme r/transA and c/transB to avoid the
   packing errors which would lead to failures in computation.

Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9
2024-09-09 05:19:49 +05:30
Chandrashekara K R
5ada963b4c Clang version requirement is updated in cmake
Description:
Since the latest VNNI instructions are supported only from Clang
version 18 onward, updated the clang version check from 17 to 18.

AMD-Internal: [CPUPL-5744]
Change-Id: I4a3ecec65bd88d9dccfe1018fb25cb7be29946f0
2024-09-09 06:42:36 -04:00
Mithun Mohan
cf123aa926 Disabling smart threading for small input dimensions.
-It has been observed that reduction of threads as part of smart
threading for smaller input dimensions hampers the performance of the
other inputs with larger dimensions due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (small m and n)
fixes this issue.
-Bug fixes related to work split in LPGEMV for n < NR and m < MR cases.

AMD Internal: [SWLCSG-2948]

Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121
2024-09-06 09:20:42 -04:00
Meghana Vankadari
687abe4c96 Bug fix in WOQ kernel for m=4 case.
- Updated pre_op_off computation for nr0 < NR cases.
- Fixed warnings in bench file.

Change-Id: Iae30fa84b6b47ebd94ab05d2139056aee24546d7
2024-09-05 05:00:30 +00:00
Meghana Vankadari
2e1cc2f14a Added bf16s4f32 kernels to handle m=4 cases
Details:
- In WOQ, if m = 4, special case kernels are added where
  s4->bf16 conversion happens inside the compute kernel and
  packing is avoided. For all other cases, B matrix is
  dequantized and packed at KC loop level and native bf16
  kernels are re-used at compute level.
- Fixes in bench to avoid accuracy failures when datatype of
  output is bf16.

Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2
2024-09-04 07:36:57 -04:00
Edward Smyth
711dce14d0 Export full set of _blis_impl interfaces
The _blis_impl layer provides a BLAS-like API for use in builds
where BLAS and CBLAS interfaces are not desirable. This patch
generates interfaces in uppercase and with and without trailing
underscores, to match what is generated for the regular BLAS
interface.

AMD-Internal: [CPUPL-5650]
Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2
2024-09-03 04:13:06 -04:00
Mangala V
705755bb5c Revert "Using znver2 flags for building zen/zen2/zen3 kernels on amdzen builds."
This reverts commit 7d379c7879.

Reason for revert: < Perf regression is observed for GEMM(gemm_small_At)
                    as fma uses memory operand >

Change-Id: I0ec3a22acaacfaade860c67858be6a2ba6296bce
2024-09-02 09:07:46 -04:00
Deepak Negi
e429e57b53 Replaced int32_t with dim_t in lpgemm bench
Replaced int32_t with dim_t (int64_t) to avoid overflow.

Change-Id: I4132b72fcbffd9dbd2242b3638922931bcdb1b80
2024-09-02 09:03:02 -04:00
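The overflow that commit e429e57b53 guards against appears when a product of dimensions no longer fits in 32 bits. A minimal sketch follows; the function name is hypothetical and only shows the widening step.

```c
#include <stdint.h>

/* Compute the number of elements of an m x n matrix in 64 bits.
   Each operand is widened before the multiply, so the product cannot
   wrap the way a 32-bit multiplication would. */
int64_t matrix_elems(int32_t m, int32_t n)
{
    return (int64_t)m * (int64_t)n;
}
```

For m = n = 50000 the result is 2,500,000,000, which exceeds INT32_MAX (2,147,483,647).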
mkadavil
1257eaf72d Disabling smart threading for bandwidth bound input patterns.
For some applications, one of the input dimensions is mostly m < MR or
n < NR with the other dimension being small for the most part, with
intermittent large ones. Currently in these cases (m < MR or n < NR),
the number of threads used is reduced (as part of smart threading) if
the other dimension (n or m) is also small. For larger dimensions all
the threads are used.
However, it has been observed that this reduction of threads hampers the
performance of the larger inputs due to lower operating frequency of
the newly launched threads (apart from the existing ones). Disabling
smart threading for these bandwidth bound input patterns (m < MR or
n < NR) fixes this issue.

AMD Internal: [SWLCSG-2948]

Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf
2024-09-02 02:18:45 +05:30
varshav2
d4e0fa9b4c Revert duplicate check and fix bug in the check for post-ops
- Revert of patch 1110983 - Duplicate check removal and early return for
	s8s8s32/u8s8s32
- Add fix - Added check to see if post-ops is enabled with col-major
  storage and return early in that case.

Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca
2024-08-29 06:52:14 -04:00
Hari Govind S
6dd8f06aff Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set
-  Reverted the change done for tuning ddotv API. When number of threads
   is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads
   are not calculated and as a result number of threads value is -1.
   OpenMP threads are launched with -1 value. This results in crash.
   This bug is fixed by correctly calculating number of threads.

AMD-Internal: [SWLCSG-3028][CPUPL-5689]
Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a
2024-08-29 05:42:03 -04:00
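The crash described in commit 6dd8f06aff comes from passing an unset thread count of -1 on to OpenMP. One way to picture a corrected calculation is to derive the total from the per-loop settings when the overall count is unset. The function below is a hypothetical sketch under that assumption; it considers only the JC and IC values and is not the BLIS threading code.

```c
/* Derive a usable thread count. A value <= 0 means "not set".
   Hypothetical helper: if the overall count is unset, fall back to the
   product of the per-loop values, treating unset ones as 1. */
int effective_num_threads(int nt, int jc_nt, int ic_nt)
{
    if (nt > 0) return nt;
    int jc = (jc_nt > 0) ? jc_nt : 1;
    int ic = (ic_nt > 0) ? ic_nt : 1;
    return jc * ic;
}
```

The result is always at least 1, so a negative value never reaches the code that launches threads.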
Edward Smyth
1f18eeb267 Determine AMD FP/SIMD execution datapath width
Different Zen processors may have a 512-bit, 256-bit or 128-bit
FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows
a selection of FP512 or FP256 width in BIOS settings. Add cpuid
code to detect the width and store an indication of it in the
global variable bli_fp_datapath. This should be accessed internally
via the function bli_cpuid_query_fp_datapath(). This functionality
is currently only enabled on x86_64 platforms and only currently
reports a value for AMD CPUs.

Also add Zen3 as a fallback path for any unknown AMD processors if
AVX512 is not supported or has been disabled.

AMD-Internal: [CPUPL-4415]
Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a
2024-08-28 12:25:57 -04:00
Deepak Negi
6dcf500703 Element wise operations API for float(f32) input matrix in LPGEMM.
This API supports applying element wise operations (eg: post-ops) on a
float(f32) input matrix to get an output matrix of the same (float(f32)).


AMD Internal: [SWLCSG-2947]

Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24
2024-08-27 03:28:52 -04:00
Chandrashekara K R
2ff0125f11 CMake: Enabled ADDON(aocl_gemm) feature for Windows.
1. Updated datatype from __int64_t to int64_t. Since
   __int64_t was not defined for Windows
2. Updated CMake build system to build lpgemm on windows

Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8
2024-08-23 07:05:28 -04:00
varshav2
e3c434080a Fix duplicate check and early return in s8s8s32/u8s8s32
- removed the duplicate check for col-major inputs in s8s8s32/u8s8s32
  APIs
- Fixed the print in bench_lpgemm

Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c
2024-08-23 02:32:20 +05:30
jagar
8fe1d32316 pkg-config file for blis MT library
Updated Makefile to update st/mt library in blis.pc

Change-Id: Idc61d7652ee6380cf2d73f08caaf9e9216fbb77a
2024-08-22 07:55:55 -04:00
Vignesh Balasubramanian
189a0b7224 Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs
- Bug : For non-zen architectures, {D/C/Z}AXPBY had
  incorrect datatypes passed when querying the computational
  kernel from context. The right datatype is now passed to
  each variant.

- Bug : For ZAXPY, a NULL context was passed to the kernel
  when using the single-threaded path. In case of further
  using the context inside the kernel, this would be an issue.
  We now pass the context instead of a null pointer.

AMD-Internal: [CPUPL-5643]
Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf
2024-08-22 14:12:14 +05:30
Edward Smyth
e7d3d444d8 Disable disabling sba pools
The disable sba pools functionality currently gives incorrect results
at runtime when multiple threads are used. Fixes and improvements are
present in the upstream version of BLIS, so until these are downstreamed
only allow builds where sba pools are enabled.

AMD-Internal: [CPUPL-5512]
Change-Id: I9ccd654477fb714a2fb5f38a138b7e9b5e55e33d
2024-08-21 08:03:43 -04:00
Edward Smyth
3a6d367f9c GTestSuite: Fix TRSM ukr tests in non-zen builds
Add guards around bli_trsm_small kernel tests to only call them
if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing
symbol errors in tests of non-zen builds, e.g. generic or skx.

AMD-Internal: [CPUPL-4500]
Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407
2024-08-21 07:45:06 -04:00