- By default, the A matrix is not expected to be packed for the normal
row-stored case; hence the packing implementation is incomplete.
- But if the user explicitly enables packing, the interface was not handling
this condition appropriately, leading to data being overwritten inside the
incomplete pack kernels and thereby to accuracy failures.
- As a fix, updated the interface to set the explicit PACK A to UNPACKED and
proceed with GEMM in cases where transpose of A is not necessary.
- Updated the batch gemm input file with additional test cases covering all the
APIs.
Bug Fixes:
- Fixed the implementation logic for column-major inputs with post-ops disabled
in S8 batch mat-mul. With the existing implementation, column-major inputs would not
be executed in the case of of32/os32 inputs.
- Fixed the Scale/ZP calculation in bench for the u8s8s32ou8 condition, which was
leading to accuracy failures.
AMD-Internal: [CPUPL-7283]
- Adjusted the DCOPY aocl_dynamic threshold on Zen4 for optimal
fast-path selection.
- Extended the Zen5 dynamic-scheduling logic beyond the previous
8-thread limit.
- Updated the corresponding fast_path_thresh values in the frame to
match the new dynamic logic.
- Added the initial implementation of the dgemm_thread_decision()
function to decide between single-threaded and multi-threaded
execution for DGEMM inputs.
- The function models per-thread tile work, core GFLOP/s, and
thread overhead (T_over=15 µs), and computes a K-threshold
that determines when multi-threading becomes beneficial.
- Returns true for ST and false for MT.
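The decision described above can be sketched with a simple cost model. Only the 15 µs thread overhead comes from the text; the function name suffix, GFLOPS_CORE constant and exact formula below are illustrative assumptions, not the actual BLIS implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative cost model only: GFLOPS_CORE and the formula are
   assumptions -- only the 15 us thread overhead comes from the text. */
#define T_OVER_US   15.0   /* per-thread spawn/sync overhead, microseconds */
#define GFLOPS_CORE 50.0   /* assumed sustained GFLOP/s of one core */

/* Returns true when single-threaded (ST) execution is predicted faster. */
static bool dgemm_thread_decision_sketch( int m, int n, int k, int nthreads )
{
    double flops   = 2.0 * m * n * k;                /* total DGEMM work  */
    double t_st_us = flops / ( GFLOPS_CORE * 1e3 );  /* 1 GFLOP/s = 1e3 flop/us */
    double t_mt_us = flops / ( nthreads * GFLOPS_CORE * 1e3 ) + T_OVER_US;
    return t_st_us <= t_mt_us;                       /* ST wins for tiny inputs */
}
```

In this model, multi-threading becomes beneficial once the per-thread work saved exceeds the fixed spawn/sync overhead, which yields the K-threshold mentioned above.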
AMD-Internal: [SWLCSG-3418]
- The logic that determines whether the small code path should be taken does not take into account whether matrix A is too large.
- Added a condition to use native code path if matrix A is very large.
AMD-Internal: [CPUPL-7201]
- Packing of the A matrix is not necessary by default for row-major matrix data. Also, packing of A
was observed to cause regressions and hence was not expected to be used.
- However, packA is necessary in column-major cases, where a transpose has to be performed. This path has been verified.
- Hence, when the user sets pack A explicitly, execution enters the incomplete packA function, which overwrites
the elements in the buffer on subsequent iterations, leading to accuracy issues. As a fix, this patch updates the PACK
condition to UNPACKED at the interface when the user explicitly sets one, ensuring seamless execution.
AMD-Internal: [CPUPL-7193]
Details:
- Fixed loading of matadd and matmul pointers in GEMV
lt16 kernel for AVX2 M=1 case.
- Hard-set the row-stride of B to 1 (inside GEMV) when it has
already been reordered.
AMD-Internal: CPUPL-7197, CPUPL-7221
Co-authored-by: Balasubramanian, Vignesh <Vignesh.Balasubramanian@amd.com>
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later compilers.
The original code used a register-reuse pattern with _mm256_fmaddsub_ps() that causes the GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.
Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts
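As a scalar model of the lane semantics involved (a sketch, not the actual 256-bit kernel): _mm256_fmaddsub_ps(a, b, c) computes a*b - c on even lanes and a*b + c on odd lanes, while the fix computes the product first and then applies the addsub step, which is mathematically identical.

```c
#include <assert.h>

/* Scalar model of one even/odd lane pair; the real kernel operates on
   256-bit vectors.  fmaddsub: even lane a*b - c, odd lane a*b + c. */
static void fmaddsub_pair( const float a[2], const float b[2],
                           const float c[2], float out[2] )
{
    out[0] = a[0] * b[0] - c[0];  /* even lane */
    out[1] = a[1] * b[1] + c[1];  /* odd lane  */
}

/* The fix: _mm256_mul_ps first, then _mm256_addsub_ps -- the same math,
   but without the register-reuse pattern that miscompiled on GCC 12+. */
static void mul_then_addsub_pair( const float a[2], const float b[2],
                                  const float c[2], float out[2] )
{
    float t0 = a[0] * b[0];       /* _mm256_mul_ps step    */
    float t1 = a[1] * b[1];
    out[0] = t0 - c[0];           /* _mm256_addsub_ps step */
    out[1] = t1 + c[1];
}
```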
AMD-Internal: [CPUPL-6445]
- Added WORKING_DIRECTORY to try_run() calls to ensure execution occurs in ${BLIS_PATH}/lib.
- Prevents Windows error 0xc0000135 caused by missing DLLs during runtime of get_version.cpp.
- Ensures compatibility with both static and shared BLIS builds by aligning runtime context
with expected DLL locations.
AMD-Internal: [CPUPL-7187]
- Currently TRSM reference kernels are derived from GEMM blocksizes and GEMM_UKR.
- This does not allow the flexibility to use different GEMM_UKR for GEMM and TRSM if optimized TRSM_UKR are not available.
- Made changes so that ref TRSM kernels are derived from TRSM blocksizes.
- Changed ZEN4 and ZEN5 cntx to use AVX2 kernels for CTRSM.
AMD-Internal: [SWLCSG-3702]
* Optimized avx512 ZGEMM kernel and edge-case handling
Edge kernel implementation:
- Refactored all of the zgemm kernels to process micro-tiles efficiently.
- Specialized sub-kernels are added to handle the leftover m dimension: 12MASK,
8, 8MASK, 4, 4MASK, 2.
- 12MASK edge kernel handles 11, 10, 9 m_left using 2 full zmm
load/store and 1 masked load/store.
- Similarly 8MASK handles 7, 6, 5 m_left using 1 full zmm load/store and
1 masked load/store.
- 4MASK handles 3, 1 m_left using 1 masked load/store.
- The ZGEMM kernel now internally decomposes the m dimension as follows.
The main kernel is 12x4, which has the following edge kernels to
handle the left-over m dimension:
edge kernels:
12MASKx4 (handles 11x4, 10x4, 9x4)
8x4 (handles 8x4)
8MASKx4 (handles 7x4, 6x4, 5x4)
4x4 (handles 4x4)
4MASKx4 (handles 3x4, 1x4)
2x4 (handles 2x4)
- Similarly, it decomposes for the (12x3, 12x2 and 12x1) n_left kernels, under
which the edge kernels 12MASKxN_LEFT(3, 2, 1), 8xN_LEFT(3, 2, 1),
8MASKxN_LEFT(3, 2, 1), 4xN_LEFT(3, 2, 1), 4MASKxN_LEFT(3, 2, 1) and
2xN_LEFT(3, 2, 1) handle the leftover m dimension.
Threshold tuning:
- Enforced odd m dimensions onto the avx512 kernels in the tiny path, as the avx2
kernels invoke gemv calls for m_left=1 (odd m dimension of the matrix).
The gemv function call adds overhead for very small sizes and results
in suboptimal performance.
- A condition check "m%2 == 0" is added along with the threshold checks to
force inputs with an odd m dimension to use the avx512 zgemm kernel.
- Threshold change to route all of the inputs to the tiny path, eliminating the
dependency on the avx2 zgemm_small path if the A, B matrix storage is
'N' (no transpose) or 'T' (transpose).
- However, the tiny path re-uses zgemm sup kernels, which do not support
conjugate-transpose storage of matrices. For such storage of the
A, B matrices we still rely on the avx2 zgemm_small kernel.
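The routing described above can be sketched as follows; the function name, parameters and structure are illustrative, not the actual dispatch code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative routing only: within the small-size region, even-m inputs
   may still use the avx2 small path, but odd-m inputs are forced onto the
   avx512 tiny path so m_left == 1 never falls back to the costly gemv
   call.  Conjugate-transpose storage stays on the avx2 zgemm_small kernel
   because the tiny path's sup kernels do not support it. */
static bool use_avx512_tiny_path( int m, bool within_small_threshold,
                                  bool conj_transpose )
{
    if ( conj_transpose )          return false; /* sup kernels lack conj-T */
    if ( !within_small_threshold ) return true;  /* tiny path by default    */
    return ( m % 2 != 0 );                       /* odd m: force tiny path  */
}
```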
gtest changes:
- Removed the zgemm edge kernel functions (8x4, 4x4, 2x4 and fx4) and their
respective testing instances from gtest.
AMD-Internal: [CPUPL-7203]
---------
Co-authored-by: harsdave <harsdave@amd.com>
New tests:
- Add IIT_ERS tests for TRMM, SYMM, SYRK, SYR2K, HEMM, HERK, HER2K
Corrections and improvements:
- GEMM: Use local definitions of input size, trans, etc. arguments
to allow finer control of choices, especially for testing invalid
leading dimensions.
- GEMM, GEMMT, GEMM_COMPUTE, TRMM, TRSM: In the alpha=beta=zero
test, initialize C to an extreme value to test that C is set rather than scaled.
- GEMM: Use the correct M x N dimensions for C in calls to computediff.
- GEMM: Declare the info variable in disabled tests in gemm_IIT_ERS.cpp.
AMD-Internal: [CPUPL-6725]
- GEMV transpose kernels lack the ability to compute directly on non-unit stride inputs.
- This limitation stops libflame from using the blis kernel directly instead of going through the framework.
- Added the ability to handle non-unit incx in the kernel by packing x into a temporary buffer.
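The packing step amounts to gathering the strided vector into a contiguous buffer; a minimal sketch (the actual kernel's buffer management and type names differ):

```c
#include <assert.h>

/* Gather n elements of a strided vector x (stride incx) into a contiguous
   temporary buffer xp, so a unit-stride kernel can run on it unchanged.
   Helper name is illustrative. */
static void pack_vector_d( int n, const double* x, int incx, double* xp )
{
    for ( int i = 0; i < n; i++ )
        xp[i] = x[i * incx];
}
```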
AMD-Internal: [CPUPL-6903]
New tests:
- Add IIT_ERS tests for HEMV, HER, HER2, SYMV, SYR, SYR2
Corrections and improvements:
- GEMV: Correct matrix sizes for transpose cases when checking
input matrix A has not been modified.
- GEMV: Initialize y to extreme value for alpha=beta=zero case
to check y is set rather than scaled.
AMD-Internal: [CPUPL-6725]
- The status of security flags is already printed in the output from configure and CMake.
- Prints in the make command are redundant.
- Removed the prints from make to keep the build log clean.
AMD-Internal: [CPUPL-7222]
More improvements to DTL coverage and coding:
- Expand logging and tracing coverage to IxAMIN and GEMM_BATCH APIs
- Expand logging and performance states to GEMM3M APIs
- Expand logging coverage to matrix copy, transpose and add APIs
- Misc tidying of code
AMD-Internal: [CPUPL-7010]
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column-preferred
kernel names used _rv_ instead of _cv_. This patch renames the kernels
and file names to address these issues.
AMD-Internal: [CPUPL-6579]
The following flags have been added.
1. -D_FORTIFY_SOURCE=2
What it does
• At compile time the header files replace certain libc calls (strcpy, sprintf, …) with inline wrappers that perform a compile-time length check whenever the size of the destination buffer is known.
• At run time an extra check is executed only if the compiler could not prove the copy is safe.
Cost
• Only functions that call those specific libc routines pay anything.
2. -fstack-protector-strong
What it does
• Functions that contain local arrays, address‐taken locals, or alloca get a canary word inserted into the stack frame.
• The function prologue writes the canary; the epilogue verifies it before the ret.
Cost
• 8 bytes of additional stack per protected function frame.
• Two or three extra instructions per entry/exit.
3. -Wl,-z,relro
What it does
• Marks the relocation tables read-only after relocation is finished.
• No effect once the library is fully loaded.
Cost
• None at run time.
4. -Wl,-z,now
What it does
• Forces the dynamic loader to resolve all external symbols in the library up-front instead of lazily on first call.
Cost
• Startup: one extra relocation pass.
• Steady-state execution: zero or slightly faster, because PLT stubs are bypassed.
Usage:
cmake -DENABLE_SECURITY_FLAGS=off
cmake -DENABLE_SECURITY_FLAGS=on
configure --enable-security-flags
configure --disable-security-flags
AMD-Internal: [CPUPL-6886]
- For conjugate inputs, the ZTRSM small code path is less accurate than the native code path.
- Redirected conjugate inputs to the native code path on ZEN4 if TRSM preinversion is disabled.
- Tuned AOCL_DYNAMIC to handle the new inputs redirected to ZTRSM native.
Previously, the ZGEMM implementation used `zscalv` for cases
where the M dimension of matrix A is not a multiple of 24,
resulting in a ~40% performance drop.
This commit introduces specialized edge-case handling in the pack kernel
to optimize performance for these cases.
The new packing support significantly improves performance.
- Removed the reliance on `zscalv` for edge cases, addressing the
performance bottleneck.
AMD-Internal: [CPUPL-6677]
Co-authored-by: harsh dave <harsdave@amd.com>
Introduced support for GEMV operations with group-level symmetric quantization for the S8S8S32OS32 API.
Framework Changes:
- Added macro definitions and function prototypes for GEMV with symmetric quantization in lpgemm_5loop_interface_apis.h and lpgemm_kernels.h.
- LPGEMV_M_EQ1_KERN2 for the lpgemv_m_one_s8s8s32os32_sym_quant kernel, and
- LPGEMV_N_EQ1_KERN2 for the lpgemv_n_one_s8s8s32os32_sym_quant kernel.
- Implemented the main GEMV framework for symmetric quantization in lpgemm_s8s8s32_sym_quant.c.
Kernel Changes:
- lpgemv_m_one_s8s8s32os32_sym_quant for handling the case where M = 1 and implemented in lpgemv_m_kernel_s8_grp_amd512vnni.c.
- lpgemv_n_one_s8s8s32os32_sym_quant for handling the case where N = 1 and implemented in lpgemv_n_kernel_s8_grp_amd512vnni.c.
- Updated the buffer reordering logic for group quantization for N=1 cases in aocl_gemm_s8s8s32os32_utils.c.
Notes:
- Ensure that group_size is a factor of both K (and KC when K > KC).
- The B matrix must be provided in reordered format (mtag_b == REORDERED).
AMD-Internal: [SWLCSG-3604]
- Replaced separate real and imaginary accumulators (real_acc, imag_acc) with a column-wise accumulator array (row_acc[2]), making accumulation and updates to the target Y vector more direct, concise, and unified.
- Leveraged AVX-512 fused multiply-add/subtract operations (_mm512_fmaddsub_pd, _mm512_fmsubadd_pd) and efficient permutations (_mm512_permute_pd) to enable accurate and efficient computation of real and imaginary components in a single instruction, while reducing code complexity for both code paths.
- Removed redundant instructions (such as unnecessary permutations and zero-register operations) and simplified the control flow.
AMD-Internal: [CPUPL-7015]
* Bugfix: Tuned zgemm threshold for zen4
Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.
This tuning forces skinny matrices to take SUP path to ensure better
performance.
* Bugfix: Tuned zgemm threshold for zen4 and zen5
Threshold tuning that determines whether SUP or native path should
be used for given input matrix size.
This tuning forces skinny matrices to take SUP path to ensure better
performance.
---------
Co-authored-by: harsdave <harsdave@amd.com>
More improvements to DTL coverage and coding:
- Removed some DTL overheads from performance stats timing for all APIs
where it is currently implemented (i.e. gemm, gemmt, trsm, nrm2)
- Expand logging coverage to gemm pack and compute APIs, including
performance stats for gemm_compute
- Expand logging coverage to rot, rotg, rotm and rotmg APIs
- Tidied order of function prototypes in aocl_dtl/aocldtl_blis.h
AMD-Internal: [CPUPL-7010]
Commit eaa76dfe28 added the LAPACK 3.12 GEMMTR
interfaces as aliases to the existing BLIS GEMMT. Here we add the full set of
Fortran upper-case and no-underscore API aliases and _blis_impl variants.
AMD-Internal: [CPUPL-6581]
More improvements to DTL coverage and coding:
- Expand logging coverage to banded matrix APIs in frame/compat/f2c
- Expand logging coverage to packed matrix APIs in frame/compat/f2c
- Commit b8aa5c2894 was wrong to
remove calls to AOCL_DTL_INITIALIZE for APIs where bli_init_auto()
is not called. AOCL_DTL_INITIALIZE is essential when logging is
enabled but tracing is not, otherwise the ICV gbIsLoggingEnabled
will not be initialized based on logging status and remain as
the default FALSE value.
AMD-Internal: [CPUPL-7010]
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 APIs.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.
AMD-Internal: [SWLCSG-3408]
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
amdzen and x86_64 configuration family: On Intel processors supporting
AVX-512, the zen4 sub-configuration was dispatched by default: even
though it is not optimized specifically for Intel processors, it includes
a range of additional optimizations beyond those present in the older skx
sub-configuration. However, the zen4 data path is 256 bit, thus this
sub-configuration uses a mixture of AVX2 and AVX-512 kernels. Now that
the zen5 sub-configuration is available, with more extensive use of AVX-512
kernels, switch to using it by default on relevant Intel processors.
intel64 configuration family: On AMD processors supporting AVX-512 or
AVX2, the generic sub-configuration was dispatched by default. Change to
dispatch the skx or haswell sub-configuration, based on the available ISA
support.
AMD-Internal: [CPUPL-6743]
Fixed static analysis issues in ZTRSM (triangular solve with matrix) kernels for the Zen5 architecture by initializing variables to prevent potential use of uninitialized values.
- Initialize loop variables i, j, and k_iter to 0 to prevent potential uninitialized access.
- Initialize mask variables and remainder variables to 0 across multiple kernel functions.
We currently use the clang compiler on Windows, so there is no problem with
current builds. However, if we wanted to use the Microsoft compiler, a
different definition of BLIS_THREAD_LOCAL as __declspec(thread) would be needed.
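The proposed definition would look roughly like this; the guard conditions are illustrative, and the actual BLIS macro may differ:

```c
#include <assert.h>

/* Sketch: use __declspec(thread) for the Microsoft compiler, and the
   GNU-style __thread keyword otherwise (both clang and gcc accept it). */
#if defined(_MSC_VER) && !defined(__clang__)
  #define BLIS_THREAD_LOCAL __declspec(thread)
#else
  #define BLIS_THREAD_LOCAL __thread
#endif

BLIS_THREAD_LOCAL int tls_value = 0;  /* each thread gets its own copy */
```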
AMD-Internal: [CPUPL-6958]
More improvements to DTL coverage and coding:
- Tidy functions and prototypes in aocl_dtl/aocldtl_blis.{c,h} into
alphabetical groups within different BLAS categories.
- Expand tracing coverage to APIs in frame/compat/f2c
- Remove calls to AOCL_DTL_INITIALIZE (added in
c56dcb6ffb) as DTL_Trace calls
bli_init_auto which will call AOCL_DTL_INITIALIZE
AMD-Internal: [CPUPL-7010]
- Removed duplicate calls to BATCH_GEMM_CHECK().
- Refactored freeing of post-op pointer in bench code and verified the
functionality.
- Modified indexing of the array to take the correct values.
- Updated the thresholds to enter the AVX512 SUP codepath in
ZGEMM (on ZEN5). This caters to inputs that scale well with
multithreaded execution (in the SUP path).
- Also updated the thresholds to decide the ideal number of threads,
based on the 'm', 'n' and 'k' values. The thread-setting logic involves
determining the number of tiles for computation, and using them
to further tune for the optimal number of threads.
- This logic builds on the assumption that the current thread
factorization logic is optimal. Thus, an additional data analysis
was performed (on the existing ZEN4 and the new ZEN5 thresholds)
to also cover the corner cases where this assumption doesn't hold
true.
- As part of the future work, we could reimplement the thread
factorization for GEMM, which would additionally require a new
set of threshold tuning for every datatype.
AMD-Internal: [CPUPL-7028]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Fixed a bug in some bench applications where the GFLOPS computation ran into integer overflow because an explicit type cast to double was not done in the computation.
Also removed all multiplications by 1.0 during the GFLOP computation.
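The fix amounts to forcing floating-point arithmetic before the multiplications; a sketch of the pattern (not the exact bench code):

```c
#include <assert.h>
#include <math.h>

/* 2*m*n*k overflows 32-bit integers for moderate sizes (m = n = k = 2000
   already gives 1.6e10); starting the product with the double literal 2.0
   promotes every multiplication to double and avoids the overflow. */
static double gflops( int m, int n, int k, double seconds )
{
    return ( 2.0 * m * n * k ) / ( seconds * 1.0e9 );
}
```

Had the expression been written as `(double)(2 * m * n * k)`, the multiplications would still run (and overflow) in integer arithmetic before the cast.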
AMD-Internal: CPUPL-7016
---------
Co-authored-by: Rayan <rohrayan@amd.com>
Code cleanup: Removing some redundant if-else code in the CGEMM
aocl-dynamic logic. This should ensure that multiple branching is
avoided, while preserving existing heuristics.
AMD-Internal: [CPUPL-6579]
Co-authored-by: Rayan <rohrayan@amd.com>
Introduced BLIS_ATTRIB_ALIGN to standardize 64-byte alignment across platforms.
On Windows, alignment is enabled only for Clang and disabled for other compilers.
Replaced direct usage of __attribute__((aligned(64))) in rntm_s with the macro.
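A hypothetical reconstruction of the macro; the guard conditions and the stand-in struct name below are illustrative, and the actual BLIS definition may differ:

```c
#include <assert.h>

/* Sketch: 64-byte alignment attribute, disabled for non-Clang compilers
   on Windows where the GNU __attribute__ syntax is unavailable. */
#if defined(_WIN32) && !defined(__clang__)
  #define BLIS_ATTRIB_ALIGN
#else
  #define BLIS_ATTRIB_ALIGN __attribute__((aligned(64)))
#endif

typedef struct
{
    int dummy;
} BLIS_ATTRIB_ALIGN rntm_like_t;   /* illustrative stand-in for rntm_s */
```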
Instead of editing a header file, add options to build systems to allow
DTL tracing and/or logging output to be generated. For most users
logging is recommended, producing a line of output per application
thread of every BLAS call made. Tracing provides more detailed info
of internal BLIS calls, and is aimed more at expert users and BLIS
developers. Different tracing levels from 1 to 10 provide control of
the granularity of information produced. The default level is 5. Note
that tracing, especially at higher tracing levels, will impose a
significant runtime cost overhead.
Example usage:
Using configure:
./configure ... --enable-aocl-dtl=log amdzen
./configure ... --enable-aocl-dtl=trace --aocl-dtl-trace-level=6 amdzen
./configure ... --enable-aocl-dtl=all amdzen
Using CMake:
cmake ... -DENABLE_AOCL_DTL=LOG
cmake ... -DENABLE_AOCL_DTL=TRACE -DAOCL_DTL_TRACE_LEVEL=6
cmake ... -DENABLE_AOCL_DTL=ALL
Also, modify the function AOCL_get_requested_threads_count to correct the
reported thread count in cases where the internal value is recorded as -1.
AMD-Internal: [CPUPL-7010]
Description:
1. Created f32 intrinsic kernels without post-ops support so that f32 gemm
without post-ops runs optimally.
2. Initiated the no-post-ops kernels from the main kernel when the post-ops
handler has no post-ops to do.
3. The kernels are redundant but were added to get the best performance
for a pure GEMM call.
AMD-Internal: [SWLCSG-3692]
The environment variable AOCL_VERBOSE was inconsistent in its
behaviour, sometimes producing a single line of output per file from
multiple BLAS calls, when it should be all or nothing. Note that:
- AOCL_VERBOSE is only active when DTL logging has been enabled at
compile time. Otherwise, this environment variable is not read.
- When logging is enabled at compile time, logging output is produced
by default. Thus AOCL_VERBOSE is more useful for turning output off
rather than on.
- For production runs without logging, it is recommended to recompile
with DTL disabled, as this minimizes overheads within the BLIS code.
- AOCL_VERBOSE should be set to 0 or 1, and not values such as FALSE
or TRUE.
Changes to improve consistency when AOCL_VERBOSE is set:
- Change DTL variables from Bool (unsigned char) datatype to bool, as
used elsewhere in BLIS.
- Ensure bli_init_auto() is called before AOCL_DTL_TRACE_ENTRY() and
AOCL_DTL_LOG_*_INPUTS(), as bli_init_auto calls AOCL_DTL_INITIALIZE()
- In APIs which avoid calling bli_init_auto(), add explicit calls to
AOCL_DTL_INITIALIZE(). Also, make a proper comment about not calling
bli_init_auto(), rather than just commenting out the call, which looks like
dead code.
Other DTL logging control changes:
- Make gbIsLoggingEnabled ICV thread local as this can be updated by
calls to AOCL_DTL_Enable_Logs and AOCL_DTL_Disable_Logs APIs
- After recent changes to hide some internal BLIS definitions behind
ifdef BLIS_IS_BUILDING_LIBRARY guard, change BLIS_THREAD_LOCAL
definition to be exported again.
Logging output changes:
- Standardize printing of datatype to be lower case.
- Don't force printing of GEMM transa and transb to upper case, instead
print in the case provided by the application code.
- Add logging output to all variants (in terms of AMD/non-AMD optimized
and datatype) of SWAP and SCAL.
AMD-Internal: [CPUPL-7010]
- Temporary fix for a regression in DGEMV for non-unit stride y inputs. The code
section responsible for handling non-unit stride y has been removed from the
frame.
- The kernel code is extended with an if condition to handle both unit and
non-unit stride y.
AMD-Internal: [CPUPL-6869]
- As part of rerouting to AVX2 code paths on ZEN4/ZEN5 (or similar)
architectures, the code-base established a contingency when
deploying a fat binary on ZEN/ZEN2/ZEN3 systems. Due to this,
it was required that AOCL_ENABLE_INSTRUCTIONS always be set to
'ZEN3' (or similar values) to make sure AVX512 code does not run
on such architectures. This issue existed in the FP32 and BF16
APIs.
- Added checks to detect the AVX512-ISA support to enable rerouting
based on AOCL_ENABLE_INSTRUCTIONS. This removes the incorrect
constraint that was put forth.
AMD-Internal: [CPUPL-7020]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
- For the int8/uint8 reorder function, the k dimension is made a multiple of 4
to meet the alignment requirements.
- Modified the logic to update k_updated to use multiples of 4.
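The rounding can be expressed as a bit trick; the helper name is illustrative, not the actual function in the patch:

```c
#include <assert.h>

/* Round k up to the next multiple of 4, as required by the int8/uint8
   reorder alignment described above. */
static inline int round_up_to_multiple_of_4( int k )
{
    return ( k + 3 ) & ~3;
}
```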
AMD-Internal: [SWLCSG-3686]
The blis.h header file includes a lot of BLIS internal definitions. Some of these caused problems
when using a BLIS library compiled with clang on Windows from applications compiled with
the Intel icc and icx compilers. The workaround is to use "#ifdef BLIS_IS_BUILDING_LIBRARY" to
guard these definitions from being exposed to applications including blis.h. (The BLIS configure
and cmake build systems automatically define BLIS_IS_BUILDING_LIBRARY only when compiling
the BLIS library.)
This patch implements the minimum changes to resolve the issue. Longer term, similar changes
may need to be added around all BLIS internal definitions in blis.h.
AMD-Internal: [CPUPL-6953]
- Corrected the variables passed to batch_gemm_thread_decorator() for the
u8s8s32os32 API.
- Removed commented lines in f32 GEMV_M kernels.
- Modified some instructions in F32 GEMV M and N Kernels to re-use the existing macros.
- Re-aligned the BIAS macro in the macro definition file.
AMD-Internal: [CPUPL-7013]