Details:
- Corrected the usage of vpdpbusd instruction in
GEMV implementation for INT8 APIs.
- Modified bench to fill matrices with values
ranging between -5 and +5 whenever the datatype is
a signed integer.
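For reference, a minimal Python model of what one 32-bit lane of vpdpbusd
computes. This is only a sketch of the instruction semantics, not the GEMV
kernel itself; the helper names are hypothetical. The point it illustrates is
that the first multiplicand is read as unsigned bytes and the second as
signed bytes, so swapping the operands changes the result.

```python
def to_signed8(byte):
    """Reinterpret an unsigned byte value (0..255) as a signed 8-bit integer."""
    return byte - 256 if byte >= 128 else byte


def wrap_int32(value):
    """Wrap a Python integer to signed 32-bit (non-saturating accumulate)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """Model one lane: acc += sum(u8(a[i]) * s8(b[i])) for i in 0..3."""
    if len(a_bytes) != 4 or len(b_bytes) != 4:
        raise ValueError("each lane consumes exactly four bytes per operand")
    total = acc
    for a, b in zip(a_bytes, b_bytes):
        # a is treated as unsigned, b as signed.
        total += (a & 0xFF) * to_signed8(b & 0xFF)
    return wrap_int32(total)
```

With a = [255, 1, 2, 3] and b = [0xFB, 5, 0xFF, 0], b is read as
[-5, 5, -1, 0] and the lane accumulates 255*(-5) + 1*5 + 2*(-1) + 3*0.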
Change-Id: I457462b888b667d8a34c53de762e9b4aee784ecc
Reordering B matrix of datatype int4 is done as per the pack schema
requirements of u8s8s32 kernel. However for fringe cases, the
matrix pointer increments need to be halved to account for the half
byte size of int4 elements.
AMD-Internal: [SWLCSG-2390]
Change-Id: I22a04c4c8133db6ae6ca0a4d3e86c11aba1e2cdb
* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
Use of AOCL_ENABLE_INSTRUCTIONS in dgemm tiny code path is
unnecessary and incorrectly caused AVX512 code to be run
on zen4 and later processors when AOCL_ENABLE_INSTRUCTIONS=avx2
or an equivalent option was selected.
Replace with code to select kernel in a similar way to other
dgemm code paths and other APIs. Note that at present AVX2 code
is used for the smallest matrix sizes on all zen platforms.
AMD-Internal: [CPUPL-5078]
Change-Id: Ie6b4895461cbbb915d2b48b92fc063f5cd6adb85
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMP.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it (namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
Support for reordering B matrix of datatype int4 as per the pack schema
requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion
implemented via leveraging the vpmultishiftqb instruction. The reordered
B matrix will then be used in the u8s8s32o<s32|s8> api.
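A scalar Python sketch of the int4_t -> int8_t widening is given here for
illustration. It is not the vectorized vpmultishiftqb path, and the nibble
order (low nibble holds the even-indexed element) is an assumption, as are
the function names. It also shows why offsets into an int4 buffer are half
the element count: element idx lives in byte idx // 2.

```python
def sign_extend_int4(nibble):
    """Map a 4-bit pattern (0..15) to its signed value (-8..7)."""
    nibble &= 0xF
    return nibble - 16 if nibble >= 8 else nibble


def unpack_int4(packed, count):
    """Expand `count` int4 elements stored two per byte into a list of ints."""
    out = []
    for idx in range(count):
        byte = packed[idx // 2]  # two elements share one byte
        # Assumption: even elements sit in the low nibble, odd in the high.
        nibble = (byte >> 4) if idx % 2 else (byte & 0xF)
        out.append(sign_extend_int4(nibble))
    return out
```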
AMD-Internal: [SWLCSG-2390]
Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba
Details:
- For a variable x, using the address of x in an instruction throws an
exception if the difference between &x and the access position is
larger than 2 GiB. To solve this issue, all variables are stored
within the JIT code section and are accessed using relative addressing.
- Fixed a bug in B matrix pack function for s8s8s32os32 API.
- Fixed a bug in JIT code to apply bias on col-major matrices.
AMD-Internal: [SWLCSG-2820]
Change-Id: I82f117a0422c794cb9b1a4d65a89d60de4adfd96
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta value. This is done
in order to utilize sensible optimizations with
regards to the compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early exit condition (when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from context, based on the input values
of alpha and beta.
- Added the necessary L1 vector kernels (i.e., ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for AVX2 kernels of ?SETV,
?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs.
- Updated the existing thresholds in ?AXPBYV tests for complex
types. This is due to the fact that every complex multiplication
involves two mul ops and one add op. Further added test-cases
for API level accuracy check, that includes special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV into several other
L1 BLAS APIs (in case the reference does not support its own
?AXPBYV API). The decomposition is done to match the exact
operations that are done in BLIS based on alpha and/or beta
values. This ensures that we test for our own compliance.
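The rerouting described above can be sketched in Python as follows, for
y := beta * y + alpha * x. The exact mapping of (alpha, beta) pairs to
kernels is an assumption made for illustration; the commit only names the
kernels involved and the early-exit condition. Function names are
hypothetical.

```python
def select_axpbyv_kernel(n, alpha, beta):
    """Pick a kernel name for y := beta*y + alpha*x (assumed mapping)."""
    if n == 0 or (alpha == 0 and beta == 1):
        return "early_exit"  # nothing to compute
    if alpha == 0:
        return "setv" if beta == 0 else "scalv"
    if beta == 0:
        return "copyv" if alpha == 1 else "scal2v"
    if beta == 1:
        return "addv" if alpha == 1 else "axpyv"
    return "axpbyv"


def axpbyv(alpha, x, beta, y):
    """Reference behaviour of each special case; returns the new y."""
    kernel = select_axpbyv_kernel(len(x), alpha, beta)
    if kernel == "early_exit":
        return list(y)
    if kernel == "setv":
        return [0.0 for _ in x]             # y is not read
    if kernel == "scalv":
        return [beta * yi for yi in y]
    if kernel == "copyv":
        return list(x)                      # y is not read
    if kernel == "scal2v":
        return [alpha * xi for xi in x]     # y is not read
    if kernel == "addv":
        return [yi + xi for xi, yi in zip(x, y)]
    if kernel == "axpyv":
        return [yi + alpha * xi for xi, yi in zip(x, y)]
    return [beta * yi + alpha * xi for xi, yi in zip(x, y)]
```

Note that in the beta == 0 cases the sketch never reads y, so NaN or Inf
values already in y do not propagate into the result.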
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
- Updated the fused kernels (DOTXF and AXPYF) to properly handle cases
when b_n > fuse_factor.
- The fused kernels are expected to invoke respective Level-1 kernels
iteratively when b_n > fuse_factor.
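A minimal Python sketch of this fallback for AXPYF (y := y + alpha * A * x,
with A given as b_n columns). The fuse factor of 4, the column-list layout
and the function names are assumptions for illustration only.

```python
def axpyv(alpha, x, y):
    """Level-1 kernel: y := y + alpha * x, updated in place."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


def axpyf(alpha, a_cols, x, y, fuse_factor=4):
    """y := y + alpha * A * x, where a_cols holds the b_n columns of A."""
    b_n = len(a_cols)
    if b_n > fuse_factor:
        # More columns than the fused kernel handles: fall back to one
        # Level-1 AXPYV call per column.
        for j in range(b_n):
            axpyv(alpha * x[j], a_cols[j], y)
        return
    # Fused path: a single pass over y that consumes all b_n columns.
    for i in range(len(y)):
        acc = 0.0
        for j in range(b_n):
            acc += a_cols[j][i] * x[j]
        y[i] += alpha * acc
```

Both paths produce the same y; only the number of passes over y differs.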
AMD-Internal: [CPUPL-5246]
Change-Id: Ie7a0f4e61ede088663e3491269b3f1398d028095
BUG:
When both the real and imaginary parts of alpha are zero, the
output is computed as C = Beta * C + A * B instead of C = Beta * C.
FIX:
Updated the kernel to scale the A * B product by alpha when alpha = 0.
Existing framework design:
- When both the real and imaginary parts of alpha are zero, the framework
skips the kernel call to avoid the alpha * A * B operation.
- SCALM is invoked to perform Beta * C.
- No accuracy issue was observed because alpha = 0 was handled in the framework.
- If the kernel is called directly with alpha = 0, the results would be wrong.
- The issue was found during micro-kernel testing using GTestsuite.
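For illustration, a reference computation in Python of what the kernel is
expected to return when called directly. This is a sketch with hypothetical
names, not the optimized kernel: the A * B product is always scaled by
alpha, so alpha = 0 leaves only Beta * C.

```python
def gemm_ukr_reference(alpha, a, b, beta, c):
    """Return beta*C + alpha*(A*B) for complex matrices as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            ab = sum(a[i][p] * b[p][j] for p in range(k))
            # Scaling by alpha must not be skipped, even when alpha == 0.
            row.append(beta * c[i][j] + alpha * ab)
        out.append(row)
    return out
```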
AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
- Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs.
- Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it.
AMD-Internal: [CPUPL-4877]
Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba
- When n=1, reorder of B matrix is avoided to efficiently
process data. A dot-product based kernel is implemented to
perform gemv when n=1.
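As a sketch of the idea (row-major A and the function names are
assumptions): with n = 1, each output element is the dot product of one row
of A with the single column of B, so B can be read as is without
reordering.

```python
def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def gemv_n1(a_rows, b_col):
    """C (m x 1) = A (m x k) * B (k x 1); one dot product per row of A."""
    return [dot(row, b_col) for row in a_rows]
```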
AMD-Internal: [SWLCSG-2354]
Change-Id: If5f74651ab11232d0b87d34bd05f65aacaea94f1
Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests
for different ISAs (i.e. BLIS sub-configurations):
- Add missing SSE and AVX options. These will all choose the
generic option in amdzen builds.
- For unsupported ISAs (e.g. AVX512 on Milan), select the
hardware's default sub-configuration instead of trying
to step down through alternative choices.
- For invalid options, or options not implemented in the BLIS
build (e.g. skx in amdzen build), select the hardware's
default sub-configuration instead of aborting.
Currently BLIS_ARCH_TYPE behaviour is not affected by these
changes.
AMD-Internal: [CPUPL-5078]
Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f
- Introduced new 8x24 row preferred kernel for zen5.
- Kernel supports row/col/gen
storage schemes.
- Prefetch of current panel of A and C
are enabled.
- Prefetch of next panel of B is enabled.
- Kernel supports negative offsets for A and B
matrices.
- Cache block tuning is done for zen5 core.
AMD-Internal: [CPUPL-5262]
Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f
- Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for
n_elem <= 4000000
AMD-Internal: [CPUPL-4408]
Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc
Enabled a command line option to link the libiomp5.so, libomp.so or libgomp.so library using CMake.
E.g.: -DOpenMP_libomp_LIBRARY=<path to OpenMP library including library name>.
If this variable is not set, the OpenMP library defaults to libomp.so for clang and libgomp.so for gcc.
Change-Id: I5bffa10ff8351f5d10f0d543cbdf55aa16c84c90
- Fixed bug in ddotxf generic tests where the parameters lda_inc and
inca were being read incorrectly.
- Fixed bug in dotxf test wherein the y vector was being generated with
length m instead of b.
- Corrected function signatures to use type gtint_t instead of gint_t.
- Updated the tests to use conjugate values of type char and convert to
conj_t type only while invoking BLIS tests for both DOTXF and AXPYF.
AMD-Internal: [CPUPL-5117]
Change-Id: I0ef7af429057583a1cbf34827802e72401181caf
Improve consistency in test names across different APIs:
- Improve consistency of TEST_P part of test names.
- Rename *_evt_testing.cpp and nrm2_extreme.cpp files to
*_evt.cpp to match other APIs.
- Standardize naming of IIT_ERS files.
Also:
- Restore trsv IIT_ERS file which was misnamed in commit
a2beef3255
- Tidy ukr gemm tests to be more consistent with each other
and move threshold setting to individual TEST_P functions
to allow different adjustments to be made.
- Similarly make trsm tests more consistent.
- Tidy naming of is_memory_test variable.
AMD-Internal: [CPUPL-4500]
Change-Id: I0af1fc9973b02187b19a7c2488eed1b829cfdc2f
Description:
- The _mm512_storeu_epi8 and _mm512_storeu_epi16
intrinsics are not available in gcc-10.
- Replaced the above intrinsics with _mm512_storeu_si512.
Change-Id: I2878780b7acd040ccf45e571d486ff8c2388088c
-As it stands the bf16bf16f32ob16 API expects bias array to be of type
float. However actual use case requires the usage of bias array of bf16
type. The bf16 micro-kernels are updated to work with bf16 bias array by
upscaling it to float type and then using it in the post-ops workflow.
-Corrected register usage in bf16 JIT generator for bf16bf16f32ob16 API
when k > KC.
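The upscaling step can be sketched in Python as below. A bf16 value is the
upper 16 bits of an IEEE float32, so widening it is a 16-bit left shift of
the bit pattern. The function names and the way the bias is applied to an
accumulator row are assumptions for illustration.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen a bf16 bit pattern (16-bit int) to a float via float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def add_bf16_bias(acc_row, bias_bits):
    """Add a bf16 bias vector to a row of float accumulators."""
    return [acc + bf16_bits_to_float(b) for acc, b in zip(acc_row, bias_bits)]
```

For example, 0x3F80 is the bf16 pattern for 1.0 and 0xC000 is -2.0.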
AMD-Internal: [SWLCSG-2604]
Change-Id: I404e566ff59d1f3730b569eb8bef865cb7a3b4a1
Description:
--Added support for transB in u8s8s32o<s32|s8> and
s8s8s32o<s32|s8> APIs
--Updated the bench_lpgemm by adding options to
support transpose of B matrix
--Updated data_gen_script.py in lpgemm bench
according to latest input format.
AMD-Internal: [SWLCSG-2582]
Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048
BLIS includes the BLAS and CBLAS interfaces for zdscal
but not the BLIS typed interface bli_zdscalv. Thus, when
TEST_INTERFACE=BLIS_TYPED is defined, disable tests
for zdscal.
AMD-Internal: [CPUPL-4671]
Change-Id: I397454c83e272f9e775e37e00533002576041a93
- Updated the parameter check for leading dimensions in the
functions handling transpose case of matrix A.
- Updated the logic to perform ?IMATCOPY operation. The new
logic uses an auxiliary buffer to copy and scale in place,
if and when needed. This is done in order to avoid overwriting
any subsequent reads that might follow (specifically in case
of having different leading dimensions for reading and writing).
- Updated xerbla_() to throw memory allocation failure based on
INFO parameter being -10. This value is specific to its use-case
in ?IMATCOPY, where it is set to -10.
- Updated the Extreme Value Tests(EVT) logger for ?IMATCOPY
for uniformity.
- Cleaned up the files to follow coding conventions.
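A small Python sketch of why the auxiliary buffer matters for the
no-transpose, column-major case. The names are hypothetical and this sketch
always uses the buffer, whereas the commit uses it only when needed. With
lda_in = 2 and lda_out = 3, a direct forward copy would write element (0,1)
to index 3 before element (1,1) has been read from that same index.

```python
def imatcopy_n(m, n, alpha, buf, lda_in, lda_out):
    """In-place B := alpha * A on a column-major buffer, no transpose.

    A is read with leading dimension lda_in and the result is written
    with leading dimension lda_out into the same buffer.
    """
    # Stage the scaled values first so no write can clobber a pending read.
    aux = [alpha * buf[j * lda_in + i] for j in range(n) for i in range(m)]
    for j in range(n):
        for i in range(m):
            buf[j * lda_out + i] = aux[j * m + i]
```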
AMD-Internal: [CPUPL-4862][SWLCSG-2706]
Change-Id: I34dfa2bcb66b821315e11f7ab2139c41a79ef780
- Correct value of alpha in ger ERS test.
- Rename ERS_IIT.cpp files to match naming convention
used for other APIs.
- Change all cases of gint_t to gtint_t except for
dotxf, which is fixed in another commit.
- Add TEST_UPPERCASE_ARGS to imatcopy and omatcopy{2}
headers.
- Corrected typo.
AMD-Internal: [CPUPL-4500]
Change-Id: I8844bb8c5941785e64daa9df5569092c19f91838
-Adding multiplier for complex APIs.
-Updating for trmv and trsv to reflect multiplication with alpha.
AMD-Internal: [CPUPL-4500]
Change-Id: I17361da5afa5d1e219b4c8a14542e2b216a7ea58
Improve consistency in test names across different APIs.
In this commit, standardize leading dimensions (lda, ldb,
ldc) in test names. Also some misc tidying changes.
AMD-Internal: [CPUPL-4500]
Change-Id: Icbc82d0b9a3420ddfdb4f418396f9e56ab1765ab
Correction to commit 8657e661fc
to allocate matrix or vector correctly when special read-only
case occurs.
Also define a set_matrix generator for symmetric matrices
to only set upper or lower triangle to the supplied value,
while setting the unused elements to a large value to help
catch incorrect access to those elements.
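A Python sketch of such a generator, for illustration only: the function
name, the list-of-rows layout and the choice of 1.0e10 as the "large value"
are assumptions.

```python
def set_symmetric_matrix(n, uplo, value, unused=1.0e10):
    """Return an n x n matrix as a list of rows.

    The stored triangle (including the diagonal) is set to `value`; the
    other triangle is set to `unused` so stray reads stand out.
    """
    upper = uplo.lower() == "u"
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            in_stored_triangle = (j >= i) if upper else (j <= i)
            row.append(value if in_stored_triangle else unused)
        matrix.append(row)
    return matrix
```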
AMD-Internal: [CPUPL-4548]
Change-Id: I22b3a20e2ce8be70eb27179247cd47fdb2d87b9d
Export more symbols for BLIS kernels so that AOCL libFLAME
optimizations can call them directly.
AMD-Internal: [CPUPL-5044]
Change-Id: I45392b8a2a14ac2816141521b90b7ddb1216c733
1. Enabled AVX512 path for
- Upper variant
- Different storage schemes for upper and lower variant
2. Modified mask value to handle all fringe cases correctly
AMD-Internal: [CPUPL-5091]
Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.
AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
- Enabled DGEMMT SUP upper kernels in AVX512 code path.
- Enabled use of optimized kernels for all the storages
supported by optimized kernels.
AMD-Internal: [CPUPL-4881]
Change-Id: Id4486610dacaabc405fbc35b2588607c6508705e
- Implemented optimized lpgemv for both m == 1 and n == 1 cases.
- Fixed a few bugs in LPGEMV for bf16 and f32 datatypes.
- Fixed a few bugs in JIT-based implementation of LPGEMM for BF16
datatype.
AMD-Internal: [SWLCSG-2354]
Change-Id: I245fd97c8f160b148656f782d241f86097a0cf38
Improve consistency in test names across different APIs.
Various changes in this patch:
- Explicitly cast char variables to std::string when
adding to test name. Adding the char directly was
causing errors in name generation.
- Use template version of print function in zdscalv
and remove print function zdscalvGenericTestPrint.
- Remove unused print function ztrsvPrint.
- Eliminate some differences in gemm ukr print
functions.
- Remove extraneous API name labels in ukr axpyf and
setv.
- Make ukr/trsm/test_trsm_ukr.h more consistent with
other files.
AMD-Internal: [CPUPL-4500]
Change-Id: Ib8092de216712586fe4ec0ae91698d0c1aaffd54
SWISH post-op computes swish(x) = x / (1 + exp(-1 * alpha * x)).
SiLU = SWISH with alpha = 1. Adding the support for swish in JIT
based BF16 kernels.
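A scalar Python reference for the formula above, as a sketch rather than
the JIT implementation. The branch on the sign of alpha * x is an addition
of this sketch: it evaluates the same expression in an algebraically
equivalent form so that math.exp does not overflow for large negative
inputs.

```python
import math


def swish(x, alpha=1.0):
    """swish(x) = x / (1 + exp(-alpha * x)); alpha = 1 gives SiLU."""
    z = alpha * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    # Equivalent form x * e^z / (1 + e^z); avoids exp() of a large positive.
    e = math.exp(z)
    return x * e / (1.0 + e)


def silu(x):
    """SiLU is SWISH with alpha = 1."""
    return swish(x, alpha=1.0)
```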
AMD-Internal: [SWLCSG-2387]
Change-Id: I9eea0c801f5f067a5cfbd2941bc991708b86e45e
Improve consistency in test names across different APIs.
In this commit, standardize storage, side, uplo, trans,
diag and conj in test names.
AMD-Internal: [CPUPL-4500]
Change-Id: Ifcdb6e9f684b134841d86087218d7aefd9cabe63
- Reduced number of jump operations in AVX512
assembly kernel for SCOPYV, DCOPYV and ZCOPYV.
- Fixed memory test failure for bli_zcopyv_zen_int_avx512
kernel.
- Replaced existing AVX2 COPYV intrinsic kernels in
bli_cntx_init_zen5.c with AVX512 assembly kernels.
Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159
Some BLAS routines do not require matrices or vectors to be
initialized in certain use cases. For example, in GEMM when
beta=zero, C is set rather than updated, thus input values of
C should not be used. In these cases set the initial values of
such matrices or vectors to an extreme value, to help detect
if these are incorrectly being read.
The extreme value can be NaN or Inf. The default is Inf;
change it by running
cmake ... -DEXT_VALUE=NaN
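The detection idea can be sketched in Python as follows. The function names
are hypothetical and this is not the GTestsuite code. C is pre-filled with
Inf; an implementation that honours beta = 0 never reads it, while one that
multiplies by beta unconditionally produces 0 * Inf = NaN in the output.

```python
import math


def gemm(alpha, a, b, beta, c, honour_beta_zero=True):
    """Return beta*C + alpha*A*B; optionally skip reading C when beta == 0."""
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            ab = sum(a[i][p] * b[p][j] for p in range(k))
            if beta == 0 and honour_beta_zero:
                row.append(alpha * ab)  # C is set, not updated
            else:
                row.append(beta * c[i][j] + alpha * ab)
        out.append(row)
    return out


def has_extreme_value(matrix):
    """True if any element is NaN or Inf."""
    return any(not math.isfinite(v) for row in matrix for v in row)
```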
AMD-Internal: [CPUPL-4548]
Change-Id: I4a665363779d2496b8247f6357e970b7f23cd1eb
- Utilized the memory testing feature in GTestsuite
to update the testing interfaces for micro-kernel
testing of SCOPY, DCOPY and ZCOPY APIs.
Change-Id: I3d6905f33b000b8d5e60727aa896bd869f4f441f
- The existing vectorized code was disabled because
of failures observed in MATLAB tests.
- The issue is caused by underflow during division when the diagonal
elements of the A matrix are very small.
- When a diagonal element is very small (4E-324 in the MATLAB case),
squaring it during division causes the square to be rounded to zero.
- The fix is to normalise (ar) and (ai) by dividing both by
max(ar, ai); this makes either (ar) or (ai) equal to 1, and hence
reduces the likelihood of underflow.
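A scalar Python sketch of the two division strategies for
(br + i*bi) / (ar + i*ai). The function names are hypothetical, and using
max(|ar|, |ai|) with absolute values as the scale is an assumption of this
sketch. A zero divisor is not handled.

```python
def divide_naive(br, bi, ar, ai):
    """Textbook form; ar*ar + ai*ai underflows to 0.0 for tiny ar, ai."""
    denom = ar * ar + ai * ai
    return ((br * ar + bi * ai) / denom, (bi * ar - br * ai) / denom)


def divide_scaled(br, bi, ar, ai):
    """Normalise the divisor by s = max(|ar|, |ai|) before squaring."""
    s = max(abs(ar), abs(ai))
    ar_s, ai_s = ar / s, ai / s  # one of these has magnitude 1
    # b / a = b * conj(a_s) / (s * |a_s|^2), where a = s * a_s
    denom = s * (ar_s * ar_s + ai_s * ai_s)
    return ((br * ar_s + bi * ai_s) / denom, (bi * ar_s - br * ai_s) / denom)
```

With ar = 5e-324 (the smallest positive double) and ai = 0, the naive
denominator is exactly 0.0, whereas the scaled version divides by 5e-324
and returns a finite quotient.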
AMD-Internal: [CPUPL-5052]
Change-Id: Iff7893fdcb92907a12e6af8e102a92637a13ce4f
- Added accuracy and memory tests for AVX2 and AVX512 ?SETV kernels,
AVX512 ZAXPYV kernel and AVX512 ZAXPYF kernels, with fuse-factors
2, 4 and 8.
- Cleaned up the code section that declares and defines the reference
compute for the AXPYF operation. Corrected the type mismatch with the
arguments that the reference AXPYV would expect (this is used to decompose
AXPYF as part of the reference). Ensured usage of GTestSuite's internal
alias for integer types.
- Updated the API level testsuite and testing interface for AXPYF,
based on the cleanup done to the reference code.
AMD-Internal: [CPUPL-4974]
Change-Id: I71de6c09d3877cd3dd1eaa20ab4f90e7c33eb1e1
Test programs for key APIs like GEMM take a long time to run,
and even to generate the list of test cases. Break into
separate test programs for different data types to enable
these to run in parallel (at gtest level). In this patch
we break up GEMM, TRSM, GEMV and TRSV.
AMD-Internal: [CPUPL-4500]
Change-Id: I21363b050d30e0402d5a1e8cbeaed2ebcc87aaeb
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
- Updated DOTV Gtestsuite interface to invoke C/ZDOTC when conjx='c'
and testing interface is either BLAS or CBLAS.
- Added ukr tests for bli_zdotv_zen4_asm_avx512( ... ) and
bli_zdotv_zen_int_avx512( ... ) kernels.
AMD-Internal: [CPUPL-5011]
Change-Id: I32fb69027a35d9ea92f997a095d412c8242a4b68
* Functional tests are covered for saxpyv and zaxpyv.
* As part of functional testing, tests for large sizes of m, strides
greater than m, scalar combinations (including special cases) and
zero increments are added for saxpyv and zaxpyv.
Signed-off-by: eseswari <sangadala.eswari@amd.com>
AMD-Internal: CPUPL-4413
Change-Id: I61473357680cb0f394e6e653796ec31110895fa4
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8