-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
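
A minimal sketch of the scalar-or-vector lookup described above, not the actual LPGEMM code. The names param_at and downscale_row are hypothetical, and the formula x / scale_factor + zero_point is assumed from the downscale description elsewhere in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the j-th entry of a parameter that is either
   a single scalar (len == 1) or a per-column vector (len == n). */
static float param_at(const float *vals, size_t len, size_t j)
{
    assert(vals != NULL && len >= 1);
    return (len == 1) ? vals[0] : vals[j];
}

/* Downscale one row of int32 accumulators; no extra array is needed
   when a single scale factor or zero point covers all n columns. */
void downscale_row(const int32_t *acc, float *out, size_t n,
                   const float *scale, size_t scale_len,
                   const float *zero_point, size_t zp_len)
{
    assert(scale_len == 1 || scale_len == n);
    assert(zp_len == 1 || zp_len == n);
    for (size_t j = 0; j < n; ++j)
        out[j] = (float)acc[j] / param_at(scale, scale_len, j)
                 + param_at(zero_point, zp_len, j);
}
```

Passing a length of 1 with a pointer to a single value mirrors the scalar case; passing n with an array mirrors the previous behaviour.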
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
Details:
- LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout
memory when A matrix needs to be packed. This multi-threaded lock
overhead becomes prominent when m/n dimensions are relatively small,
even when k is large. In order to address this, bli_pba_acquire_m
is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE,
the memory is allocated using aligned malloc instead of checking
out from memory pool. Experiments have shown malloc costs to be
far lower than memory pool guarded by locks, especially for higher
thread count.
- Deleted a few unnecessary instructions from packing kernels.
- Replaced bench_input.txt with a smaller set of inputs.
AMD-Internal: [CPUPL-4329]
Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774
Some AVX512 intrinsics (e.g., _mm_loadu_epi8) were introduced in later
versions of gcc (11+), in addition to the already existing masked
intrinsics (e.g., _mm_mask_loadu_epi8). To support compilation using gcc
10.2, either the masked intrinsic or another gcc 10.2 compatible intrinsic
(e.g., _mm_loadu_si128) needs to be used in LPGEMM <u|s>8s8os32 kernels.
AMD-Internal: [SWLCSG-2542]
Change-Id: I6cfedfdcb28711b19df63d162ab267f5eea8d2ef
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
1. Prefetch only MR rows or rows required for fringe cases
2. Specify prefetching offset - the least column address supported
by masked functions
3. Removed unnecessary prefetches in fringe case for mx4 kernels
Updated gtestsuite for sgemm calls
AMD_Internal: [CPUPL-4221]
Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c
- This commit uses avx2 and avx512 masked load instructions
to handle the edge case where the vector size is not an exact
multiple of the avx2/avx512 vector register size.
- Thanks to Shubham, Sharma <shubham.sharma3@amd.com> for
avx512 ddotv kernel changes
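
A scalar sketch of the masked-tail idea, emulating the mask in plain C rather than using the real intrinsics. VEC_LEN, tail_mask, masked_load and dot_masked_tail are hypothetical names, and the lane count of 8 assumes doubles in a 512-bit register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_LEN 8 /* assumed lane count: 8 doubles per 512-bit register */

/* Bit i of the mask is set when lane i maps to a valid element. */
uint8_t tail_mask(size_t rem)
{
    assert(rem <= VEC_LEN);
    return (uint8_t)((1u << rem) - 1u);
}

/* Scalar stand-in for a masked load: inactive lanes become zero and
   the memory behind them is never read. */
static void masked_load(double *lanes, const double *src, uint8_t mask)
{
    for (int l = 0; l < VEC_LEN; ++l)
        lanes[l] = ((mask >> l) & 1u) ? src[l] : 0.0;
}

double dot_masked_tail(const double *x, const double *y, size_t n)
{
    double acc[VEC_LEN] = { 0.0 };
    double xv[VEC_LEN], yv[VEC_LEN];
    size_t i = 0;

    for (; i + VEC_LEN <= n; i += VEC_LEN)   /* full vectors */
        for (int l = 0; l < VEC_LEN; ++l)
            acc[l] += x[i + l] * y[i + l];

    if (i < n) {                             /* one masked step for the tail */
        uint8_t m = tail_mask(n - i);
        masked_load(xv, x + i, m);
        masked_load(yv, y + i, m);
        for (int l = 0; l < VEC_LEN; ++l)
            acc[l] += xv[l] * yv[l];
    }

    double sum = 0.0;
    for (int l = 0; l < VEC_LEN; ++l)
        sum += acc[l];
    return sum;
}
```

The tail is handled by a single vector-shaped step instead of a per-element scalar loop, which is the effect the masked loads are meant to achieve.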
Change-Id: I998651eeb1083caf3308f1b45bd7d55b7974bcb4
Segfault was reported through the nightly Jenkins job.
Issue was observed when running in MT mode.
Issue was due to an extra broadcast being used.
The extra broadcast would access out-of-bounds memory on the input buffer.
Cleaned up clobber list by removing unused registers.
AMD_Internal: [CPUPL-4180]
Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b
* commit '5013a6cb':
More edits and fixes to docs/FAQ.md.
Fixed newly broken link to CREDITS in FAQ.md.
More minor fixes to FAQ.md and Sandboxes.md.
Updates to FAQ.md, Sandboxes.md, and README.md.
Safelist 'master', 'dev', 'amd' branches.
Re-enable and fix fb93d24.
Reverted fb93d24.
Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
Removed last vestige of #define BLIS_NUM_ARCHS.
Added new packm var3 to 'gemmlike'.
Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
Fix more copy-paste errors in the haswell gemmsup code.
Do a fast test on OSX. [ci skip]
Fix AArch64 tests and consolidate some other tests.
Use C++ cross-compiler for ARM tests.
Attempt to fix cxx-test for OOT builds.
Updated travis-ci.org link in README.md to .com.
Disabled (at least temporarily) commit 8e0c425.
Define BLIS_OS_NONE when using --disable-system.
Updated stale calls to malloc_intl() in gemmlike.
Blacklist clang10/gcc9 and older for 'armsve'.
Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
Moved lang defs from _macro_def.h to _lang_defs.h.
Minor tweaks to gemmlike sandbox.
Added local _check() code to gemmlike sandbox.
README.md citation updates (e.g. BLIS7 bibtex).
Tweaks to gemmlike to facilitate 3rd party mods.
Whitespace tweaks.
Add row- and column-strides for A/B in obj_ukr_fn_t.
Clean up some warnings that show up on clang/OSX.
Remove schema field on obj_t (redundant) and add new API functions.
Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
Disabled sanity check in bli_pool_finalize().
Implement proposed new function pointer fields for obj_t.
AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
Added all fringe kernels with mask load store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format
- New edge kernels that use masked load-store
instructions for handling corner cases.
- Mask load-store instruction macros are added.
vmaskmovps, VMASKMOVPS for masked load-store.
- It improves performance by reducing branching overhead
and by being more cache friendly.
- Mask load-store is added only for row storage format
AMD-Internal: [CPUPL-4041]
Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
- The following optimizations are included for the dgemm 6x8 native kernel.
1) Reorganized the C update and store to reduce register dependencies.
2) Moved the C prefetch to part-way through the kernel to prefetch
the C matrix at an appropriate distance.
3) Offset the A matrix so that the kernel can use a smaller instruction
encoding, saving i-cache space.
4) Aligned the K iteration loop.
- Thanks to Moore, Branden <Branden.Moore@amd.com> for these design
changes of DGEMM 6x8 native kernels.
- Additional change, reorganization of C update and store for
beta zero case to facilitate out of order execution of storing of C
matrix.
Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba
- This commit helps improve performance for very small inputs
by reducing framework checks and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single-threaded computation
for such sizes.
- It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and ZEN4
code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such small inputs are
routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel.
AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
a layer higher.
- Added a scalar loop to handle compute in case of non-unit strides.
This loop ensures functionality in case packing fails at the
framework level.
AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
- Updated the final reduction of partial sums( AVX-2 code section )
to use scalar accumulation entirely, instead of using the
_mm256_hadd_pd( ... ) intrinsic. This will in turn change the
associativity in the reduction step.
- Reverted to using scalar code on the fringe cases in AVX-2 kernel
for DNRM2 and DZNRM2, for improving functional correctness.
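
To illustrate why changing the reduction order changes associativity, here is a small sketch comparing a sequential scalar reduction with the pairwise grouping that a horizontal-add step produces. The function names are hypothetical and this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential scalar reduction: ((p0 + p1) + p2) + p3. */
double reduce_sequential(const double p[4])
{
    assert(p != NULL);
    double s = p[0];
    s += p[1];
    s += p[2];
    s += p[3];
    return s;
}

/* Pairwise reduction: (p0 + p1) + (p2 + p3), the grouping obtained
   when adjacent lanes are added first. */
double reduce_pairwise(const double p[4])
{
    assert(p != NULL);
    double lo = p[0] + p[1];
    double hi = p[2] + p[3];
    return lo + hi;
}
```

With partial sums of very different magnitudes, such as { 1e20, 1.0, -1e20, 1.0 }, the two groupings round differently, so results can differ in the last bits or more.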
AMD-Internal: [CPUPL-4049]
Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5
- This commit implements an avx512 dgemm kernel for k=1 cases,
which gets called for the zen4 code path.
- Added an architecture check for the k=1 kernel in the dgemm code path
to pick the correct kernel based on CPU architecture, since
BLIS now has avx2 and avx512 dgemm kernels for the k=1 case.
- Previously in the dgemm path, the bli_dgemm_8x6_avx2_k1_nn kernel was
being called irrespective of architecture type.
- Added an architecture check before calling the kernel for the case where
k=1, so this kernel is invoked only for the respective architectures.
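
For context, with k=1 and no transposes the GEMM reduces to a rank-1 update, which is why it can get a dedicated kernel. A scalar reference sketch, assuming column-major storage; dgemm_k1_ref is a hypothetical name and the architecture dispatch itself is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* C := beta*C + alpha * a * b^T, where a is m x 1 and b is 1 x n.
   Column-major storage: element (0, j) of B sits at b[j * ldb]. */
void dgemm_k1_ref(size_t m, size_t n, double alpha,
                  const double *a, const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    assert(ldc >= m && ldb >= 1);
    for (size_t j = 0; j < n; ++j) {
        double ab = alpha * b[j * ldb];   /* scale the B element once per column */
        for (size_t i = 0; i < m; ++i) {
            double cij = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
            c[i + j * ldc] = cij + a[i] * ab;
        }
    }
}
```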
AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
A_next prefetch, B_next prefetch and c prefetch.
- B prefetch, B_next prefetch and C prefetch
are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
present in the bli_gemv_unf_var2_amd.c file, as part of the
bli_sgemv_unf_var2( ... ) function. This was changed to
bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
from 6 to 5 ), in accordance to the function pointer present
in the zen3 and zen4 context files.
- Changed the accumulator type to double from float, inside the
fringe loop for unit-strides(vectorized path) and non-unit strides
(scalar code).
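
A small sketch of why a wider accumulator helps, not taken from the kernel. The names sum_f32 and sum_f64acc are hypothetical; the second function accumulates float inputs in a double and rounds once at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate in float: every addition rounds to float precision. */
float sum_f32(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Accumulate in double and round to float only once, at the end. */
float sum_f64acc(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i];
    return (float)acc;
}
```

For an input such as 1e8f followed by eight 1.0f values, the double accumulator keeps the small terms, while a float accumulator typically drops each of them because 1.0f is below half the spacing between floats near 1e8.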
AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
-The zero point data type is different based on the downscale data
type. For int8_t downscale type, zero point type is int8_t whereas for
uint8_t downscale type, it is uint8_t. During the downscale post-op, the
micro-kernels upscale the zero point from its data type (int8_t or
uint8_t) to that of the accumulation data type and then perform the
zero point addition. The accumulated output is then stored as downscaled
type in a later storage phase. For the <u|s>8s8s16 micro-kernels, the
upscaling to int16_t (accumulation type) is always performed assuming
the zero point is int8_t using the _mm256_cvtepi8_epi16 instruction.
However this will result in incorrect upscaled zero point values if the
downscale type is uint8_t and the associated zero point type is also
uint8_t. This issue is corrected by switching between the correct
upscale instruction based on the zero point type.
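
The difference between the two upscale paths can be shown with a scalar sketch: sign extension for an int8_t zero point versus zero extension for a uint8_t one. The function upscale_zero_point is a hypothetical illustration, not the micro-kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Widen the j-th zero point to the int16_t accumulation type.
   zp_is_u8 selects zero extension; otherwise the byte is sign extended. */
int16_t upscale_zero_point(const void *zp, size_t j, bool zp_is_u8)
{
    assert(zp != NULL);
    if (zp_is_u8)
        return (int16_t)((const uint8_t *)zp)[j];   /* 0xC8 -> 200 */
    return (int16_t)((const int8_t *)zp)[j];        /* 0xC8 -> -56 */
}
```

A uint8_t zero point of 200 widened through the signed path would become -56, which is the kind of incorrect value described above.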
AMD-Internal: [SWLCSG-2500]
Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076
Description:
1. Updated ERF function threshold from 3.91920590400 to 3.553
to match the reference erf float implementation, which
reduced errors at the borders, and also clipped the output
to 1.0
2. Updated packa function call with pack function ptr in bf16
api to avoid compilation issues for non avx512bf16 archs
3. Updated lpgemm bench
[AMD-Internal: SWLCSG-2423 ]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
- Registers ZMM16 to ZMM31 are zeroed after L3 API calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.
AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
- For k > KC, C matrix is getting scaled by beta on each
iteration. It should be scaled only once. Fixed the scaling
of C matrix by beta in K loop.
- Corrected A and B matrix buffer offsets, for cases where k > KC.
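
A reference sketch of the intended behaviour when k is split into KC-sized blocks: beta is applied only on the first block and the A/B offsets advance with the block start. This is a simplified row-major illustration with hypothetical names, not the fixed kernel.

```c
#include <assert.h>
#include <stddef.h>

/* C := beta*C + alpha*A*B with the k dimension processed in blocks of kc.
   A is m x k, B is k x n, C is m x n, all row-major and contiguous. */
void gemm_kc_blocked(size_t m, size_t n, size_t k, size_t kc,
                     float alpha, const float *a, const float *b,
                     float beta, float *c)
{
    assert(k > 0 && kc > 0);
    for (size_t pc = 0; pc < k; pc += kc) {
        size_t kb = (k - pc < kc) ? (k - pc) : kc;
        float beta_use = (pc == 0) ? beta : 1.0f;  /* scale C only once */
        const float *a_blk = a + pc;               /* skip pc columns of A */
        const float *b_blk = b + pc * n;           /* skip pc rows of B */
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t p = 0; p < kb; ++p)
                    acc += a_blk[i * k + p] * b_blk[p * n + j];
                c[i * n + j] = beta_use * c[i * n + j] + alpha * acc;
            }
        }
    }
}
```

Applying beta on every block would scale the partial results accumulated by earlier blocks as well, which is the bug described above.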
AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
- Updated the bli_dnormfv_unb_var1( ... ) and
bli_znormfv_unb_var1( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Added the logic to distribute the job among the threads such
that only one thread has to deal with fringe case(if required).
The remaining threads will execute only the AVX-2 code section
of the computational kernel.
- Added reduction logic post parallel region, to handle overflow
and/or underflow conditions as per the mandate. The reduction
for both the APIs involve calling the vectorized kernel of
dnormfv operation.
- Added changes to the kernel to have the scaling factors and
thresholds prebroadcasted onto the registers, instead of
broadcasting every time on a need basis.
- Non-unit stride cases are packed to be redirected to the
vectorized implementation. In case the packing fails, the
input is handled by the fringe case loop in the kernel.
- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
cases of size = 2 ( and ) size = 1 or non-unit strides respectively.
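
One possible way to split the work so that only one thread sees a fringe is sketched below. thread_range and VEC_ELEMS are hypothetical, and the block size of 4 assumes doubles in an AVX-2 register; the real kernel may use a larger unroll.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_ELEMS 4 /* assumed: doubles per AVX-2 register */

/* Give each thread a whole number of vector blocks; the leftover
   n % VEC_ELEMS elements go to the last thread only. */
void thread_range(size_t n, size_t nt, size_t tid,
                  size_t *start, size_t *len)
{
    assert(nt > 0 && tid < nt);
    size_t blocks = n / VEC_ELEMS;
    size_t base   = blocks / nt;
    size_t extra  = blocks % nt;              /* first 'extra' threads get one more */
    size_t mine   = base + (tid < extra ? 1 : 0);
    size_t first  = tid * base + (tid < extra ? tid : extra);

    *start = first * VEC_ELEMS;
    *len   = mine * VEC_ELEMS;
    if (tid == nt - 1)
        *len += n % VEC_ELEMS;                /* fringe handled by one thread */
}
```

Each thread would then compute a partial norm over its range, and the partial results would be combined after the parallel region.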
AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
- This commit focuses on enhancing the performance of dgemm
for matrices with very small dimensions.
- The blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. The SUP framework code path
requires the creation and initialization of blis objects,
accessing all the needed meta-information from objects, and querying contexts,
which adds a performance penalty when computing for matrices with very
small dimensions.
- To avoid this performance penalty, the blis_dgemm_tiny function implements
lightweight support code so that it can re-use dgemm SUP kernels in such a way
that it operates directly on input buffers. It avoids the framework overhead of
creating and initializing blis objects, context initialization, and accessing other
large framework data structures.
- The blis_dgemm_tiny function checks that a threshold condition is met before
picking the kernel. For the zen, zen2 and zen3 architectures, the tiny kernel is invoked
for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <= 1500.
For zen4, the tiny kernel is invoked as long as m, n and k are all less than 1500.
- The blis_dgemm_tiny function supports only single-threaded computation as of now.
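
The threshold check can be read as the predicate sketched below. The enum and function name are hypothetical, and the grouping of the "and"/"or" terms is an assumption based on the wording above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical architecture tag used only for this illustration. */
typedef enum { ARCH_ZEN, ARCH_ZEN2, ARCH_ZEN3, ARCH_ZEN4 } arch_t;

/* Returns true when the tiny path would be taken under the stated thresholds. */
bool use_dgemm_tiny(arch_t arch, long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    if (arch == ARCH_ZEN4)
        return m < 1500 && n < 1500 && k < 1500;
    /* zen, zen2, zen3: assumed grouping (A and B) or (C and D and E) */
    return (m < 8 && k <= 1500) ||
           (m < 1000 && n <= 24 && k <= 1500);
}
```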
AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
- For edge kernels, which handle the corner cases, and especially
for cases where there is a really small amount of computation to
be done, executing FMAs efficiently becomes crucial.
- In the previous implementation, edge kernels were using the same limited
number of vector registers to hold FMA results, which indirectly creates
a dependency on the previous FMA completing before the CPU can issue a new FMA.
- This commit addresses this issue by using different vector registers
that are available to hold FMA results.
- That way we hold FMA results in two sets of vector registers, so that
a subsequent FMA won't have to wait for the previous FMA to complete.
- At the end of the unrolled K loop, these two sets of vector registers are
added together to store the correct result in the intended vector registers.
- Following kernels are modified:
bli_dgemmsup_rv_zen4_asm_24x4m,
bli_dgemmsup_rv_zen4_asm_24x3m,
bli_dgemmsup_rv_zen4_asm_24x2m,
bli_dgemmsup_rv_zen4_asm_24x1m,
bli_dgemmsup_rv_zen4_asm_24x1,
bli_dgemmsup_rv_zen4_asm_16x1,
bli_dgemmsup_rv_zen4_asm_8x1,
bli_dgemmsup_rv_zen4_asm_24x2,
bli_dgemmsup_rv_zen4_asm_16x2,
bli_dgemmsup_rv_zen4_asm_8x2,
bli_dgemmsup_rv_zen4_asm_24x3,
bli_dgemmsup_rv_zen4_asm_16x3,
bli_dgemmsup_rv_zen4_asm_8x3,
bli_dgemmsup_rv_zen4_asm_16x4,
bli_dgemmsup_rv_zen4_asm_8x4,
bli_dgemmsup_rv_zen4_asm_16x5,
bli_dgemmsup_rv_zen4_asm_8x5,
bli_dgemmsup_rv_zen4_asm_16x6,
bli_dgemmsup_rv_zen4_asm_8x6,
bli_dgemmsup_rv_zen4_asm_8x7,
bli_dgemmsup_rv_zen4_asm_8x8
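
The two-accumulator idea can be shown in scalar form. This sketch uses a dot product as a stand-in for the unrolled K loop; dot_two_acc is a hypothetical name and the real kernels do this with vector registers in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Alternate between two independent accumulators so that consecutive
   multiply-adds do not depend on each other, then combine at the end. */
double dot_two_acc(const double *x, const double *y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        acc0 += x[i]     * y[i];       /* chain 0 */
        acc1 += x[i + 1] * y[i + 1];   /* chain 1, independent of chain 0 */
    }
    if (i < n)                         /* odd leftover iteration */
        acc0 += x[i] * y[i];

    return acc0 + acc1;                /* final add of the two sets */
}
```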
AMD-Internal: [CPUPL-3574]
Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple ways to prefetch c are supported.
- prefetch_a and prefetch_c are enabled by
default.
- K loop is divided into multiple subloops for
better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
- Added an explicit function definition for ZGEMV var 1. This
removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
instead of querying it.
- With this change fringe loop is vectorized using SSE
instructions.
AMD-Internal:[CPUPL-3997]
Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
column-major input format.
- Updated Early-return check conditions to account for trans
parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
logic. The kernel performs two different sets of compute based
on the value of alpha, for both unit and non-unit strides. There
are no constraints on beta scaling of the 'y' vector.
- Updated the logic to support 'x' conjugate in the computation.
The kernel supports conjugate/no conjugate operation through the
usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
intrinsics.
- Updated the early return condition in the kernel to adhere to
the standard compliance.
- Updated the scalar computation with vector computation(using 128
bit registers), in case of dealing with a single element(fringe case)
in unit-stride or vectors with non-unit strides. A single dcomplex
element occupies 128 bits in memory, thereby providing scope for
this optimization.
- Added accuracy and extreme value testing with sufficient sizes
and initializations, to test the required main and fringe cases
of the computation.
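
A scalar reference for the operation, y := beta*y + alpha*conjx(x), can be useful when checking the vectorized kernel. The zcomplex type and function names below are hypothetical, and only positive strides are handled in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { double real, imag; } zcomplex; /* hypothetical type */

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.real * b.real - a.imag * b.imag,
                   a.real * b.imag + a.imag * b.real };
    return r;
}

/* y := beta*y + alpha*x, or alpha*conj(x) when conjx is true. */
void zaxpby_ref(bool conjx, size_t n,
                zcomplex alpha, const zcomplex *x, size_t incx,
                zcomplex beta, zcomplex *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    for (size_t i = 0; i < n; ++i) {
        zcomplex xi = x[i * incx];
        if (conjx)
            xi.imag = -xi.imag;        /* conjugate flips the imaginary sign */
        zcomplex ax = zmul(alpha, xi);
        zcomplex by = zmul(beta, y[i * incy]);
        y[i * incy].real = ax.real + by.real;
        y[i * incy].imag = ax.imag + by.imag;
    }
}
```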
AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
Description:
The expf_max and expf_min have more precision than
the computation, which leads to crossing the clipping bounds at
the edge cases and causes NaNs in the tanh output.
Updated the thresholds to lower precision to clip the
edge cases and avoid NaNs in the tanh output.
AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
-Downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.
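
A scalar sketch of the formula above for the int8 case. The function downscale_s8 is hypothetical, and the rounding (half away from zero) and saturation behaviour are assumptions made for illustration; the micro-kernels may round differently.

```c
#include <assert.h>
#include <stdint.h>

/* x' = (x / scale_factor) + zero_point, rounded and saturated to int8_t. */
int8_t downscale_s8(int32_t acc, float scale_factor, int8_t zero_point)
{
    assert(scale_factor != 0.0f);
    float v = (float)acc / scale_factor + (float)zero_point;

    /* Round half away from zero (assumed rounding mode). */
    v = (v >= 0.0f) ? v + 0.5f : v - 0.5f;

    /* Saturate to the int8_t range before truncating. */
    if (v >= 127.0f)
        return 127;
    if (v <= -128.0f)
        return -128;
    return (int8_t)v;
}
```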
AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
- Incorrect ymm registers were used in the dgemm SUP edge kernel
while computing the FMA operation.
- The incorrect vector registers led to incorrect results.
- Corrected vector register usage for the FMA operation.
AMD-Internal: [CPUPL-3964]
Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df
- The AVX512 SGEMM SUP rv m and n kernels did not account for the
use of panel strides in case of packed matrices, thus resulting in
incorrect matrix strides when packing was explicitly enabled using
BLIS_PACK_A=1, BLIS_PACK_B=1 or both.
- The kernels are updated to use panel strides for traversing both A
and B matrix buffers accurately.
[AMD-Internal]: CPUPL-3673
Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb
- For edge kernels, which handle the corner cases, and especially
for cases where there is a really small amount of computation to
be done, executing FMAs efficiently becomes crucial.
- In the previous implementation, edge kernels were using the same limited
number of vector registers to hold FMA results, which indirectly creates
a dependency on the previous FMA completing before the CPU can issue a new FMA.
- This commit addresses this issue by using different vector registers
that are available to hold FMA results.
- That way we hold FMA results in two sets of vector registers, so that
a subsequent FMA won't have to wait for the previous FMA to complete.
- At the end of the unrolled K loop, these two sets of vector registers are
added together to store the correct result in the intended vector registers.
AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- Added call to dsetv in dscalv. When DSCALV is invoked by
DGEMV the SCAL function is expected to SET the vector to
zero when alpha is 0. This change is done to ensure BLAS
compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.
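
The reason for setting rather than scaling when alpha is 0 is that multiplying by zero would keep NaN or Inf entries. A minimal sketch with the hypothetical name dscal_ref, assuming a positive stride:

```c
#include <assert.h>
#include <stddef.h>

/* x := alpha * x, except that alpha == 0 overwrites x with zeros so that
   NaN/Inf values already in x do not survive (0 * NaN is NaN). */
void dscal_ref(size_t n, double alpha, double *x, size_t incx)
{
    assert((x != NULL || n == 0) && incx > 0);
    if (alpha == 0.0) {
        for (size_t i = 0; i < n; ++i)
            x[i * incx] = 0.0;         /* setv path */
        return;
    }
    for (size_t i = 0; i < n; ++i)
        x[i * incx] *= alpha;          /* regular scal path */
}
```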
AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
- In GEMV variant 1, the input matrix A is in row major. X vector
has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
operation inside the kernel has been ensured by packing the input X
vector to a temporary buffer with unit stride. Currently, the
packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
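
The packing step can be sketched as follows for the real-valued case: x is copied into a unit-stride buffer and scaled by alpha on the way, then the row-major dot products run over contiguous data. dgemv_rowmajor_packed_x is a hypothetical reference routine, not the BLIS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* y := beta*y + alpha*A*x with A row-major (m x n) and x of stride incx.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int dgemv_rowmajor_packed_x(size_t m, size_t n, double alpha,
                            const double *a, size_t lda,
                            const double *x, size_t incx,
                            double beta, double *y)
{
    assert(n > 0 && lda >= n && incx > 0);
    double *xp = malloc(n * sizeof *xp);
    if (xp == NULL)
        return -1;

    for (size_t j = 0; j < n; ++j)     /* pack and scale: xp := alpha * x */
        xp[j] = alpha * x[j * incx];

    for (size_t i = 0; i < m; ++i) {   /* unit-stride dot product per row */
        double dot = 0.0;
        for (size_t j = 0; j < n; ++j)
            dot += a[i * lda + j] * xp[j];
        y[i] = beta * y[i] + dot;
    }

    free(xp);                          /* release the temporary buffer */
    return 0;
}
```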
AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
More missing clobbers in skx and zen4 kernels, missed in
previous commits.
AMD-Internal: [CPUPL-3521]
Change-Id: I838240f0539af4bf977a10d20302a40c34710858
- In variant 2 of GEMV, A matrix is in column major. Y vector has
to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
GEMV operation inside the kernel has been ensured by packing the
input Y vector to a temporary buffer with unit stride. As part of
the packing Y is scaled by beta to reduce the number of times Y
vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
buffer are copied to the original buffer and the temporary one is
released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
ZGEMM. The kernel is built for handling the GEMM computation
with inputs having k = 1, and the transpose values for A and
B as N.
- The kernel dimension has been changed from 4x6 to 4x4,
due to the following reasons :
- The 1xNR block of B in the n-loop can be reused over multiple
MRx1 blocks of A in the m-loop during computation. Similar
analogy exists for the fringe cases.
- Every 1xNR block of B was scaled with alpha and stored in
registers before traversing in the m-dimension. Similar change
was done for fringe cases in n-dimension.
- These registers should not be modified during compute, hence
the kernel dimension was changed from 4x6 to 4x4.
- The check for early exit(with regards to BLAS mandate) has been
removed, since it is already present in the BLAS layer.
- The check for parallel ZGEMM has been moved post the redirection to
this kernel, since the kernel is single-threaded.
- The bli_kernels_zen.h file was updated with the new kernel signature.
AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
instructions.
- Removed the negative incx handling checks from the blis_impli layer
of ZDSCAL as BLAS expects early return for incx <= 0.
AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
followed by addsub intrinsics.
- Removed the negative incx handling checks from the BLAS impli
layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.
AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory.
The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries.
AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52