- The following optimizations are included for the dgemm 6x8 native kernel:
1) Reorganized the C update and store to reduce register dependencies.
2) Moved the C prefetch to part-way through the kernel so that the
C matrix is prefetched at an appropriate distance.
3) Offset the A matrix so that the kernel can use a smaller instruction
encoding, saving i-cache space.
4) Aligned the K iteration loop.
- Thanks to Moore, Branden <Branden.Moore@amd.com> for these design
changes of DGEMM 6x8 native kernels.
- Additional change: reorganized the C update and store for the
beta zero case to facilitate out-of-order execution of the C matrix
stores.
Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba
- Added a conditional check to see if the vectorized kernels
for DNRM2_ and DZNRM2_ can be called directly, without
incurring any framework overhead.
- The condition for this fast path is that the size is such that the
ideal number of threads required is 1, and that the vector has
unit stride (so that packing at the framework level can be avoided).
AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
1. Added input parameter checking for the extension APIs
1. gemm_pack_get_size API
2. gemm_pack API
2. Additionally, added early returns for these APIs when the
m or n dimension is 0.
3. Routines for input parameter checks for all 3
BLAS extension APIs - gemm_pack_get_size, gemm_pack and
gemm_compute - are defined in:
frame/compat/check/bla_gemm_pack_compute_check.h
4. Added AOCL DTL TRACE for all of the following functions:
1. gemm_pack_get_size
2. gemm_pack
3. gemm_compute
AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
Details:
- Moved the downscale & postop options from the command line to the
input file.
- Now the format of the input file is as follows:
dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops
- In the case of no postops, 'none' has to be passed in the place of
postops.
- Removed duplication of mat_mul_bench_main function for bf16 APIs.
- Added a function called print_matrix for each datatype, which can
help in printing matrices while debugging.
- Added printing of ref, computed and diff values while reporting
failure.
- Added new functions for memory allocation and freeing. The type of
memory allocation is chosen based on the mode in which the bench is
running (performance or accuracy).
Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72
The LPGEMM micro-kernel operates on blocks of dimension MRxKC and
KCxNR. Current LPGEMM design involves using all the available threads
for computing the output. If the number of threads assigned along the
ic or jc direction is more than the number of M/MR or N/NR blocks
respectively, it could result in threads sleeping due to the lack of
MR or NR blocks. This scenario is now handled by reducing the number
of threads if there are threads without any work (MR or NR blocks).
AMD-Internal: [SWLCSG-2354, SWLCSG-2389, SWLCSG-2267]
Change-Id: I74819337c7a0d3ab05ea0e18bb42780f977ea8f6
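The thread-capping logic above can be sketched as follows; `lpgemm_cap_threads` is a hypothetical name for illustration, and the real LPGEMM partitioning logic is more involved:

```c
#include <assert.h>

/* Hypothetical sketch: cap the thread count along one partitioning
 * direction so that no thread is left without an MR (or NR) block.
 * dim is M (or N), blk is MR (or NR). Names are illustrative, not
 * the actual LPGEMM internals. */
static int lpgemm_cap_threads(int n_threads, int dim, int blk)
{
    int n_blocks = (dim + blk - 1) / blk;  /* ceil(dim / blk) */
    return (n_threads > n_blocks) ? n_blocks : n_threads;
}
```

The same cap is applied independently along the ic and jc directions, so the product of the two capped counts never exceeds the number of available micro-tile blocks.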
We are using pthread_self to get a thread id for use in the DTL
tracing functionality to name individual output files per thread.
This is not an appropriate use of pthread_self as its return type
(pthread_t) is an opaque type that can vary between implementations.
On Linux we haven't had a problem, as pthread_t is an unsigned long
int. However, on FreeBSD it is a pointer to an empty struct. The
difference between this and the int type we used for its value within
the BLIS code was causing a compile error.
The best long-term solution would be for pthread builds to maintain
their own internal thread id. A mechanism to implement this has not
yet been identified. In the meantime, we make the following changes as
a stopgap:
- Explicitly cast from pthread_t return value to our BLIS internal
data type AOCL_TID.
- Make AOCL_TID a long int rather than pid_t (i.e. an int) in pthread
builds to match the sizes expected on both Linux and FreeBSD.
AMD-Internal: [CPUPL-4167]
Change-Id: Ia07ee8f97273cc3bab46f6bca1eeb7954320415b
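The stopgap cast can be sketched as below. The AOCL_TID definition follows the commit description, but `aocl_get_tid` is an illustrative helper, not the actual DTL code:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stopgap only: AOCL_TID as a long int, wide enough to
 * hold the pthread_t bit pattern on both Linux (unsigned long) and
 * FreeBSD (a struct pointer). */
typedef long int AOCL_TID;

static AOCL_TID aocl_get_tid(void)
{
    /* Explicit cast from the opaque pthread_t to the internal type. */
    return (AOCL_TID)pthread_self();
}
```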
- This commit helps improve performance for very small inputs
by reducing framework checks and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single-threaded computation
for such sizes.
- It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and
ZEN4 code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such small inputs are
routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel.
AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
- According to the BLAS standard, SCALV should return when incx .le. 0.
- To make SCALV compliant, added an early return inside the BLAS
layer for the cases where incx <= 0.
- Also, added early return for the case where alpha is a unit scalar.
AMD-Internal: [CPUPL-3562]
Change-Id: Id474fdd6ed9232226f5c5381d0398f43384e4a49
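A minimal sketch of the early-return conditions described above, assuming the usual reference-BLAS convention of also returning for n <= 0; the function name is illustrative:

```c
#include <assert.h>

/* Returns 1 when the dscal_-style call can return early without
 * touching x: non-positive length or stride, or a unit alpha that
 * would leave x unchanged. Illustrative helper, not BLIS source. */
static int dscal_should_return_early(int n, double alpha, int incx)
{
    if (n <= 0)       return 1;
    if (incx <= 0)    return 1;   /* BLAS standard: no-op */
    if (alpha == 1.0) return 1;   /* unit scalar leaves x unchanged */
    return 0;
}
```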
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
a layer higher.
- Added a scalar loop to handle compute in case of non-unit strides.
This loop ensures functionality in case packing fails at the
framework level.
AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
- When configured for the haswell config, the warning "unused
variable 'zero'" was thrown during compilation.
- Removed the unused 'zero' variable.
AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
- Updated the final reduction of partial sums (AVX-2 code section)
to use scalar accumulation entirely, instead of using the
_mm256_hadd_pd( ... ) intrinsic. This will in turn change the
associativity in the reduction step.
- Reverted to using scalar code on the fringe cases in AVX-2 kernel
for DNRM2 and DZNRM2, for improving functional correctness.
AMD-Internal: [CPUPL-4049]
Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5
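The associativity change can be illustrated with a scalar stand-in for the lane reduction; the real kernel first extracts the four doubles from a YMM register of partial sums:

```c
#include <assert.h>

/* Scalar accumulation of four partial sums in a fixed left-to-right
 * order, replacing the pairwise order implied by _mm256_hadd_pd.
 * Illustrative stand-in, not the actual kernel code. */
static double reduce4_scalar(double p0, double p1, double p2, double p3)
{
    double s = p0;
    s += p1;  /* fixed associativity: ((p0 + p1) + p2) + p3 */
    s += p2;
    s += p3;
    return s;
}
```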
- This commit implements an avx512 dgemm kernel for k=1 cases,
which gets called in the zen4 code path.
- Added an architecture check for the k=1 kernel in the dgemm code
path to pick the correct kernel based on the CPU architecture,
since BLIS now has both avx2 and avx512 dgemm kernels for the k=1 case.
- Previously in the dgemm path, the bli_dgemm_8x6_avx2_k1_nn kernel
was being called irrespective of the architecture type.
- Added an architecture check before calling the kernel for the k=1
case, so this kernel is invoked only on the respective architectures.
AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
- AOCL dynamic logic that determines the number of threads to be
launched has been modified.
AMD-Internal: [CPUPL-3956]
Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc
- Pack and compute are now compared against GEMM operation of reference
library when MKL is not used as a reference.
- For the case where both A and B are unpacked, the reference GEMM is
invoked with a unit-alpha scalar.
- If MKL is used as reference, then these APIs are compared against pack
and compute operations of MKL.
- Updated description in ref_gemm_compute.cpp to reflect this behavior.
AMD-Internal: [CPUPL-4084]
Change-Id: Id0521c9cad8743a7ae471a7f3c547ceb67191f86
- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
A_next prefetch, B_next prefetch and C prefetch.
- B prefetch, B_next prefetch and C prefetch
are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
- BLAS compute checks updated to properly check for rs_c and cs_c.
- Updated BLAS compute checks to skip validity check if m==1 or n==1.
For the same reason, added a check just before to validate rs_c and
cs_c are greater than or equal to 1.
- Added tiny size tests to gtestsuite as a sanity check.
- Also updated the Invalid Input Tests to test for the updated checks.
AMD-Internal: [CPUPL-4140]
Change-Id: I984339ec7909778b58409ffcdbeed4ee33f28cfb
- Symbols for gemm_pack_get_size were not being exported properly when
BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
compute set of BLAS Extension APIs:
1. ?gemm_pack_get_size
2. ?gemm_pack
3. ?gemm_compute
AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
1. OpenMP-based multi-threading parallelism is added for the BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of the pack and compute APIs done with different
numbers of threads can lead to inconsistent results, because the
contents of the full packed matrix buffer differ when it is packed
with different numbers of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as they are accessed in
single-threaded execution; as a consequence, a thread with
jc_start inside a panel cannot consider the full NC range for
reorder.
3. It has to work with NC' < NC, and the offset is calculated
using the previous NC panels spanning the k dim + the current
NC panel spanning the current pc-loop iteration + (NC - NC')
spanning the current kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
framework queries the architecture ID and calls the
vectorized kernel based on the architecture support.
- In case of not having the architecture support, we use
the default path based on the sumsqv method.
AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
Details:
- Modified bench to support testing for sizes where matrix
strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
check validity of strides in relation to the corresponding
dimension rather than checking if strides are equal to dimensions.
Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
present in the bli_gemv_unf_var2_amd.c file, as part of the
bli_sgemv_unf_var2( ... ) function. This was changed to
bli_saxpyf_zen_int_5( ... ) (thereby changing the fuse factor
from 6 to 5), in accordance with the function pointer present
in the zen3 and zen4 context files.
- Changed the accumulator type to double from float, inside the
fringe loop for unit strides (vectorized path) and non-unit strides
(scalar code).
AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
- The zero point data type differs based on the downscale data
type: for an int8_t downscale type the zero point type is int8_t,
whereas for a uint8_t downscale type it is uint8_t. During the
downscale post-op, the micro-kernels upscale the zero point from its
data type (int8_t or uint8_t) to the accumulation data type and then
perform the zero point addition. The accumulated output is then stored
as the downscale type in a later storage phase. For the <u|s>8s8s16
micro-kernels, the upscaling to int16_t (the accumulation type) was
always performed assuming the zero point is int8_t, using the
_mm256_cvtepi8_epi16 instruction. However, this results in incorrect
upscaled zero point values if the downscale type is uint8_t and the
associated zero point type is also uint8_t. This issue is corrected by
selecting the correct upscale instruction based on the zero point type.
AMD-Internal: [SWLCSG-2500]
Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076
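A scalar stand-in for the fix (the real kernels use _mm256_cvtepi8_epi16 or _mm256_cvtepu8_epi16 on a vector of bytes at a time); the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Widen one zero-point byte to the int16_t accumulation type:
 * sign-extend when the zero point is int8_t, zero-extend when it is
 * uint8_t. is_unsigned selects the variant, mirroring the choice
 * between the epi8 and epu8 convert instructions. */
static int16_t upscale_zero_point(uint8_t zp_byte, int is_unsigned)
{
    if (is_unsigned)
        return (int16_t)zp_byte;           /* zero-extend  */
    else
        return (int16_t)(int8_t)zp_byte;   /* sign-extend  */
}
```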
Description:
1. Updated the ERF function threshold from 3.91920590400 to 3.553
to match the reference float erf implementation, which
reduced errors at the borders and also clipped the output
to 1.0
2. Updated the packa function call with the pack function pointer
in the bf16 api to avoid compilation issues for non-avx512bf16 archs
3. Updated the lpgemm bench
[AMD-Internal: SWLCSG-2423 ]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative
to BLIS_ARCH_TYPE. The details are:
1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both
supported, with BLIS_ARCH_TYPE taking precedence if both are set.
2. Values of "avx2" and "avx512" are aliases for the "zen3" and "zen4"
code paths respectively in AMD-focused builds, or for the "haswell"
and "skx" code paths respectively in Intel-focused builds. These
names are not case-sensitive.
3. BLIS_ARCH_TYPE specifies the code path to use. If this is
unsupported, e.g. zen4 code path on a Milan or earlier system,
that code path is still executed, likely resulting in an illegal
instruction error.
4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on
the system (for AVX2 and AVX512), and try a "lower" ISA option if
the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic.
5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set.
AMD-Internal: [CPUPL-4105]
Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315
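The fallback in point 4 can be sketched as follows, with an illustrative enum and CPU-feature flags (the real implementation queries CPUID and the BLIS arch ids):

```c
#include <assert.h>

/* Step down from the requested ISA to the best one the machine
 * supports: AVX512 -> AVX2 -> generic. Names are illustrative. */
enum isa { ISA_GENERIC = 0, ISA_AVX2 = 1, ISA_AVX512 = 2 };

static enum isa aocl_select_isa(enum isa requested,
                                int has_avx2, int has_avx512)
{
    if (requested == ISA_AVX512 && has_avx512) return ISA_AVX512;
    if (requested >= ISA_AVX2   && has_avx2)   return ISA_AVX2;
    return ISA_GENERIC;
}
```

This is the contrast with BLIS_ARCH_TYPE, which runs the requested code path unconditionally even when the ISA is unsupported.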
Current BLIS makefile always uses the static library on Linux for
all BLIS test programs. This commit adds the option to use the shared
library instead by specifying e.g.
make checkblis USE_SHARED=yes
Executables are generated in different sub-directories for static
and shared libraries.
AMD-Internal: [CPUPL-4107]
Change-Id: I3ab5d505cfbc5f6ef47aa28fcbb846c52d56c3f2
- Registers ZMM16 to ZMM31 are zeroed after L3 API calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.
AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
- For k > KC, the C matrix was getting scaled by beta on each
iteration of the K loop, whereas it should be scaled only once.
Fixed the scaling of the C matrix by beta in the K loop.
- Corrected the A and B matrix buffer offsets for cases where k > KC.
AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
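The beta-scaling fix amounts to using an effective beta of 1 on every KC iteration after the first; a minimal sketch with an illustrative helper name:

```c
#include <assert.h>

/* pc is the starting index of the current KC block along k.
 * Only the first block (pc == 0) applies the caller's beta; later
 * blocks accumulate onto C, i.e. use beta = 1. Illustrative only. */
static double effective_beta(double beta, int pc)
{
    return (pc == 0) ? beta : 1.0;
}
```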
Details:
- Updated pack function call in ic loop to accept correct params.
- Modified documentation in bench file to reflect updated usage of
bench for downscaled APIs.
- Modified memory allocation for C panel in BF16 APIs to use
BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool.
Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5
- When the bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_
<A_BLOCK|B_PANEL|C_PANEL>, the memory is allocated by checking out a
block from an internal memory pool. In order to ensure thread safety,
the memory pool checkout is protected by a mutex (bli_pba_lock/
bli_pba_unlock). When the number of threads trying to check out memory
(in parallel) is high, these locks tend to become a scaling bottleneck,
especially when the memory is to be used for non-packing purposes
(packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m
with BLIS_BUFFER_FOR_C_PANEL to check out memory when downscale is
enabled for temporary C accumulation. This multi-threaded lock overhead
becomes prominent when the m/n dimensions are relatively small, even
when k is large. To address this, bli_pba_acquire_m is now used with
BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is
allocated using an aligned malloc instead of checking out from the
memory pool. Experiments have shown malloc costs to be far lower than
a memory pool guarded by locks, especially at higher thread counts.
- LPGEMM bench fixes for a crash observed when benchmarking with
post-ops enabled and no downscale.
AMD-Internal: [SWLCSG-2354]
Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
- Added framework for unit testing of BLAS and CBLAS interfaces for the
Pack and Compute Extension APIs.
- These test the integrated functionality of the trio of
?gemm_pack_get_size(), ?gemm_pack() and ?gemm_compute() APIs.
- Note: Only MKL can be used as reference for now.
AMD-Internal: [CPUPL-3560]
Change-Id: I801654447a716da06c9ccf9db01d553817871571
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
is controlled by setting the environment variable
BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
assuming xerbla was not set to stop on error. Example call is
info = bli_info_get_info_value();
The default behaviour remains to print but don't stop on error,
i.e. the equivalent to
export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0
Implementation details:
- Values of the environment variables are stored and retrieved
from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
to 0 during initialization for all calls and updated by xerbla
if an error has occurred.
- A call to bli_init_auto before calling the PASTEBLACHK macro (which
calls xerbla) will reinitialize info_value to 0 via a call to
bli_thread_update_rntm_from_env
AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme don't match, and the respective matrix being loaded isn't
packed either, on-the-fly packing has been enabled to pack that
matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to single-threaded execution
only. Both the pack and compute APIs are forced to take n_threads=1.
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
- Updated the bli_dnormfv_unb_var1( ... ) and
bli_znormfv_unb_var1( ... ) functions to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Added the logic to distribute the job among the threads such
that only one thread has to deal with the fringe case (if required).
The remaining threads will execute only the AVX-2 code section
of the computational kernel.
- Added reduction logic post parallel region, to handle overflow
and/or underflow conditions as per the mandate. The reduction
for both the APIs involve calling the vectorized kernel of
dnormfv operation.
- Added changes to the kernel to have the scaling factors and
thresholds prebroadcasted onto the registers, instead of
broadcasting every time on a need basis.
- Non-unit stride cases are packed to be redirected to the
vectorized implementation. In case the packing fails, the
input is handled by the fringe case loop in the kernel.
- Added the SSE implementation in the bli_dnorm2fv_unb_var1_avx2( ... )
and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
cases of size = 2 and size = 1, or non-unit strides, respectively.
AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
- This commit focuses on enhancing the performance of dgemm
for matrices with very small dimensions.
- The blis_dgemm_tiny function re-uses the dgemm SUP kernels,
bypassing the conventional SUP framework code path. The SUP framework
code path requires the creation and initialization of BLIS objects,
accessing all the needed meta-information from those objects, and
querying contexts, which adds a performance penalty when computing
with matrices of very small dimensions.
- To avoid this penalty, the blis_dgemm_tiny function implements
lightweight support code so that it can re-use the dgemm SUP kernels
in such a way that it operates directly on the input buffers. It
avoids the framework overhead of creating and initializing BLIS
objects, context initialization, and accessing other large framework
data structures.
- The blis_dgemm_tiny function checks that a threshold condition is
met before picking the kernel. For the zen, zen2 and zen3
architectures, the tiny kernel is invoked for any shape as long as
m < 8 and k <= 1500, or m < 1000, n <= 24 and k <= 1500. For zen4,
it is invoked as long as m, n and k are all less than 1500.
- The blis_dgemm_tiny function supports single-threaded computation
as of now.
AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
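The threshold checks can be sketched as below; the grouping of the conditions is one reading of the commit text, and the function name is illustrative:

```c
#include <assert.h>

/* Decide whether the tiny path is taken, per the commit text:
 * zen/zen2/zen3: (m < 8 and k <= 1500) or
 *                (m < 1000 and n <= 24 and k <= 1500);
 * zen4:          m, n, k all < 1500.
 * Illustrative helper, not the BLIS source. */
static int dgemm_tiny_eligible(int is_zen4, int m, int n, int k)
{
    if (is_zen4)
        return m < 1500 && n < 1500 && k < 1500;
    return (m < 8 && k <= 1500) ||
           (m < 1000 && n <= 24 && k <= 1500);
}
```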
- For edge kernels, which handle the corner cases and especially
cases where there is only a small amount of computation to be done,
executing FMAs efficiently becomes very crucial.
- In the previous implementation, edge kernels used the same limited
number of vector registers to hold FMA results, which indirectly
creates a dependency on the previous FMA completing before the CPU
can issue a new FMA.
- This commit addresses the issue by using the other vector registers
available to hold FMA results.
- This way, FMA results are held in two sets of vector registers, so
that subsequent FMAs do not have to wait for the previous FMA to
complete.
- At the end of the unrolled K loop, these two sets of vector
registers are added together to store the correct result in the
intended vector registers.
- Following kernels are modified:
bli_dgemmsup_rv_zen4_asm_24x4m,
bli_dgemmsup_rv_zen4_asm_24x3m,
bli_dgemmsup_rv_zen4_asm_24x2m,
bli_dgemmsup_rv_zen4_asm_24x1m,
bli_dgemmsup_rv_zen4_asm_24x1,
bli_dgemmsup_rv_zen4_asm_16x1,
bli_dgemmsup_rv_zen4_asm_8x1,
bli_dgemmsup_rv_zen4_asm_24x2,
bli_dgemmsup_rv_zen4_asm_16x2,
bli_dgemmsup_rv_zen4_asm_8x2,
bli_dgemmsup_rv_zen4_asm_24x3,
bli_dgemmsup_rv_zen4_asm_16x3,
bli_dgemmsup_rv_zen4_asm_8x3,
bli_dgemmsup_rv_zen4_asm_16x4,
bli_dgemmsup_rv_zen4_asm_8x4,
bli_dgemmsup_rv_zen4_asm_16x5,
bli_dgemmsup_rv_zen4_asm_8x5,
bli_dgemmsup_rv_zen4_asm_16x6,
bli_dgemmsup_rv_zen4_asm_8x6,
bli_dgemmsup_rv_zen4_asm_8x7,
bli_dgemmsup_rv_zen4_asm_8x8
AMD-Internal: [CPUPL-3574]
Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44
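A scalar analogue of the two-accumulator scheme; the real kernels do this with ZMM registers and FMA instructions, but the dependency-breaking structure is the same:

```c
#include <assert.h>

/* Dot product with two independent accumulators: consecutive
 * multiply-adds go to alternating chains, so each add does not wait
 * on the previous one. The chains are merged only after the loop. */
static double dot_two_accum(const double *a, const double *b, int k)
{
    double acc0 = 0.0, acc1 = 0.0;
    int i = 0;
    for (; i + 1 < k; i += 2) {
        acc0 += a[i]     * b[i];      /* chain 0 */
        acc1 += a[i + 1] * b[i + 1];  /* chain 1 */
    }
    if (i < k)
        acc0 += a[i] * b[i];          /* fringe element */
    return acc0 + acc1;               /* merge the two sets */
}
```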
- Making A diagonally dominant to ensure that the problem at hand is solvable.
AMD-Internal: [CPUPL-3575]
Change-Id: I27cc76a212d4d10aacce880895e1e0d7532e4eb7
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple ways to prefetch C are supported.
- prefetch_a and prefetch_c are enabled by
default.
- The K loop is divided into multiple subloops for
better C prefetch.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
- Added an explicit function definition for ZGEMV var 1. This
removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
instead of querying it.
- With this change the fringe loop is vectorized using SSE
instructions.
AMD-Internal:[CPUPL-3997]
Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
column-major input format.
- Updated Early-return check conditions to account for trans
parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
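A scalar sketch of the store-phase selection between the s8 and u8 downscale types; the saturation bounds follow the respective type ranges, and the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int16_t-accumulated value to the selected downscale
 * type's range before the store: [0, 255] for u8, [-128, 127] for
 * s8. Only the two byte-wide cases from the commit are shown. */
static int16_t saturate_downscale(int32_t v, int to_unsigned)
{
    int32_t lo = to_unsigned ? 0   : -128;
    int32_t hi = to_unsigned ? 255 : 127;
    if (v < lo) v = lo;
    if (v > hi) v = hi;
    return (int16_t)v;
}
```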
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
logic. The kernel performs two different sets of compute based
on the value of alpha, for both unit and non-unit strides. There
are no constraints on beta scaling of the 'y' vector.
- Updated the logic to support 'x' conjugate in the computation.
The kernel supports conjugate/no conjugate operation through the
usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
intrinsics.
- Updated the early return condition in the kernel to adhere to
the standard compliance.
- Replaced the scalar computation with vector computation (using
128-bit registers) when dealing with a single element (fringe case)
in the unit-stride case, or with vectors with non-unit strides. A
single dcomplex element occupies 128 bits in memory, thereby
providing scope for this optimization.
- Added accuracy and extreme value testing with sufficient sizes
and initializations, to test the required main and fringe cases
of the computation.
AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
Details:
- The pack and compute extension APIs derive blocksizes (MR, NR, ...)
from the SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
thresholds and modifications to build system.
Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
Description:
The expf_max and expf_min thresholds have more precision than the
computation, which leads to crossing the clipping range at the
edge cases and causes NaNs in the tanh output.
Updated the thresholds to lower precision so that the edge cases
are clipped and NaNs in the tanh output are avoided.
AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4