More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick the efficient code path for ZEN5.
AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
This reverts commit a028108cbb.
Reason for revert: With libgomp, scalability issues were observed with a
higher number of threads, leading to the use of fewer
threads. However, with different OpenMP libraries
like libomp, this scalability issue was not observed,
and using fewer threads resulted in performance loss.
The AOCL dynamic logic has been updated to select a
higher number of threads, considering the iomp OpenMP
library.
Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f
- Added a conditional check to invoke the vectorized
DCOPYV kernels directly (fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal for
single-threaded execution. Thus, we avoid the call to
bli_nthreads_l1() function to set the ideal number of threads.
- Used macros to protect the declaration of fast_path_thresh in
DAXPYV API to avoid compiler warnings.
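A minimal sketch of the fast-path check described above. The names copyv(), copyv_kernel(), pick_num_threads() and FAST_PATH_THRESH, and both threshold values, are hypothetical stand-ins and not the AOCL-BLAS code; pick_num_threads() only plays the role of a query such as bli_nthreads_l1().

```c
#include <stddef.h>

/* Hypothetical threshold: below this, one thread is assumed to be ideal. */
#define FAST_PATH_THRESH 1000

/* Counts calls to the thread-count query, only to make the effect visible. */
int nthreads_queries = 0;

/* Stand-in for a framework query such as bli_nthreads_l1(). */
static int pick_num_threads(size_t n)
{
    nthreads_queries++;
    return (n < 100000) ? 1 : 4;
}

/* Stand-in for a vectorized copy kernel (unit stride only). */
static void copyv_kernel(size_t n, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i];
}

void copyv(size_t n, const double *x, double *y)
{
    /* Fast path: small inputs go straight to the kernel. */
    if (n < FAST_PATH_THRESH) {
        copyv_kernel(n, x, y);
        return;
    }

    /* Regular path: query the thread count, then run the kernel.
       A real implementation would split the work across nt threads. */
    int nt = pick_num_threads(n);
    (void)nt;
    copyv_kernel(n, x, y);
}
```

In this sketch a call such as copyv(4, x, y) never reaches pick_num_threads(), which is the overhead the fast-path is meant to avoid.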
AMD-Internal: [CPUPL-4875][CPUPL-5895]
Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb
- Mixed precision datatypes use a modified cntx.
- For some variants of mixed precision, complex and real blocksizes
need to be the same. This is achieved by creating a local copy of
cntx and copying complex blocksizes onto real blocksizes.
- When dynamic blocksizes are used, they overwrite the blocksize
changes made for mixed precision.
- This mismatch between complex and real blocksizes causes an issue
where the pack buffer is allocated based on complex blocksizes but
the amount of data packed is based on real blocksizes.
- This makes the pack buffer sizes smaller than the required sizes.
- To fix this, dynamic blocksizes are disabled for mixed precision.
AMD-Internal: [CPUPL-6384]
Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83
- Reduced the blocking size of 'bli_ddotv_zen_int10'
kernel from 40 elements to 20 elements for better
utilization of vector registers
- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
kernel with 'if' conditions to handle remainder
iterations, as only a single iteration is needed when the
remainder is less than the primary unroll factor.
- Added a conditional check to invoke the vectorized
DDOTV kernels directly (fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal
for single-threaded execution. Thus, we avoid the
call to bli_nthreads_l1() function to set the ideal
number of threads.
- Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10'
kernel.
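A scalar sketch of the remainder handling described above, assuming an illustrative unroll factor of 8 (not the kernel's actual blocking) and unit strides. The function name dotv() is hypothetical; the real kernel uses vector intrinsics.

```c
#include <stddef.h>

double dotv(size_t n, const double *x, const double *y)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    /* Primary loop: 8 elements per iteration. */
    for (; i + 8 <= n; i += 8) {
        acc0 += x[i]     * y[i]     + x[i + 4] * y[i + 4];
        acc1 += x[i + 1] * y[i + 1] + x[i + 5] * y[i + 5];
        acc2 += x[i + 2] * y[i + 2] + x[i + 6] * y[i + 6];
        acc3 += x[i + 3] * y[i + 3] + x[i + 7] * y[i + 7];
    }

    /* Fewer than 8 elements remain, so each step below can run at most
       once: an 'if' is enough where a 'for' loop was used before. */
    if (i + 4 <= n) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
        i += 4;
    }
    if (i + 2 <= n) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
        i += 2;
    }
    if (i < n)
        acc0 += x[i] * y[i];

    /* Reduce the independent accumulators. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```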
AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
- Replaced the switch case with if-else; the lookup table for the switch case
is placed at the end of the .text section, which causes a huge jump.
- Reduced number of branches for tiny sizes.
- The cpuid query is slow, so added a new if statement which avoids the cpuid
query for tiny sizes (<200).
- Redirected tiny sizes to AVX2 kernel.
AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
- This patch reverts the previous changes that removed the enforcement
of dgemm inputs under a certain threshold to be processed by kernels
selected based on architecture ID and handled in single-threaded mode.
- Such small inputs are now forced onto the tiny path. Without this
check they were routed to the SUP path, causing a performance
regression due to framework overhead.
AMD-Internal: [CPUPL-5927]
Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize these kernels. This
is currently instantiated for 'Z' datatype (double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, it is extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
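A minimal sketch of what such a lookup-table entry type and lookup function could look like. All names (gemm_ukr_ft, tiny_gemm_entry_t, kern_set_t, tiny_gemm_lookup) and the kernel signature are hypothetical, the kernels are empty stubs, and the blocking values are only illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel signature, far simpler than a real SUP kernel. */
typedef void (*gemm_ukr_ft)(size_t m, size_t n, size_t k,
                            const double *a, const double *b, double *c);

/* One table entry: kernel pointer, blocking dimensions, storage preference. */
typedef struct {
    gemm_ukr_ft ukr;
    size_t      mr;
    size_t      nr;
    bool        row_pref;
} tiny_gemm_entry_t;

typedef enum { KERN_AVX2 = 0, KERN_AVX512 = 1, KERN_COUNT } kern_set_t;

/* Empty stubs standing in for the real kernels. */
static void ukr_avx2_stub(size_t m, size_t n, size_t k,
                          const double *a, const double *b, double *c)
{
    (void)m; (void)n; (void)k; (void)a; (void)b; (void)c;
}

static void ukr_avx512_stub(size_t m, size_t n, size_t k,
                            const double *a, const double *b, double *c)
{
    (void)m; (void)n; (void)k; (void)a; (void)b; (void)c;
}

/* Lookup table; the blocking values here are illustrative only. */
static const tiny_gemm_entry_t tiny_gemm_table[KERN_COUNT] = {
    [KERN_AVX2]   = { ukr_avx2_stub,   3, 4, false },
    [KERN_AVX512] = { ukr_avx512_stub, 12, 4, false },
};

/* Lookup function: returns NULL for an unknown kernel set. */
const tiny_gemm_entry_t *tiny_gemm_lookup(kern_set_t id)
{
    if ((unsigned)id >= (unsigned)KERN_COUNT)
        return NULL;
    return &tiny_gemm_table[id];
}
```

With this layout, supporting another datatype would amount to adding a table and its thresholds, which is the extensibility argument made above.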
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
AVX2 kernels for Zen1/2/3 architectures. These kernels are written
from the ground up and are independent of fused kernels.
- The DGEMV primary kernel processes the calculation in chunks of
8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
kernels, which are invoked by the primary kernel as needed.
- Implemented the kernels by computing the dot product of matrix A
columns with vector x in chunks of 32 elements, storing the results
in accumulator registers. Fringe elements are handled in chunks
of 16, 8, etc. The data in the accumulator registers is then reduced
and added to vector y.
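A scalar sketch of the column blocking described above, assuming column-major A and unit-stride x and y. The names gemv_t() and gemv_t_cols() are hypothetical; the 32-element vector chunks and their 16/8/... fringes are collapsed into a plain loop here.

```c
#include <stddef.h>

/* Handles up to 8 columns: one accumulator per column holds the dot
   product of that column with x, then is scaled and added to y. */
static void gemv_t_cols(size_t m, size_t ncols, double alpha,
                        const double *a, size_t lda,
                        const double *x, double *y)
{
    double acc[8] = { 0.0 };

    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < ncols; j++)
            acc[j] += a[i + j * lda] * x[i];

    for (size_t j = 0; j < ncols; j++)
        y[j] += alpha * acc[j];
}

/* y := y + alpha * A^T * x for an m x n column-major A. */
void gemv_t(size_t m, size_t n, double alpha,
            const double *a, size_t lda,
            const double *x, double *y)
{
    size_t j = 0;

    /* Primary path: chunks of 8 columns. */
    for (; j + 8 <= n; j += 8)
        gemv_t_cols(m, 8, alpha, a + j * lda, lda, x, y + j);

    /* Fringe path: the remaining 1 to 7 columns. */
    if (j < n)
        gemv_t_cols(m, n - j, alpha, a + j * lda, lda, x, y + j);
}
```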
AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
incurred when the single-threaded path is optimally performant.
AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
We want bli_thread_get_num_threads() and bli_thread_get_*_nt()
to report the threading values modified to reflect what will
be in effect given OpenMP nesting and active levels. This was
lost in commit 0c6d006225 for
bli_thread_get_num_threads() and wasn't previously implemented
in bli_thread_get_*_nt()
AMD-Internal: [CPUPL-6168]
Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd
- Added a conditional check to invoke the vectorized
DAXPYV kernels directly (fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal for
single-threaded execution. Thus, we avoid the call to
bli_nthreads_l1() function to set the ideal number of threads.
AMD-Internal: [CPUPL-4878]
Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb
- Further updated the thresholds for entry to ZGEMM small
path (AVX2), when the execution is multithreaded. The newer
thresholds account for skinnier inputs, compatible with
single-threaded small path, as opposed to multithreaded
SUP path.
AMD-Internal: [CPUPL-6040][CPUPL-5930]
Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6
Create and export Fortran interfaces for bli_thread_get_num_threads()
and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs.
bli_thread_get_is_parallel() is intended for internal BLIS usage, so
no Fortran interface is added for it at this time.
AMD-Internal: [CPUPL-6168]
Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1
When compiling with config generic (or any non-zen build),
the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny()
is only used within amd specific file, bli_tiny_gemm.c has been renamed
to bli_tiny_gemm_amd.c to reflect its specific usage.
Thanks to Smyth, Edward<edward.smyth@amd.com> for identifying and helping to fix the issue.
Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307
- Bug : The current {S/D}AMAXV AVX512 kernels produced
incorrect results with multiple absolute maximums.
They returned the last index when there were multiple occurrences,
instead of the first one.
- Implemented a bug-fix to handle this issue on these AVX512
kernels. Also ensured that the kernels are compliant with
the standard when handling exception values.
- Further optimized the code by decoupling the search for
the maximum element from the search for its index. This way,
we use lower-latency instructions to compute the maximum
first.
- Updated the unit-tests, exception value tests and early return
tests for the API to ensure code-coverage.
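A scalar sketch of the decoupled approach, assuming unit stride and inputs without NaN (exception-value handling is deliberately left out). The function name amaxv_first() is hypothetical.

```c
#include <stddef.h>

static double abs_val(double v)
{
    return (v < 0.0) ? -v : v;
}

/* Returns the index of the FIRST element with maximum absolute value.
   Assumes the input holds no NaN; returns 0 for an empty vector. */
size_t amaxv_first(size_t n, const double *x)
{
    /* Pass 1: compute only the maximum absolute value. */
    double maxval = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = abs_val(x[i]);
        if (a > maxval)
            maxval = a;
    }

    /* Pass 2: search for the first index holding that value. */
    for (size_t i = 0; i < n; i++)
        if (abs_val(x[i]) == maxval)
            return i;

    return 0;
}
```

Because the index search stops at the first match, repeated maxima resolve to the lowest index, which is the behaviour the bug-fix restores.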
AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
- Scoped some of the variables used in zgemm_blis_impl()
when determining the thresholds to small path. These
variables will be used only when the architecture is
ZEN5 or ZEN4.
AMD-Internal: [CPUPL-5895]
Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13
In preparation for merging next group of changes from upstream BLIS,
move some BLAS extension APIs to new extra subdirectories in
frame/compat and frame/compat/cblas/src. Other extension APIs will
be moved in later commits.
Some tidying up to better match upstream BLIS code has also been done.
AMD-Internal: [CPUPL-2698]
Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb
Change usage of global_rntm and tl_rntm to eliminate the need
for mutex operations when accessing global_rntm. Usage of
these data structures is now as follows:
* global_rntm is set once during bli_init_apis and includes
all getenv calls to check BLIS threading and error printing
environment variables. global_rntm is then read-only.
* tl_rntm is initialized once from global_rntm on each
application thread. Any calls to BLIS set threading/ways
APIs will update tl_rntm for that application thread only
(Previously they updated global_rntm for all application threads).
* Re-initialize info_value in tl_rntm in every call to bli_init APIs.
* In bli_rntm_init_from_global() we initialize the local (per API
call) rntm as a copy of tl_rntm and then update threading values
in bli_thread_update_rntm_from_env() to reflect the current status
of OpenMP runtime ICVs.
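A minimal sketch of this usage pattern with C11 thread-local storage. The struct is reduced to a single field and every function name here (init_once, get_tl_rntm, set_num_threads, rntm_init_from_global) is a simplified stand-in, not the BLIS implementation.

```c
#include <stdbool.h>

/* Greatly reduced stand-in for the runtime structure. */
typedef struct {
    int num_threads;
} rntm_t;

/* Written once during initialization, read-only afterwards. */
rntm_t global_rntm;

/* One copy per application thread, so no mutex is needed. */
static _Thread_local rntm_t tl_rntm;
static _Thread_local bool   tl_rntm_ready = false;

/* Stand-in for the one-time init that reads environment variables. */
void init_once(int env_num_threads)
{
    global_rntm.num_threads = env_num_threads;
}

/* Lazily initializes this thread's copy from the read-only global. */
static rntm_t *get_tl_rntm(void)
{
    if (!tl_rntm_ready) {
        tl_rntm = global_rntm;
        tl_rntm_ready = true;
    }
    return &tl_rntm;
}

/* Setter APIs touch only the calling thread's copy. */
void set_num_threads(int nt)
{
    get_tl_rntm()->num_threads = nt;
}

/* Per-call rntm starts as a copy of the thread-local one. */
rntm_t rntm_init_from_global(void)
{
    return *get_tl_rntm();
}
```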
AMD-Internal: [CPUPL-6168][SWLCSG-3143]
Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6
- Added new DTRSM kernels for right and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
for solving GEMM subproblem.
- Tuned thresholds to pick the efficient code path for ZEN5.
AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
- Updated the threshold check for ZGEMM small path to include
runtime checks for redirection, specific to the micro-architecture.
- The current ZGEMM small path has only its AVX2 variant available.
After an AVX512 variant (same or different algorithm) is implemented,
the thresholds will be fine-tuned further.
- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
context. It will be used as part of handling RRC and CRC storage
schemes of C, A and B matrices in both single-thread and multi-thread
runs.
AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
1. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
2. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
3. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
4. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
This patch also changes the tiny gemm interface: it adds a light
interface for calling kernels and removes calls to avx2 dgemm kernels,
since avx512 dgemm kernels are used for all sizes on zen4 and zen5.
For zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny
kernel cannot handle such inputs, so they are routed to the
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
bli_nthreads_optimum is exported and called directly by AOCL libFLAME,
however it was only defined if building a multithreaded BLIS library
with AOCL_DYNAMIC enabled. Change to always define this function. If
BLIS is serial or if AOCL_DYNAMIC is disabled, this function returns
without modifying the supplied rntm.
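A minimal sketch of the "always defined" pattern. The name nthreads_optimum_sketch(), its reduced signature, the SKETCH_* build flags and the heuristic are all hypothetical and do not mirror the real bli_nthreads_optimum().

```c
/* Greatly reduced stand-in for the runtime structure. */
typedef struct {
    int num_threads;
} rntm_sketch_t;

/* Always defined, so an external caller links in every build. */
void nthreads_optimum_sketch(long m, long n, rntm_sketch_t *rntm)
{
#if defined(SKETCH_ENABLE_THREADING) && defined(SKETCH_ENABLE_DYNAMIC)
    /* Illustrative heuristic only: tiny problems run on one thread. */
    if (m * n < 10000)
        rntm->num_threads = 1;
#else
    /* Serial build or dynamic logic disabled: leave rntm untouched. */
    (void)m;
    (void)n;
    (void)rntm;
#endif
}
```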
Change-Id: Ie65690e9e6ec2a8ea77b3778f96676a68e6260be
- AVX512 specific DGEMV native kernels are added for Zen4/5
architectures to handle the NO_TRANSPOSE cases and are independent of
the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
beta scaling of y vector within the kernel itself and handle cases
where n is less than 5:
- bli_dgemv_n_zen_int_32x8n_avx512( ... )
- bli_dgemv_n_zen_int_32x4n_avx512( ... )
- bli_dgemv_n_zen_int_32x2n_avx512( ... )
- bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
m-dimension and for this kernel beta scaling is handled beforehand
within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
using fused kernel, namely AXPYF, to perform the GEMV operation.
AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
- Added a kernel selection logic based on the input
dimension(runtime parameter), to choose between
deploying AVX2 or AVX512 computational kernel for
single-thread execution.
- An empirical analysis was conducted to arrive at the
thresholds, for ZEN4 and ZEN5 architectures.
- Updated the fast-path threshold for ZEN4 to be in line
with the tipping points of its dynamic thread-setter (used
when AOCL_DYNAMIC is enabled).
AMD-Internal: [CPUPL-5937]
Change-Id: I96d7f167658c9e25a0098c4c67e12e4ba673e228
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
- Generic kernel is used if N is not a multiple of NR
or M is not a multiple of MR.
- This limits the maximum value of NR that can be used.
- Support for fringe case handling is added in DGEMM
macro kernel so that macro kernel can be used for
all problem sizes.
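A scalar sketch of fringe handling inside a macro kernel, assuming unpacked column-major operands and illustrative block sizes MR = 4 and NR = 3. The names macro_kernel() and micro_kernel() are hypothetical; the real code works on packed panels and an assembly micro kernel.

```c
#include <stddef.h>

/* Illustrative register block sizes. */
static const size_t MR = 4;
static const size_t NR = 3;

/* C[m_cur x n_cur] += A[m_cur x k] * B[k x n_cur], column-major. */
static void micro_kernel(size_t m_cur, size_t n_cur, size_t k,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    for (size_t j = 0; j < n_cur; j++)
        for (size_t i = 0; i < m_cur; i++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] += sum;
        }
}

/* Walks C in MR x NR tiles; edge tiles get the leftover sizes, so
   m and n need not be multiples of MR and NR. */
void macro_kernel(size_t m, size_t n, size_t k,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double *c, size_t ldc)
{
    for (size_t j = 0; j < n; j += NR) {
        size_t n_cur = (n - j < NR) ? (n - j) : NR;
        for (size_t i = 0; i < m; i += MR) {
            size_t m_cur = (m - i < MR) ? (m - i) : MR;
            micro_kernel(m_cur, n_cur, k,
                         a + i, lda,
                         b + j * ldb, ldb,
                         c + i + j * ldc, ldc);
        }
    }
}
```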
AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
In the function bli_thread_update_rntm_from_env(), a mutex is used for reading global_rntm:
"bli_pthread_mutex_lock( &global_rntm_mutex );". This causes a regression when the application is
multithreaded. The regression is due to these mutexes. Imagine a scenario where
two threads are launched: one thread acquires the mutex and the second thread stalls until the mutex is
freed by the first thread. As a result, the second thread is slower to arrive at the OpenMP barrier
in the application, thereby increasing the OpenMP barrier overhead.
Things get worse as more threads are launched.
Thanks to rocHPL for sharing the standalone panelfact application to reproduce this issue.
Thanks to @Edward Smyth (edward.smyth@amd.com) for finding this bug.
[SWLCSG-3143]
- Added the appropriate CBLAS wrappers for CROTG, CSROT,
ZROTG and ZDROT APIs. These would internally call their
?_blis_impl() layer.
AMD-Internal: [CPUPL-5813]
Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97
- Optimized DGEMM macro kernel does not
support mixed precision.
- This kernel was being used for solving
some of the mixed precision problems.
- Currently only ( bli_obj_elem(A) == 8 ) is used for checking
if the problem being solved is mixed precision.
- bli_obj_elem(A) will be equal to 8 for both double precision
data type and mixed precision case single-complex.
- Added extra checks (bli_obj_is_real( a )) to make sure that
A and B are real and DGEMM macro kernel is being used only
for DDDGEMM.
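A minimal sketch of why the element-size check alone is ambiguous. The enum dt_t and the helpers elem_size(), is_real(), size_only_check() and can_use_dgemm_macro_kernel() are hypothetical stand-ins for the object queries named above.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { DT_FLOAT, DT_DOUBLE, DT_SCOMPLEX, DT_DCOMPLEX } dt_t;

/* Element size in bytes for each datatype. */
static size_t elem_size(dt_t dt)
{
    switch (dt) {
    case DT_FLOAT:    return 4;
    case DT_DOUBLE:   return 8;
    case DT_SCOMPLEX: return 8;   /* two floats: same size as a double */
    case DT_DCOMPLEX: return 16;
    }
    return 0;
}

static bool is_real(dt_t dt)
{
    return dt == DT_FLOAT || dt == DT_DOUBLE;
}

/* Old check: true for double AND for single-complex. */
bool size_only_check(dt_t a)
{
    return elem_size(a) == 8;
}

/* New check: both operands must be 8-byte real types. */
bool can_use_dgemm_macro_kernel(dt_t a, dt_t b)
{
    return elem_size(a) == 8 && is_real(a) &&
           elem_size(b) == 8 && is_real(b);
}
```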
AMD-Internal: [CPUPL-5804]
Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiply all vector elements by alpha. This has
different behaviour for propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
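A scalar sketch of the negative-N convention and of why the two alpha=0 behaviours differ. The function name scalv_sketch() and its signature are hypothetical, and unit stride is assumed.

```c
#include <stddef.h>

/* Scales x by alpha. A negative n means "length is -n, and do NOT
   replace the scaling with a set-to-zero when alpha is 0". */
void scalv_sketch(long n, double alpha, double *x)
{
    int    allow_setv = (n >= 0);
    size_t len        = (size_t)((n < 0) ? -n : n);

    if (alpha == 0.0 && allow_setv) {
        /* SETV behaviour: NaN and Inf inputs are overwritten by 0. */
        for (size_t i = 0; i < len; i++)
            x[i] = 0.0;
        return;
    }

    /* Multiply behaviour: 0 * NaN and 0 * Inf stay NaN. */
    for (size_t i = 0; i < len; i++)
        x[i] *= alpha;
}
```

With alpha = 0, a NaN input element becomes 0 when n is positive but remains NaN when the same length is passed as a negative n.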
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
The _blis_impl layer provides a BLAS-like API for use in builds
where BLAS and CBLAS interfaces are not desirable. This patch
generates interfaces in uppercase and with and without trailing
underscores, to match what is generated for the regular BLAS
interface.
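A minimal sketch of generating the name variants with a macro. The routine dasum_blis_impl() shown here is a simplified illustration that assumes incx > 0, and the exact set of generated symbol names is an assumption, not taken from the patch.

```c
/* Illustrative BLAS-like routine with Fortran-style arguments. */
double dasum_blis_impl(const int *n, const double *x, const int *incx)
{
    double sum = 0.0;
    for (int i = 0; i < *n; i++) {
        double v = x[i * (*incx)];
        sum += (v < 0.0) ? -v : v;
    }
    return sum;
}

/* Generates one forwarding wrapper under the given symbol name. */
#define GEN_DASUM_WRAPPER(name)                                   \
    double name(const int *n, const double *x, const int *incx)  \
    {                                                             \
        return dasum_blis_impl(n, x, incx);                       \
    }

/* Lowercase with underscore, uppercase, uppercase with underscore. */
GEN_DASUM_WRAPPER(dasum_blis_impl_)
GEN_DASUM_WRAPPER(DASUM_BLIS_IMPL)
GEN_DASUM_WRAPPER(DASUM_BLIS_IMPL_)
```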
AMD-Internal: [CPUPL-5650]
Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2
- Reverted the change done for tuning the ddotv API. When the number of
threads is specified using BLIS_IC_NT or BLIS_JC_NT, ... the number of
threads is not calculated and as a result its value is -1.
OpenMP threads are launched with the value -1, which results in a crash.
This bug is fixed by correctly calculating the number of threads.
AMD-Internal: [SWLCSG-3028][CPUPL-5689]
Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a
Different Zen processors may have a 512-bit, 256-bit or 128-bit
FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows
a selection of FP512 or FP256 width in BIOS settings. Add cpuid
code to detect the width and store an indication of it in the
global variable bli_fp_datapath. This should be accessed internally
via the function bli_cpuid_query_fp_datapath(). This functionality
is currently only enabled on x86_64 platforms and only currently
reports a value for AMD CPUs.
Also add Zen3 as a fallback path for any unknown AMD processors if
AVX512 is not supported or has been disabled.
AMD-Internal: [CPUPL-4415]
Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a
- Bug : For non-zen architectures, {D/C/Z}AXPBY had
incorrect datatypes passed when querying the computational
kernel from context. The right datatype is now passed to
each variant.
- Bug : For ZAXPY, a NULL context was passed to the kernel
when using the single-threaded path. If the kernel were to
use the context, this would be an issue.
We now pass the context instead of a null pointer.
AMD-Internal: [CPUPL-5643]
Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.
AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
- Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel)
for zen5 does not support alpha scaling. Alpha scaling is
supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24).
- Optimized macro kernel expects alpha scaling to be done during
packing. The packing kernel used for mixed precision does not support
alpha scaling. Therefore, the optimized Zen5 macro kernel is not
compatible with existing packing logic.
- Changes have been made to use the generic macro kernel, which in turn
uses the zen5 micro kernel (which supports alpha scaling) for mixed precision.
AMD-Internal: [CPUPL-5058]
Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.
Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.
AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
- C <- alpha * op(A) * op(B) + beta * C.
C(n x n) - A(n x k) * B(k x n)
For ZEN4 and ZEN5
DGEMM is col-preferred kernel
DGEMMT = DGEMM + DGEMMT
DGEMM is col-preferred and DGEMMT is row-preferred.
DGEMM is evaluated as C = A*B (all col-storage)
whereas DGEMMT is evaluated as C = B * A (row-storage).
When A is packed it is packed as row-panels with col-stored elements.
So DGEMM is evaluated as C = A*B (A is col-stored) it aligns
with col-stored preference.
For DGEMMT: C = B * A, here A will become col-stored
because of packing and as a result it will break the DGEMMT
kernel assumption that A is row-storage.
- Fixed this by disabling this optimization for ZEN4
and ZEN5.
AMD-Internal: [CPUPL-5542]
Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379
- Added reference kernel for dgemv that handles computation for tiny
sizes (m < 8 && n < 8).
- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
row/column storage schemes as well as transpose and no transpose
cases.
- Added additional unit-tests for functional verification.
AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
- Avoid performance degradation of dscalv for ST when OpenMP is enabled
by using fast-path to skip the overhead caused by 'bli_nthreads_l1'
function if the input size is less than a particular threshold.
- Replaced 'bli_thread_vector_partition' work distribution function
with 'bli_thread_range_sub'.
AMD-Internal: [CPUPL-5522]
Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f