AVX512 packing kernel supports:
1. Dcomplex datatype
2. Row- and column-major matrices
AVX512 packing kernel does not support:
1. General-stride matrices
2. Fringe cases (only multiples of 4 or 12 are supported)
3. Conjugation
scal2m will be used for the unsupported functionality above.
The AVX512 packing kernel is a column-preferred kernel:
If the matrix is row major, we need to transpose each block before storing it.
If the matrix is column major, we store it directly.
AMD-Internal: [CPUPL-3088]
Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737
- Complex AXPBY kernels gave incorrect output when both alpha and
beta had non-zero imaginary parts.
- Previously, the scalar code (used to calculate the remainder result
or non-unit increment cases) was directly accessing and updating
the y-vector pointer, resulting in an incorrect output.
Updated it to operate on a local copy of the current y element
and store the final result to the y-pointer.
- Also, added an operation to store the temporary calculation of alpha*x
in an intermediate vector, which is later added to the y vector.
AMD-Internal: [CPUPL-3037]
Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c
- Added in-register transpose support for the C matrix to
support row-stored C matrices in dgemm sup.
- Support is added for all edge-case kernels.
- FMAs are made independent of each other, for faster
computation while storing data back to the C matrix.
AMD-Internal: [CPUPL-2966]
Change-Id: I1d13af99a17ee66adbf5f537a4664ade489a7cad
- Main kernel is of size 24x8 and the associated fringe kernels
added are
- 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m
- 24x8, 24x7, 24x6, 24x5, 24x4, 24x3, 24x2, 24x1
- 16x8, 16x7, 16x6, 16x5, 16x4, 16x3, 16x2, 16x1
- 8x8, 8x7, 8x6, 8x5, 8x4, 8x3, 8x2, 8x1
- For fringe kernels, 24x? kernel handles 16 < m_remainder < 24
16x? kernel handles 8 < m_remainder <= 16
8x? kernel handles 0 < m_remainder <= 8
- Added a function 'bli_zen4_override_gemm_blkszs' to override
blocksizes and kernels to be used for SUP for supported storage
schemes.
- Updated the zen4 config to enable these kernels in zen4 path.
- Thresholds are yet to be derived.
- Updated CMakeLists.txt with DGEMM SUP kernels for windows build.
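The fringe-kernel ranges listed above can be sketched as a simple dispatch. This is an illustrative sketch, not the actual BLIS dispatch code; the function name is hypothetical and it returns the row dimension (MR) of the kernel family that handles a given m_remainder.

```c
// Maps m_remainder to the fringe-kernel family (by its MR) per the
// ranges described above.
static int fringe_kernel_mr(int m_remainder)
{
    if (m_remainder > 16) return 24; // 16 < m_remainder < 24 -> 24x? kernels
    if (m_remainder > 8)  return 16; // 8  < m_remainder <= 16 -> 16x? kernels
    if (m_remainder > 0)  return 8;  // 0  < m_remainder <= 8  -> 8x? kernels
    return 0;                        // no fringe work left
}
```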
Kernel-specific details:
- K-loop is unrolled by 8 times to facilitate prefetch of B.
- For every load of one column of A, the corresponding column in
next panel of A is prefetched with T1 hint.
- One column of C is prefetched with T0 hint per iteration of LOOP2.
- TAIL_NITER is derived to be 3.
- For every unroll of k-loop, one row of B is prefetched with T0 hint.
- C-prefetching for row-storage is yet to be added.
- B-prefetching for col-storage is yet to be added.
- Support for C transpose is yet to be added.
AMD-Internal: [CPUPL-2755], [CPUPL-2409]
Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.
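The zero-increment behaviour described above can be sketched as follows. This is a hedged illustration with a hypothetical function name, using a simple sum to show that incx == 0 makes every iteration read the same first element, effectively processing it n times.

```c
// Illustrative sketch: with incx == 0 the index expression i * incx is
// always 0, so x[0] is read on every one of the n iterations.
static double sum_with_inc(int n, const double *x, int incx)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i * incx]; // incx == 0 keeps reading x[0]
    return s;
}
```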
AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
- RBP is the base pointer, which points to the base of the current stack frame.
The ASAN tool relies on rbp and rsp for stack-related validations, so overwriting
or modifying the RBP register results in application termination with a
stack-overflow error code.
- Removed all the code snippets that were using the rbp register for prefetching matrices
and sometimes loading elements from memory in all of the gemm sup kernels for the double
datatype.
- Removed the reference to rbp from the register clobber list as well, to completely avoid
using the rbp register.
AMD-Internal: [CPUPL-2613, CPUPL-2587]
Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28
- Enabled DTRSM small mt for sizes where its performance is better
than small or native.
- Threshold tuning for the small path is updated.
- The function signature for bli_trsm_small_mt has been made similar
to bli_trsm_small so that one function pointer can be used for
all functions.
- The early-return condition in DTRSM small for sizes > 1000 has been
removed so that the sizes for which the small path is taken can be
decided at the BLAS layer instead of inside the kernel.
AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
- Added DGEMM and DTRSM row preferred micro kernels.
- DTRSM left lower and left upper micro kernels are added.
- DGEMM kernel is optimized for both row stored C and col
stored C.
AMD-Internal: [CPUPL-2745]
Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180
- In cases when incy != 1, a buffer is created for the y vector. The
contents of vector y are scaled by beta and stored in this buffer.
- After performing the compute using the ZAXPYF kernel, the results in
the y buffer are copied back to the original buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen-based
architecture, AVX2 kernels are invoked. For others, the
kernels are invoked based on the context.
- In ZSCAL2V, query for the context if a NULL pointer is passed.
AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
- The new AMAXV adheres to the BLAS definition of ISAMAX by not handling
NaN separately. In the previous kernel, NaN was considered the smallest
of all the elements in the array.
- The new logic uses two helper functions - bli_vec_absmax_double and
bli_vec_search_double.
- bli_vec_absmax_double finds the element with the largest absolute value and the
index range in which the first occurrence of this element can be found.
- bli_vec_search_double returns the index of the first occurrence of the
absolute value of an element.
- AMAXV uses these two helper functions to find the element with the largest
absolute value and then searches using bli_vec_search_double in the reduced
range provided by bli_vec_absmax_double.
- Added a condition check for n == 1 in the BLAS layer. It is an optimization
mentioned in the BLAS standard API definition.
- Removed the redundant n == 0 condition check from the kernel. This is a
BLAS exception and is already handled in the BLAS layer.
- Removed AVX2 flag check from the BLAS layer. Kernels will be picked
based on the architecture ID in the new design.
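The two-phase search described above can be sketched as follows. This is a hedged, scalar illustration: the helper names follow the commit message, but the signatures are simplified (the absmax helper here returns only the maximum value, not an index range) and none of this is the actual vectorized BLIS code.

```c
#include <math.h>

// Phase one: largest absolute value in the vector.
static double vec_absmax_double(int n, const double *x)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > m) m = fabs(x[i]);
    return m;
}

// Phase two: index of the first element whose absolute value matches.
static int vec_search_double(int n, const double *x, double target)
{
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) == target) return i; // first occurrence wins
    return -1;
}

static int amaxv_sketch(int n, const double *x)
{
    if (n == 1) return 0; // BLAS-layer shortcut mentioned above
    return vec_search_double(n, x, vec_absmax_double(n, x));
}
```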
AMD-Internal: [CPUPL-2773]
Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies being introduced in the instructions in
the presence of the extra post-ops. Adding vzeroupper at the
beginning of the ir loop in the f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (a multiply instruction) by default was resulting in
a performance regression when the k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to regress when
compiled with gcc for zen3 and older architectures supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.
AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
error function - ERF
2. Added error function based GeLU activation post-ops for the
S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame- and micro-kernel-level changes in
addition to adding the macro-based function definitions of the
ERF function in the math-utils and gelu header files.
AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
- Extended the existing support for handling beta scaling
in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The
added support ensures that NaN values initialized in C do
not propagate to the result when beta is 0.
- The support has been added to fringe cases common to, as well
as specific to, the m and n variants of the RD kernels.
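The beta-handling behaviour described above reduces, per C element, to the following sketch (illustrative code, not the RD kernel itself): when beta is 0 the element is overwritten rather than scaled, because 0 * NaN is NaN and would otherwise contaminate the result.

```c
#include <math.h>

// Illustrative per-element store: overwrite when beta == 0 so an
// uninitialized (possibly NaN) C value cannot propagate.
static double store_with_beta(double c_old, double ab, double beta)
{
    if (beta == 0.0) return ab;  // overwrite: NaN in C is ignored
    return beta * c_old + ab;    // general beta scaling
}
```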
AMD-Internal: [CPUPL-3053] [SWLCSG-1900]
Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9
- Since the definition of negative increments differs between BLAS and BLIS,
there was a bug in how the memory was accessed when copying the elements
of a vector with negative increments. Updated the code under the assumption that
when negative increments are set, the vector is accessed starting from the end.
For the BLAS interface, there is an intermediate conversion before calling into the blis layer.
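The intermediate conversion mentioned above can be sketched as a base-pointer adjustment. This is a hedged illustration with a hypothetical name: under the BLAS convention, a negative increment places the logical first element at the end of the buffer, so the pointer is moved there before walking backward.

```c
// Illustrative sketch: move the base pointer to the logical first
// element when the increment is negative (BLAS -> end-of-buffer start).
static const double *blas_to_blis_base(const double *x, int n, int incx)
{
    if (incx < 0)
        return x + (n - 1) * (-incx); // logical element 0 is at the end
    return x;
}
```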
Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285
AMD-Internal: [SWLCSG-1900]
- Increased unroll by reusing X registers that were previously used
for performing shuffles.
- Added loops with smaller increment steps for better problem
decomposition.
- Added X vector and Y vector prefetch to the kernel.
- Removed redundant code that handles fringe in incx = 1 and
incy = 1. This remainder will be performed by the loop that handles
non-unit stride cases.
- Vectorized loops that handle non-unit stride cases using SSE
instructions.
AMD-Internal: [CPUPL-2773]
Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0
- Mx4 edge kernels were overwriting the rbp
register for prefetches.
- Since rbp along with rsp defines the stack frame,
this resulted in a stack-overflow issue.
- Replaced rbp with the rdx register for prefetches.
AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
- Added a SCAL2V kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero
or incx <= 0 or incy <= 0.
- The kernel takes one of the two available paths based on
the conjugation requirement of the X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointers to the ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file to the CMake list.
AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
- Bias add, relu, parametric relu and gelu post-ops support added in all
f32 gemm micro-kernels. These post-ops are implemented for both AVX512
and AVX2 ISA based on the micro-kernel flavor. The support is added for
both row and column major cases.
- Lpgemm bench updates to support f32 post-ops.
AMD-Internal: [CPUPL-3032]
Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476
- Added k-loop unrolling by a factor of 4 to the following SGEMM
SUP RV kernels:
- 5x48, 5x32, 5x16
- 4x64, 4x48, 4x32, 4x16
- 3x48, 3x32, 3x16
- 2x64, 2x48, 2x32, 2x16
- 1x64, 1x48, 1x32, 1x16
- 6x64n, 5x64n, 3x64n, 2x64n, 1x64n
- Removed unused variables which were resulting in warnings during
compilation.
- Added a newline at the end of header files to resolve warnings
shown during compilation.
AMD-Internal: [CPUPL-3002]
Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921
-Currently lpgemm can only be built using either the zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi-machine deployment.
-The micro-kernel calls (gemm and pack) were hardcoded in the
lpgemm framework. This is removed, and a new lpgemm_cntx-based dispatch
mechanism is designed to support runtime configurability of
micro-kernels.
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
- Removed repetitive function declarations of GEMM small kernels
from the C files.
- Function declarations of these kernels exist in the header files
where the kernels are supposed to be declared.
AMD-Internal: [CPUPL-3003]
Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63
- Currently the pointer received as a function argument is
used for packing, which causes only a partial copy of the
input buffer to the output buffer due to unexpected
compiler optimizations.
- To fix this, instead of using a plain pointer for the output
buffer, we define a "restrict"-qualified local pointer variable.
- The "restrict" keyword tells the compiler that the pointer is
the only way to access the object it points to.
- By defining a "restrict" local pointer to the output
buffer, the problem of the incomplete copy has
been solved.
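The fix described above can be sketched as follows (illustrative names, not the actual packing routine): the copy runs through restrict-qualified local pointers, asserting to the compiler that the buffers do not alias.

```c
#include <stddef.h>

// Illustrative packing copy through restrict-qualified local pointers,
// so the compiler may not assume aliasing between input and output.
static void pack_block(double *out, const double *in, size_t n)
{
    double *restrict po = out;       // local restrict pointer to output
    const double *restrict pi = in;  // and to input
    for (size_t i = 0; i < n; i++)
        po[i] = pi[i];
}
```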
Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646
- Added kernels for all rv and rd variants.
- Main kernel is of size 6x64, and the associated fringe kernels
added are
- 4x64, 2x64, 1x64
- 6x32, 4x32, 2x32, 1x32
- 6x16, 4x16, 2x16, 1x16
- Updated the zen4 config to enable these kernels in zen4 path.
- Added C-prefetching to 6x? row-stored main kernels.
- C-prefetching for column storage is yet to be added.
- K-loop unrolling for fringe kernels is yet to be added.
AMD-Internal: [CPUPL-3002]
Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4
- Added a ZCOPYV kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one of the two available paths based on
the conjugation requirement of the X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointers to the ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
- 8x8 kernels are used for DTRSM SMALL.
- Matrix A (a10) is packed for GEMM operations.
- Packed matrix A will be re-used in all the column blocks
along the N-dimension.
- Diagonal elements of the A matrix are packed (a11) for
TRSM operations.
- Implemented fringe cases with the following block sizes:
8x8, 8x4, 8x3, 8x2, 8x1
4x8, 4x4, 4x3, 4x2, 4x1
3x8, 3x4, 3x3, 3x2, 3x1
2x8, 2x4, 2x3, 2x2, 2x1
1x8, 1x4, 1x3, 1x2, 1x1
AMD-Internal: [CPUPL-2745]
Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad
- Added ZSCALV, which uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
is no need to perform computation, hence return early.
- When alpha is zero, the expert interface of ZSETV is invoked. In this case,
all the elements of the input vector are set to 0.
- Invocation of the expert interface means that a NULL pointer can be passed
to the function in place of the context. The expert interface of ZSETV will
query the context and get the appropriate function pointer.
- Added a BLAS interface for ZSCALV. The architecture ID is used to decide
the function that is to be invoked.
- Created a new macro, INSERT_GENTFUNCSCAL_BLAS_C, to instantiate the SCALV
BLAS macro interface only for the single complex type and the single complex/
float mixed type.
AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
- Vectorized alpha scaling of X vector using SSE instructions. This
can be done irrespective of incx.
- Added code to prefetch the A matrix and Y vector into the L1 cache.
- Vectorized fringe case computation and non-unit stride computation
with SSE instructions.
- Increased unroll in unit stride cases for better register
utilization.
AMD-Internal: [CPUPL-2773]
Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715
Details:
- AOCL BLIS now uses an AVX512 32x6 DGEMM kernel for the native code path.
Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
implementing these optimizations.
- In the initial version of the 32x6 DGEMM kernel, to broadcast elements of packed B
we performed a load into xmm (2 elements), broadcast into zmm from xmm, and then, to get the
next element, did vpermilpd(xmm). This logic is replaced with a direct broadcast from
memory: since the elements of Bpack are stored contiguously, the first broadcast fetches
the cacheline and subsequent broadcasts happen faster. We use two registers for broadcasts
and interleave the broadcast operations with FMAs to hide any memory latencies.
- Native dTRSM uses a 16x14 dgemm, therefore we need to override the default blkszs (MR, NR, ...)
when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
for the double data-type as well in the functions bli_trsm_front() and bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR", and in bli_cntx_init_zen4() we replaced
the dgemm kernel for TRSM with the 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk - are added.
- A new 32xk packm reference kernel is added in bli_packm_cxk_ref.c and is
enabled for the zen4 config (bli_dpackm_32xk_zen4_ref()).
- Copyright year updated for modified files.
- Cleaned up code for the "zen" config - removed unused packm kernel declarations in kernels/zen/bli_kernels.h.
- [SWLCSG-1374], [CPUPL-2918]
Change-Id: I576282382504b72072a6db068eabd164c8943627
Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), so the operations involving
k0 are typecast to uint64_t to enable AOCC to generate optimized code.
Thanks to Jini Susan (jinisusan.george@amd.com) from the compiler team for suggesting
this change. A similar change was applied to the sgemm, cgemm and zgemm kernels.
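The cast described above amounts to the following sketch (illustrative, not the actual kernel code): since k0 is known to be positive, the loop-count arithmetic is done on an unsigned 64-bit copy, which permits cheaper code than signed division.

```c
#include <stdint.h>

// Illustrative: compute an unrolled-loop iteration count on an unsigned
// copy of k0 (known positive), as the commit describes.
static uint64_t k_iter_count(int64_t k0)
{
    uint64_t k = (uint64_t)k0; // k0 is always positive here
    return k / 4;              // e.g. iterations of a 4x-unrolled k-loop
}
```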
Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments
AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
-Implemented (r)ow-preferential (d)ot-product milli-kernels
(m and n variants) for the dcomplex datatype along the SUP path.
-These computational kernels extend the support for handling RRC and
CRC storage schemes along the SUP path. In the case of a BLAS api call,
this corresponds to input cases with transa equal to T and
transb equal to N.
-In case the B matrix is packed (conditionally), the inputs are
redirected to the existing (r)ow-preferential (v)ector-load-optimized
kernels due to better performance.
-Added a macro for the vhsubpd assembly instruction, to support the
arithmetic for the complex datatype in its interleaved storage.
AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
- var2m kernels are added only for the RRR case
- Main kernel is of 12x32 (AVX512), associated fringe kernels of
- 8x32, 4x32, 2x32, 1x32 (AVX512)
- 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512)
- 12x8, 8x8 (AVX2)
- 12x4, 8x4 (SSE4)
- 12x2, 8x2 (SSE4)
- existing AVX2/SSE4 kernels are used for other fringe
cases
- Currently, these kernels are not invoked in zen4 path
- Once all AVX512 kernels (n and rd) are done, invoke all of them
together in zen4 config
AMD-Internal: [CPUPL-2801]
Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [SWLCSG-1080]
Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
The HPL script was using the BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.
Fix: if this occurs, set the local number of threads based on the product of the
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.
Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.
Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag
AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
1. Check OpenMP active level against max active levels when setting
number of threads for starting a new parallel region in
./frame/thread/bli_thread.c to ensure the correct number of threads
is used when BLIS is called within nested OpenMP parallelism.
2. In subsequent BLIS calls, threading choices could be incorrectly
set based on values used and stored in global_rntm by a previous
call. This could apply when the OpenMP number of threads differ from
call to call, different nested parallelism is used in different
parts of a user's code, or different threads at the user level
request different numbers of OpenMP threads for BLIS calls.
Keep threading information in both global_rntm and a new Thread
Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
environment (as appropriate) during bli_init_auto() calls in each
BLIS routine. The details are:
* global_rntm is initialized on first BLIS call based on OpenMP and
BLIS threading environment variables.
* global_rntm is updated by any BLIS threading function calls.
* In bli_thread_update_tl(), called by bli_init_auto(), sync with
any BLIS values set or updated in global_rntm. Then, if BLIS
threading control is not used, check OpenMP ICVs and set thread
count and auto_factor appropriately.
* Setting BLIS threading locally (using expert interfaces to pass
a user defined rntm data structure) should work as before.
3. bli_thread_get_is_parallel can now only be called outside of
parallelism within BLIS routines. Change calls in trsm to reflect
this.
4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
if any BLIS_*_NT environment variables are set.
5. Set auto_factor = FALSE when the number of threads is 1.
6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.
7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().
8. For debugging, internal information on the rntm threading data can
be printed by defining "PRINT_THREADING" at the top of bli_rntm.h
9. bli_rntm_print() now also prints the value of blis_mt.
10. Function prototypes in bli_rntm.h moved to top of file, so that
bli_rntm_print() can be used within inline functions defined in
this header file.
11. Comment out bli_init_auto() and bli_finalize_auto() calls in
Fortran interfaces in frame/compat/blis/thread/b77_thread.c
12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
set_pack_b functions outside of the auto_factor if statements.
13. Misc code tidying.
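The nesting-level check in point (1) above can be sketched as follows. This is a hedged illustration: the inputs would come from omp_get_active_level() and omp_get_max_active_levels(), but the function name and structure here are hypothetical, not the bli_thread.c code.

```c
// Illustrative: only request multiple threads for a new parallel region
// while OpenMP still permits another active nesting level.
static int usable_num_threads(int requested, int active_level,
                              int max_active_levels)
{
    if (active_level < max_active_levels)
        return requested;
    return 1; // nesting limit reached; run this region sequentially
}
```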
AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.
AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
Description:
1. The n value per thread and the offset, which can vary for each thread,
were defined outside the omp loop. Moved them inside the
omp loop so that each thread computes the correct portion of the output.
2. Fixed a few compilation warnings in the dnrm2 avx2 implementation.
AMD-Internal: [CPUPL-2606]
Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
The address sanitizer reports an error when the rbp register is modified.
Register rbp was storing rs_a, which was used during prefetch
of matrix A. Usage of rbp is avoided by using the rcx register as a
temporary storage register.
Hence rcx is updated with the matrix C address before storing the
computed data.
This fix addresses the issue reported by the GEQP3 API of libflame.
AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
- Cleaned up some white space in the AVX2 implementation for DNRM2.
AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
- Implemented an optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is calculated based on the input size.
AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
Details:
1. Fixed the partial memory-access overflows for the variables
AlphaVal and ones, reported by ASAN.
2. Using a 128-bit packed broadcast with 64-bit data types
after type casting would cause garbage data to be filled
in the destination register.
3. Fixed this issue by using a set_ps instruction instead of a broadcast.
4. In cases where the n remainder is 1, extra elements were accessed, which
could cause out-of-bounds memory access. Removed the extra element access.
AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
1. Addressed uninitialized variables reported in coverity for all
datatypes of trsm small algo.
AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
1. Corrected B buffer accesses to use the buffer's offset instead of its
starting address, which is required in case of MT.
2. When num_threads > 1, the B buffer is divided into blocks in the m or n
dimension based on the side (right or left). Hence the start of each
block must be accessed via its offset.
3. Currently the B matrix is divided into blocks for each thread and the
complete matrix A is used by all threads.
In case of a design change in the future, modified A buffer accesses to use
an offset, to support partitioning of matrix A for MT.
AMD-Internal:[CPUPL-2520]
Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
Details:
1. In the sgemmsup_zen_rv_?x2 kernels, the "vmovps" instruction
was used to load the B matrix in the k loop and k last loop,
which loads 128 bits into xmm rather than 64 bits as expected.
2. Changed the vmovps instruction to vmovsd instructions,
which load only 64 bits into the xmm register.
3. Avoided C memory access by the vfma instruction when multiplying
with non-beta at corner cases, which required access to 128 bits
and leads to out-of-bounds access. Replaced with vmovq first to
get the 64-bit data, then performed vfma on the xmm register in rv_6x8m
and rv_6x4m.
AMD-Internal: [CPUPL-2472]
Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian while
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in the netlib
tests.
- So the above calculation is now done in three steps: scaling the x
vector with alpha, then calculating its product with x hermitian, and
later adding the final result to the matrix.
AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8