Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.
AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
sum of every column of B matrix to implement the s8s8s16 operation
using the s16 FMA instructions.
5. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
Here the A matrix elements are converted from int8 to uint8 (s16 FMA
works with A matrix type uint8 only) by adding extra 128 to
every A matrix element
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
Final C = C - ( (sum of column of B matrix) * 128 )
This is done to compensate for the addition of extra 128 to every
A matrix elements
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s16os16 and s8s8s16os8.
8. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to temporary s16 buffer can be avoided when k < KC since
intermediate accumulation will not required for the pc loop (only 1
iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
1. Custom Clip is a post-op which is used to clip the
accumulated GEMM output within a certain range.
2. This post-op is implemented for u8s8s32os32/os8 and
s8s8s32os32/os8 LPGEMM types.
3. Changes are done at the microkernel level for these
2 APIs to support Clip Post-Op
AMD-Internal: [CPUPL-3207]
Change-Id: I8b4da5807de6a93711b0ae9343970c55192f75d4
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided total combinations of sizes into 3 categories:
- one dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of low precision gemm api call (done
at micro-kernel level). However gelu computation alone without gemm is
required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's
computes element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both the api's invokes ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. Custom Clip is an element-wise post-op which is used to
clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
this post-op.
4. Different versions are implemented - AVX-512, AVX-2, SSE-2
to enable custom clipping for various LPGEMM types and SGEMM
AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i,k,j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
Threading related changes
--------------------------
- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
logic for a L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
DDOTV. This function contains the AOCL dynamic logic for the
respective kernels.
- Added handling for cases when number of elements (n) is less than
number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
amount of work the calling thread is supposed to perform on a
vector.
Interface changes
-----------------
- In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick
kernel based on architecture ID and removed AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
and only the real part of the alpha passed is used inside the kernel.
The change was done to facilitate kernel dispatch based on arch ID.
- Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV.
Without this multithreaded code might crash because of minimum work
calculation.
Misc
-----
- Removed unused variables from ZSCAL2V and AXPYV kernels.
AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
- Added AVX512 based double and float AXPYV which will be used in
Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
-As part of an earlier optimization, the memcpy function call in k
fringe ((k % 4) != 0 case, to utilize vpdpbusd instruction) and n fringe
(n < 16 - beta scale and C store) were replaced with copy macros
specifically optimized for less than 4 and 16 elements each. However
upon further analysis it was observed that masked load/broadcast and
masked store performed better on average than the copy macros. The copy
macros contained more if conditions, which resulted in more branching
and thus resulting in perf variations. It was also noted that code
generation varied a lot based on the compilers when using the copy
macros due to the extra conditional code.
-As part of this change, the copy macros are completely replaced with
masked load/broadcast/store. Performance was observed to be better and
less prone to variations for the k fringe and n fringe (< 16) cases.
AMD-Internal: [CPUPL-3173]
Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503
- 8x8 kernels are used for DTRSM SMALL
- Matrix A(a10) is packed for GEMM operations.
- Packed martix A will be re-used in all the col-block
along N-dimension.
- Diagonal elements of A matrix are packed(a11) for
TRSM operations.
- Implemented fringe cases with below block sizes
8x8, 8x4, 8x3, 8x2, 8x1
4x8, 4x4, 4x3, 4x2, 4x1
3x8, 3x4, 3x3, 3x2, 3x1
2x8, 2x4, 2x3, 2x2, 2x1
1x8, 1x4, 1x3, 1x2, 1x1
AMD-Internal: [CPUPL-2745]
Change-Id: I5bb57501f6d3783eb654e375d63901467dd14734
- Added AVX512 based double and float DOTV which will be used in
Zen4 context.
- Added n <= 0 check to the BLAS layer of SDOTV.
- Modified BLAS framework of float DOTV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2800]
Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702
- Added AVX512 based double and float SCALV which will be used in
Zen4 context.
- Added incx <= 0 check and alpha == 1 check to the BLAS layer of
SSCAL.
- Modified BLAS framework of float SCAL to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2766],[CPUPL-2765]
Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc
- on windows 24xk kernel is compiled without avx512 flag which
causes out of bounds writes for DTRSM.
- to fix this avx512 flag has been added to the CMakeLists.txt
file for 24xk kernel.
AMD-Internal: [CPUPL-3186]
Change-Id: I0314dea88302fc4964a303853a4b9b719ecd8064
Modify code to correct some warning messages from GCC 12.2 or
AOCC 4.0:
- Increase size of nbuf in blastest/f2c/endfile.c
- Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c
and kernels/zen/1/bli_axpyv_zen_int10.c
- Remove extraneous parentheses in frame/compat/bla_trsm_amd.c
and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c
- Add __attribute__ ((unused)) to several variables in
frame/1m/packm/bli_packm_struc_cxk.c and
frame/1m/packm/bli_packm_struc_cxk_md.c
AMD-Internal: [CPUPL-2870]
Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d
- Implemented 12x4m column preferential SUP kernels(main and fringe
cases). The main kernel dimension is 12x4, and the associated fringe
kernel dimensions are : 12x3m, 12x2m, 12x1m
8x4, 8x3, 8x2, 8x1
4x4, 4x3, 4x2, 4x1
2x4, 2x3, 2x2, 2x1.
- Included in-register transposition support for C, thus extending
the storage scheme supports to CCC, CCR, RCC and RCR inside the
milli-kernel.
- Integrated conditional packing of A onto the SUP front end for
dcomplex datatype. This redirects RRC and CRC storage schemes
onto the preceding set of SUP kernels through storage scheme
transformation(RCC and CCC respectively).
- Updated the zen4 context file with the new set of SUP kernels, to
get enabled appropriately. Furthermore, the context file was updated
with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.
- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
k loop in all the kernels.
- Work in progress to include conjugate support and input spectrum
extension for the AVX-512 SUP kernels. The current thresholds in zen4
context is the same as that of the zen3 thresholds for ZGEMM SUP.
AMD-Internal: [CPUPL-3122]
Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
-The n fringe micro kernels uses only a few zmm registers for computing
the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24
used in 6x64). This results in lot of wasted registers that if utilized
can help increase the MR dimension and thus improve the reuse of
registers loaded with B. Based on this concept, the existing n fringe
kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted
that the maximum number of registers are not used, since it results in
cache inefficient code due to the increase in MR and thus more
broadcasts required from unpacked A matrix.
-Compiler flag updates for AOCC build to generate loops with 64 byte
alignment. This has been observed to improve performance slightly when
k dimension is small.
AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
1. New LPGEMM type - S8S8S32/S8 is added.
2. New interface, frame and kernel files are added.
3. Frame and kernel files added/modified for S8S8S32/S8 have
2 operations - Pack B and Mat Mul
4. Pack B kernel routines to pack B matrix for VNNI and compute the sum
of every column of B matrix to implement the S8S8S32 operation using
the VNNI instructions.
5. Mat Mul Kernel files to compute the GEMM output using the VNNI.
Here the A matrix elements are converted from int8 to uint8 (VNNI
works with A matrix type uint8 only).
6. Post GEMM computation, additional operations are performed on the
accumulated outputs to get the correct results.
7. With this change, two new LPGEMM APIs are introduced in LPGEMM -
s8s8s32os32 and s8s8s32os8.
8. All previously added post-ops are supported on S8S8S32/S8 also.
AMD-Internal: [CPUPL-3154]
Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3
- Kernel block size is 12x4
- Updated the zen4 config to enable these kernels in zen4 path.
- Tuned MC,KC,NC for better performance for m/n/k size > 500
- Updated CMakeLists.txt with ZGEMM kernels for windows build.
Kernel supports:
1. Preload and prebroadcast of A and B
2. Prefecth of C Matrix
3. K loop is sub divided in to multiple loops to maintain distance between c prefetchs.
4. Special case when alpha/beta imag component is zero
5. Row/Col/General stride of Matrix C
AMD-Internal: [CPUPL-2998]
Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440
AVX512 optimised kernel for Double datatype supports
row and column major matrix
Packing kernel is column major implementation
If matrix is row major, we need to transpose block before storing it.
If matrix is column major, we directly store
AMD-Internal: [CPUPL-2966]
Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9
AVX512 packing kernel supports:
1. Dcomplex datatype
2. Row and column major matrix
AVX512 packing kernel doesnot support:
1. General stride matrix
2. Fringe cases(only multiplies of 4 or 12 is supported)
3. Conjugate is not supported
scal2m will be used for above unsupported functionality
AVX512 packing kernel is column preferred kernel
If matrix is row major, we need to transpose block before storing it.
If matrix is column major, we directly store it
AMD-Internal: [CPUPL-3088]
Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737
- Complex AXPBY kernels gave incorrect output when both alpha and
beta had non-zero imaginary parts.
- Previously, the scalar code (used to calculate remainder result
or non-unit increment cases) was directly accessing and updating
the y-vector pointer thus, resulting in an incorrect output.
Updated it to operate on a local copy of the currect y element
and store the final result to the y-pointer.
- Also, added operation to store temporary calculation of alpha*x
in an intermediate vector and then later added to the y vector.
AMD-Internal: [CPUPL-3037]
Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c
- Added in-register transpose support for c matrix to
support row stored C matrix for dgemm sup.
- Support is added for all edge case kernels.
- FMA are made independent of each other, for faster
computation while storing data back to C matrix.
AMD-Internal: [CPUPL-2966]
Change-Id: I1d13af99a17ee66adbf5f537a4664ade489a7cad
- Main kernel is of size 24x8 and the associated fringe kernels
added are
- 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m
- 24x8, 24x7, 24x6, 24x5, 24x4, 24x3, 24x2, 24x1
- 16x8, 16x7, 16x6, 16x5, 16x4, 16x3, 16x2, 16x1
- 8x8, 8x7, 8x6, 8x5, 8x4, 8x3, 8x2, 8x1
- For fringe kernels, 24x? kernel handles 16 < m_remainder < 24
16x? kernel handles 8 < m_remainder <= 16
8x? kernel handles 0 < m_remainder <= 8
- Added a function 'bli_zen4_override_gemm_blkszs' to override
blocksizes and kernels to be used for SUP for supported storage
schemes.
- Updated the zen4 config to enable these kernels in zen4 path.
- Thresholds are yet to be derived.
- Updated CMakeLists.txt with DGEMM SUP kernels for windows build.
Kernel-specific details:
- K-loop is unrolled by 8 times to facilitate prefetch of B.
- For every load of one column of A, the corresponding column in
next panel of A is prefetched with T1 hint.
- One column of C is prefetched with T0 hint per iteration of LOOP2.
- TAIL_NITER is derived to be 3.
- For every unroll of k-loop, one row of B is prefetched with T0 hint.
- C-prefetching for row-storage is yet to be added.
- B-prefetching for col-storage is yet to be added.
- Support for C transpose is yet to added.
AMD-Internal: [CPUPL-2755], [CPUPL-2409]
Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.
AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
- RBP is base pointer which points to base of current stack frame.
ASAN tool rely on rbp and rsp for stack related validations. So over-writting
or modifying RBP register results in application termination with the error code
of stack overflow.
- Removed all the code snippets which were using rbp register for prefetching matrices
and sometimes loading elements from memory in all of the gemm sup kernels for double
datatype.
- Removed reference to rbp from register clobber list as well to completely avoid the
usage of rbp register.
AMD-Internal: [CPUPL-2613, CPUPL-2587]
Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28
- Enabled DTRSM small mt for sizes where performance is better
than small or native.
- Threshold Tuning for small path is updated.
- Function signature for bli_trsm_small_mt has been made similar
to bli_trsm_small so that one function pointer can be used for
all functions.
- Early return condition in DTRSM small for sizes > 1000 has been
removed so that the sizes for which small path to take can be
decided on bla layer instead of inside kernel.
AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
- Added DGEMM and DTRSM row preferred micro kernels.
- DTRSM left lower and left upper micro kernels are added.
- DGEMM kernel is optimized for both row stored C and col
stored C.
AMD-Internal: [CPUPL-2745]
Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180
- In cases when incy != 1, a buffer is created for y vector. The
contents of vector y is scaled by beta and stored in this buffer.
- After performing the compute using ZAXPYF kernel, the results in
y buffer memory is copied back to the orginal buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
based architecture, AVX2 kernels are invoked. For other, the
kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.
AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
- The new AMAXV adheres to the BLAS definition of ISAMAX by not handle
NaN separately. In the previous kernel, NaN is considered the smallest
element of all the elements in the array.
- The new logic uses two helper functions - bli_vec_absmax_double and
bli_vec_search_double.
- bli_vec_absmax_double finds the absolute largest element and the index
range in which the first occurence of this element can be found.
- bli_vec_search_double returns the index of the first occurence of the
absolute value of an element.
- AMAXV uses these two helper functions to find the absolute largest
element and then searches using bli_vec_search_double in the reduced
range provided by bli_vec_absmax_double.
- Added condition check for n == 1 in BLAS layer. It is an optimization
mention in the BLAS standard API definition.
- Removed redundant n == 0 condition check from the kernel. This is a
BLAS exception and is already done in the BLAS layer.
- Removed AVX2 flag check from the BLAS layer. Kernels will be picked
based on the architecture ID in the new design.
AMD-Internal: [CPUPL-2773]
Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.
AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
error function - ERF
2. Added error function based GeLU activation post-ops for the
S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this includes frame and micro-kernel level changes in
addition to adding the marco based function definations of the
ERF function in the math-utils and gelu headerfiles.
AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be
- Extended the existing support for handling beta scaling
in the fringe cases of 3x4 RD kernels in ZGEMM SUP. The
added support ensures that NaN values initialized in C do
not propogate to the result when beta is 0.
- The support has been added to fringe cases common to, as well
as specific to the m and n variants of the RD kernels.
AMD-Internal: [CPUPL-3053] [SWLCSG-1900]
Change-Id: I8e617ac505144c3ea3a70556413d264f11dfc9a9
- Since the definition of negative increments is different between BLAS and BLIS,
there was bug on how the memory was accessed when we were copying the elements
of a vector with negative increments. Updated the code under the assumption that
when negative increments are set, the vector is being accessed starting from the end.
For the BLAS interface, there is an intermediate conversion before calling into the blis layer.
Change-Id: I08343472b418733fad6f7add9e90aa96cdf68285
AMD-Internal: [SWLCSG-1900]
- Increased unroll by reusing X registers that was previously used
for performing shuffle.
- Added loops with smaller increment steps for better problem
decomposition.
- Added X vector and Y vector prefetch to the kernel.
- Removed redundant code that handles fringe in incx = 1 and
incy = 1. This remainder will be performed by the loop that handles
non-unit stride cases.
- Vectorized loops that handle non-unit stride cases using SSE
instructions.
AMD-Internal: [CPUPL-2773]
Change-Id: Ifb5dc128e17b4e21315789bfaa147e3a7ec976f0
- Mx4 edge kernels were overwriting rbp
registers for prefetches.
- Since rbp along with rsp defines stack frame,
it resulted in stack overflow issue.
- Replaced rbp with rdx register for prefetches.
AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero
or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file from the CMAKE list.
AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
- Bias add, relu, parametric relu and gelu post-ops support added in all
f32 gemm micro-kernels. These post-ops are implemented for both AVX512
and AVX2 ISA based on the micro-kernel flavor. The support is added for
both row and column major cases.
- Lpgemm bench updates to support f32 post-ops.
AMD-Internal: [CPUPL-3032]
Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476
- Added k-loop unrolling by a factor of 4 to the following SGEMM
SUP RV kernels:
- 5x48, 5x32, 5x16
- 4x64, 4x48, 3x32, 4x16
- 3x48, 3x32, 3x16
- 2x64, 2x48, 2x32, 2x16
- 1x64, 1x48, 1x32, 1x16
- 6x64n, 5x64n, 3x64n, 2x64n, 1x64n
- Removed unused variables which were resulting in warnings during
compilation.
- Added a newline at the end of header files to resolve warnings
shown during compilation.
AMD-Internal: [CPUPL-3002]
Change-Id: Iab6cf329f6d7fbd7544b5c8837e493069e8c9921
-Currently lpgemm can only be built using either zen3 or zen4 config.
The lpgemm kernel code is re-structured to support amdzen, and thus
multi machine deployment.
-The micro-kernel calls (gemm and pack) are currently hardcoded in the
lpgemm framework. This is removed and a new lpgemm_cntx based dispatch
mechanism is designed to support runtime configurability for
micro-kernels.
AMD-Internal: [CPUPL-2965]
Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef
- Removed repetitive function declaration of GEMM small kernels
from the C files
- Function declaration of these kernels exist in the header files
where the kernels are supposed to be declared.
AMD-Internal: [CPUPL-3003]
Change-Id: Ic10e66691c0742ce519bcc3fe4a12ec5c5052b63
- Currently the pointer received as function argument is
used for packing which causes only a partial copy of
input buffer to output buffer due to strange optimizations
by compiler.
- To fix this, instead of using a normal pointer for output
buffer, we define a "restrict" local pointer variable.
- "restrict" keyword tells the compiler that the pointer is
the only way to access the object pointed by the pointer.
- By defining "restrict" local pointer pointing to output
buffer, the mysterious problem of incomplete copy has
been solved.
Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646