Added all fringe kernels with mask load-store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format
- New edge kernels that use masked load-store
instructions to handle corner cases.
- Added mask load-store instruction macros
(vmaskmovps, VMASKMOVPS) for masked load-store.
- It improves performance by reducing branching overhead
and by being more cache friendly.
- Mask load-store is added only for row storage format
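The masked load-store semantics that the new macros wrap can be illustrated with a scalar model (hypothetical helper names, not the actual vmaskmovps macros):

```c
/* Scalar model of an 8-wide masked load: elements i < n are read,
   the remaining vector lanes are zeroed (as vmaskmovps does). */
static void maskload_ps8(float vec[8], const float *src, int n)
{
    for (int i = 0; i < 8; i++)
        vec[i] = (i < n) ? src[i] : 0.0f;
}

/* Scalar model of an 8-wide masked store: only elements i < n are
   written back, so memory past the fringe is never touched. */
static void maskstore_ps8(float *dst, const float vec[8], int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = vec[i];
}
```

This is why the fringe kernels need no scalar remainder loop: the mask bounds every memory access to the valid n < 8 tail.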
AMD-Internal: [CPUPL-4041]
Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
- This commit improves performance for very small inputs
by reducing framework checks and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single-threaded computation
for such sizes.
- It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and ZEN4
code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such small inputs are
routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel.
AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
a layer higher.
- Added a scalar loop to handle compute in case of non-unit strides.
This loop ensures functionality in case packing fails at the
framework level.
AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
- Updated the final reduction of partial sums (AVX-2 code section)
to use scalar accumulation entirely, instead of the
_mm256_hadd_pd( ... ) intrinsic. This in turn changes the
associativity of the reduction step.
- Reverted to using scalar code for the fringe cases in the AVX-2
kernel for DNRM2 and DZNRM2, to improve functional correctness.
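The new reduction order can be illustrated as follows (scalar sketch; assumes four per-lane partial sums coming out of the AVX-2 loop):

```c
/* Reduce four per-lane partial sums with a fixed left-to-right
   association, instead of the pairwise order that the
   _mm256_hadd_pd intrinsic imposes. */
static double reduce_partials(const double p[4])
{
    double s = p[0];
    s += p[1];
    s += p[2];
    s += p[3];
    return s;
}
```

The numerical result may differ in the last bits from the hadd-based pairwise reduction, since floating-point addition is not associative.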
AMD-Internal: [CPUPL-4049]
Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5
- This commit implements an avx512 dgemm kernel for k=1 cases,
which gets called in the zen4 code path.
- Added an architecture check for the k=1 kernel in the dgemm code path
to pick the correct kernel based on CPU architecture, since
blis now has avx2 and avx512 dgemm kernels for the k=1 case.
- Previously in the dgemm path, the bli_dgemm_8x6_avx2_k1_nn kernel was
called irrespective of architecture type.
- Added an architecture check before calling the kernel for the k=1
case, so this kernel is invoked only on the respective architectures.
AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
- The call to bli_saxpyf_zen_int_6( ... ) is explicitly
present in the bli_gemv_unf_var2_amd.c file, as part of the
bli_sgemv_unf_var2( ... ) function. This was changed to
bli_saxpyf_zen_int_5( ... ) (thereby changing the fuse factor
from 6 to 5), in accordance with the function pointer present
in the zen3 and zen4 context files.
- Changed the accumulator type from float to double inside the
fringe loop for unit strides (vectorized path) and non-unit strides
(scalar code).
AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
-The zero point data type depends on the downscale data type. For
an int8_t downscale type, the zero point type is int8_t, whereas for a
uint8_t downscale type, it is uint8_t. During the downscale post-op, the
micro-kernels upscale the zero point from its data type (int8_t or
uint8_t) to the accumulation data type and then perform the
zero point addition. The accumulated output is then stored as the
downscaled type in a later storage phase. For the <u|s>8s8s16
micro-kernels, the upscaling to int16_t (accumulation type) was always
performed assuming the zero point is int8_t, using the
_mm256_cvtepi8_epi16 instruction. However, this results in incorrect
upscaled zero point values if the downscale type is uint8_t and the
associated zero point type is also uint8_t. This issue is corrected by
selecting the upscale instruction based on the zero point type.
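The scalar equivalent of choosing between sign extension (_mm256_cvtepi8_epi16) and zero extension (_mm256_cvtepu8_epi16) for the zero point (illustrative helper, not the micro-kernel code):

```c
#include <stdint.h>

/* Upscale a zero-point byte to the s16 accumulation type. The same
   bit pattern widens to different values depending on whether it is
   interpreted as int8_t (sign-extend) or uint8_t (zero-extend). */
static int16_t upscale_zp(uint8_t raw, int zp_is_signed)
{
    return zp_is_signed ? (int16_t)(int8_t)raw  /* vpmovsxbw */
                        : (int16_t)raw;         /* vpmovzxbw */
}
```

For example, the byte 0x90 upscales to -112 as a signed zero point but to 144 as an unsigned one, which is exactly the discrepancy this fix addresses.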
AMD-Internal: [SWLCSG-2500]
Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076
Description:
1. Updated the ERF function threshold from 3.91920590400 to 3.553
   to match the reference float erf implementation, which
   reduced errors at the borders and also clipped the output
   to 1.0
2. Updated the packa function call with the pack function ptr in the
   bf16 api to avoid compilation issues for non-avx512bf16 archs
3. Updated lpgemm bench
[AMD-Internal: SWLCSG-2423 ]
Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
- For k > KC, the C matrix was getting scaled by beta on each
iteration, but it should be scaled only once. Fixed the scaling
of the C matrix by beta in the K loop.
- Corrected A and B matrix buffer offsets for cases where k > KC.
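The corrected K-loop scaling can be sketched as follows (a minimal scalar model, not the actual blocked SUP code; names are illustrative):

```c
/* C := beta*C + alpha*A*B, with k split into panels of size KC.
   Beta must scale C only on the first panel; later panels accumulate
   with an effective beta of 1. A is m x k, B is k x n, row-major. */
static void gemm_k_panels(int m, int n, int k, int KC,
                          double alpha, const double *A, const double *B,
                          double beta, double *C)
{
    for (int pc = 0; pc < k; pc += KC) {
        int kc = (k - pc < KC) ? (k - pc) : KC;
        double beta_use = (pc == 0) ? beta : 1.0;  /* scale C once */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                /* buffer offsets advance with pc */
                for (int p = 0; p < kc; p++)
                    acc += A[i*k + pc + p] * B[(pc + p)*n + j];
                C[i*n + j] = beta_use * C[i*n + j] + alpha * acc;
            }
    }
}
```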
AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
- Updated the bli_dnormfv_unb_var1( ... ) and
bli_znormfv_unb_var1( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Added the logic to distribute the job among the threads such
that only one thread has to deal with the fringe case (if required).
The remaining threads execute only the AVX-2 code section
of the computational kernel.
- Added reduction logic after the parallel region, to handle overflow
and/or underflow conditions as per the mandate. The reduction
for both APIs involves calling the vectorized kernel of the
dnormfv operation.
- Added changes to the kernel to have the scaling factors and
thresholds prebroadcasted onto the registers, instead of
broadcasting every time on a need basis.
- Non-unit stride cases are packed to be redirected to the
vectorized implementation. In case the packing fails, the
input is handled by the fringe case loop in the kernel.
- Added the SSE implementation in the bli_dnorm2fv_unb_var1_avx2( ... )
and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
cases of size = 2 and size = 1 respectively, or non-unit strides.
AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
- This commit focuses on enhancing the performance of dgemm
for matrices with very small dimensions.
- The blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. The SUP framework code path
requires the creation and initialization of blis objects,
accessing all the needed meta-information from objects, and querying
contexts, which adds a performance penalty when computing on matrices
with very small dimensions.
- To avoid this performance penalty, the blis_dgemm_tiny function
implements lightweight support code so that it can re-use dgemm SUP
kernels in such a way that it operates directly on input buffers. It
avoids the framework overhead of creating and initializing blis
objects, context initialization, and accessing other large framework
data structures.
- The blis_dgemm_tiny function checks that a threshold condition is met
before picking the kernel. For the zen, zen2 and zen3 architectures,
the tiny kernel is invoked for any shape as long as m < 8 and k <= 1500,
or m < 1000, n <= 24 and k <= 1500. For zen4, the tiny kernel is invoked
as long as the m, n and k dimensions are each less than 1500.
- The blis_dgemm_tiny function supports only single-threaded computation
as of now.
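The threshold condition reads roughly as the following predicate (hypothetical helper; the and/or grouping is one reading of the description above):

```c
#include <stdbool.h>

typedef enum { ARCH_ZEN, ARCH_ZEN2, ARCH_ZEN3, ARCH_ZEN4 } arch_t;

/* Should this (m, n, k) be routed to the dgemm tiny kernel? */
static bool dgemm_tiny_applies(arch_t arch, int m, int n, int k)
{
    if (arch == ARCH_ZEN4)
        return m < 1500 && n < 1500 && k < 1500;
    /* zen, zen2, zen3 */
    return (m < 8 && k <= 1500) ||
           (m < 1000 && n <= 24 && k <= 1500);
}
```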
AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple ways to prefetch C are supported.
- prefetch_a and prefetch_c are enabled by
default.
- The K loop is divided into multiple subloops for
better C prefetching.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
- Added an explicit function definition for ZGEMV var 1. This
removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
instead of querying it.
- With this change, the fringe loop is vectorized using SSE
instructions.
AMD-Internal:[CPUPL-3997]
Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
Downscaling is used when the GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of the api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8 and s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, the appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for the
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.
AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
logic. The kernel performs two different sets of compute based
on the value of alpha, for both unit and non-unit strides. There
are no constraints on beta scaling of the 'y' vector.
- Updated the logic to support 'x' conjugate in the computation.
The kernel supports conjugate/no conjugate operation through the
usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
intrinsics.
- Updated the early return condition in the kernel to adhere to
the standard.
- Replaced the scalar computation with vector computation (using
128-bit registers) when dealing with a single element (fringe case)
in unit-stride, or with vectors with non-unit strides. A single
dcomplex element occupies 128 bits in memory, thereby providing scope
for this optimization.
- Added accuracy and extreme value testing with sufficient sizes
and initializations, to test the required main and fringe cases
of the computation.
AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
Description:
The expf_max and expf_min thresholds have more precision than
the computation, which leads to crossing the clipping range at
the edge cases and causes NaNs in the tanh output.
Updated the thresholds to lower precision to clip the
edge cases and avoid NaNs in the tanh output.
AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
-The downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.
AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- In GEMV variant 1, the input matrix A is in row major. X vector
has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
operation inside the kernel has been ensured by packing the input X
vector to a temporary buffer with unit stride. Currently, the
packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
- In variant 2 of GEMV, A matrix is in column major. Y vector has
to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
GEMV operation inside the kernel has been ensured by packing the
input Y vector to a temporary buffer with unit stride. As part of
the packing Y is scaled by beta to reduce the number of times Y
vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
buffer are copied to the original buffer and the temporary one is
released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
kernels will be picked from the context for non-avx machines. For
avx machines, the kernel(s) to be dispatched is(are) assigned to
the function pointer in the unf_var layer.
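The pack/compute/copy-back flow described above can be sketched as follows (hypothetical helpers; BLIS performs these steps with its SCAL2V/COPYV kernels):

```c
#include <stdlib.h>

/* Pack a strided y into a unit-stride buffer, scaling by beta on the
   way in so the GEMV kernel can accumulate without reloading y. */
static double *pack_y_scaled(const double *y, int n, int incy,
                             double beta)
{
    double *buf = malloc((size_t)n * sizeof *buf);
    if (!buf) return NULL;  /* caller falls back to the strided path */
    for (int i = 0; i < n; i++)
        buf[i] = beta * y[(size_t)i * incy];
    return buf;
}

/* Copy the computed results back to the strided y and release. */
static void unpack_y(double *y, int n, int incy, double *buf)
{
    for (int i = 0; i < n; i++)
        y[(size_t)i * incy] = buf[i];
    free(buf);
}
```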
AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
ZGEMM. The kernel is built for handling the GEMM computation
with inputs having k = 1, and the transpose values for A and
B as N.
- The kernel dimension has been changed from 4x6 to 4x4,
due to the following reasons :
- The 1xNR block of B in the n-loop can be reused over multiple
MRx1 blocks of A in the m-loop during computation. Similar
analogy exists for the fringe cases.
- Every 1xNR block of B was scaled with alpha and stored in
registers before traversing in the m-dimension. Similar change
was done for fringe cases in n-dimension.
- These registers should not be modified during compute, hence
the kernel dimension was changed from 4x6 to 4x4.
- The check for early exit (with regard to the BLAS mandate) has been
removed, since it is already present in the BLAS layer.
- The check for parallel ZGEMM has been moved post the redirection to
this kernel, since the kernel is single-threaded.
- The bli_kernels_zen.h file was updated with the new kernel signature.
AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
instructions.
- Removed the negative incx handling checks from the BLIS impli layer
of ZDSCAL, as BLAS expects an early return for incx <= 0.
AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
- The ZSCALV kernel now uses fmaddsub intrinsics instead of mul
followed by addsub intrinsics.
- Removed the negative incx handling checks from the BLAS impli
layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.
AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in the kernels/ directory.
The macro definition was not propagated from the high-level CMake, so we need to define it explicitly for the object libraries.
AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
Prevented calling the avx2-based bli_zgemm_ref_k1_nn code on
unsupported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().
Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.
AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
- In C/Z TRSM small, packing in case of unit diagonal
is not handled properly.
- Diagonal elements are still being read even in case of
unit diagonal.
- This causes "Conditional jump or move depends on
uninitialised value" error during valgrind tests.
- To fix this, diagonal elements should not be read
in case of unit diagonal.
AMD-Internal: [CPUPL-3406]
Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e
- Previously, this flag was set as a default in the high-level
CMakeLists.txt, which means it was used to build everything: all files
and all subdirectories, including ref_kernels and the testsuite. All
files were added as target sources for this project and compiled with
the same flags.
- Now, we create object files using the sources in the kernels/
directory and add the AVX2 flag explicitly to those object files. Only
those files will have this flag, and it will not be used to compile
ref_kernels, etc.
- This is a quick solution to enable runs on non-AVX2 machines.
AMD-Internal: [CPUPL-3241]
Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650
Some text files were missing a newline at the end of the file.
One has been added.
Also corrected the file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.
AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
1. New LPGEMM types - s8s8s16os16 and s8s8s16os8 - are added.
2. New interface, frame and kernel files are added.
3. Frame- and kernel-level files added and modified for s8s8s16.
4. The s8s8s16 type involves design changes to 2 operations -
 Pack B and Mat Mul.
5. Pack B kernel routines pack the B matrix for s16 FMA and compute the
sum of every column of the B matrix, to implement the s8s8s16 operation
using the s16 FMA instructions.
6. Mat Mul kernel files compute the GEMM output using s16 FMA.
 Here the A matrix elements are converted from int8 to uint8 (s16 FMA
 works with A matrix type uint8 only) by adding an extra 128 to
 every A matrix element.
7. Post GEMM computation, additional operations are performed on the
 accumulated outputs to get the correct results:
 Final C = C - ( (sum of column of B matrix) * 128 )
 This is done to compensate for the addition of the extra 128 to every
 A matrix element.
8. With this change, two new APIs are introduced in LPGEMM -
 s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are also supported on s8s8os16/os8.
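The shift-and-compensate scheme in items 6 and 7 can be sketched scalar-wise (illustrative helper; BLIS precomputes the column sums during Pack B rather than in the inner loop):

```c
#include <stdint.h>

/* s8s8 -> s16 GEMM via the u8s8 path: add 128 to every A element so
   the uint8-only s16 FMA can be used, then subtract
   128 * (column sum of B) from C to undo the shift. */
static void s8s8s16_compensated(int m, int n, int k,
                                const int8_t *A, const int8_t *B,
                                int16_t *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            int32_t acc = 0, bsum = 0;
            for (int p = 0; p < k; p++) {
                /* (a + 128) fits in uint8 and models the u8s8 FMA */
                acc  += (int32_t)((uint8_t)(A[i*k + p] + 128)) * B[p*n + j];
                bsum += B[p*n + j];  /* done at pack time in BLIS */
            }
            C[i*n + j] = (int16_t)(acc - 128 * bsum);  /* compensation */
        }
}
```

Algebraically, sum((a+128)*b) - 128*sum(b) = sum(a*b), so the compensated result equals the true signed product.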
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32, is introduced as part of aocl_gemm. This api
computes the element-wise softmax of a matrix/vector of floats. It
invokes ISA-specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to a temporary s16 buffer can be avoided when k < KC,
since intermediate accumulation is not required for the pc loop (only
1 iteration). The downscaled values (s8) are written directly to the
output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when the k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of a low precision gemm api call
(done at the micro-kernel level). However, gelu computation alone,
without gemm, is required in certain cases for users of aocl_gemm.
-In order to support this, two new api's - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 - are introduced as part of aocl_gemm. These api's
compute element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both api's invoke ISA-specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. Custom Clip is an element-wise post-op which is used to
clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
this post-op.
4. Different versions are implemented - AVX-512, AVX-2 and SSE-2 -
to enable custom clipping for the various LPGEMM types and SGEMM.
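Per element, the clip post-op is just a bounded min/max on the accumulated output (scalar sketch of what the AVX-512/AVX-2/SSE-2 versions vectorize):

```c
/* Custom clip post-op: bound each accumulated output to [lo, hi]. */
static float clip_f32(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}
```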
AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i,k,j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
Threading related changes
--------------------------
- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
logic for an L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
DDOTV. This function contains the AOCL dynamic logic for the
respective kernels.
- Added handling for cases when the number of elements (n) is less
than the number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
amount of work the calling thread is supposed to perform on a
vector.
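A bli_thread_vector_partition style split can be sketched as follows (hypothetical signature; the exact remainder policy in BLIS may differ):

```c
/* Split n elements across nt threads: the first (n % nt) threads get
   one extra element, so per-thread work differs by at most one. */
static void vector_partition(int n, int nt, int tid,
                             int *start, int *len)
{
    int base = n / nt;
    int rem  = n % nt;
    *len   = base + (tid < rem ? 1 : 0);
    *start = tid * base + (tid < rem ? tid : rem);
}
```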
Interface changes
-----------------
- In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick
kernel based on architecture ID and removed AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
and only the real part of the alpha passed is used inside the kernel.
The change was done to facilitate kernel dispatch based on arch ID.
- Added an n <= 0 BLAS exception in the BLAS layer of DAXPYV and
DDOTV. Without this, multithreaded code might crash because of the
minimum work calculation.
Misc
-----
- Removed unused variables from ZSCAL2V and AXPYV kernels.
AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
Modify code to correct some warning messages from GCC 12.2 or
AOCC 4.0:
- Increase size of nbuf in blastest/f2c/endfile.c
- Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c
and kernels/zen/1/bli_axpyv_zen_int10.c
- Remove extraneous parentheses in frame/compat/bla_trsm_amd.c
and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c
- Add __attribute__ ((unused)) to several variables in
frame/1m/packm/bli_packm_struc_cxk.c and
frame/1m/packm/bli_packm_struc_cxk_md.c
AMD-Internal: [CPUPL-2870]
Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
6861fcae91
AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
- Complex AXPBY kernels gave incorrect output when both alpha and
beta had non-zero imaginary parts.
- Previously, the scalar code (used to calculate the remainder result
or non-unit increment cases) was directly accessing and updating
the y-vector pointer, resulting in an incorrect output.
Updated it to operate on a local copy of the current y element
and store the final result to the y pointer.
- Also, added an operation to store the temporary calculation of
alpha*x in an intermediate vector, which is later added to the y
vector.
AMD-Internal: [CPUPL-3037]
Change-Id: Iddbd3000dcb1505b444b0ad41ab881b055842e1c
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.
AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
- Enabled DTRSM small mt for sizes where performance is better
than small or native.
- Threshold Tuning for small path is updated.
- Function signature for bli_trsm_small_mt has been made similar
to bli_trsm_small so that one function pointer can be used for
all functions.
- Early return condition in DTRSM small for sizes > 1000 has been
removed, so that the sizes for which the small path is to be taken
can be decided at the bla layer instead of inside the kernel.
AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
- In cases when incy != 1, a buffer is created for the y vector. The
contents of vector y are scaled by beta and stored in this buffer.
- After performing the compute using the ZAXPYF kernel, the results in
the y buffer are copied back to the original buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
based architecture, AVX2 kernels are invoked. For others, the
kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.
AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
- The new AMAXV adheres to the BLAS definition of ISAMAX by not
handling NaN separately. In the previous kernel, NaN was considered
the smallest element of all the elements in the array.
- The new logic uses two helper functions - bli_vec_absmax_double and
bli_vec_search_double.
- bli_vec_absmax_double finds the absolute largest element and the index
range in which the first occurrence of this element can be found.
- bli_vec_search_double returns the index of the first occurrence of the
absolute value of an element.
- AMAXV uses these two helper functions to find the absolute largest
element and then searches using bli_vec_search_double in the reduced
range provided by bli_vec_absmax_double.
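The two-phase scheme can be sketched as scalar functions (hypothetical signatures; the real helpers additionally narrow the search to an index range for the vectorized path):

```c
#include <math.h>

/* Phase 1: the largest absolute value in x (unit stride). NaN is not
   treated specially, per the BLAS ISAMAX definition. */
static double vec_absmax(const double *x, int n)
{
    double mx = fabs(x[0]);
    for (int i = 1; i < n; i++)
        if (fabs(x[i]) > mx) mx = fabs(x[i]);
    return mx;
}

/* Phase 2: first index whose absolute value equals the maximum. */
static int vec_search(const double *x, int n, double mx)
{
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) == mx) return i;
    return -1;
}
```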
- Added a condition check for n == 1 in the BLAS layer. It is an
optimization mentioned in the BLAS standard API definition.
- Removed redundant n == 0 condition check from the kernel. This is a
BLAS exception and is already done in the BLAS layer.
- Removed AVX2 flag check from the BLAS layer. Kernels will be picked
based on the architecture ID in the new design.
AMD-Internal: [CPUPL-2773]
Change-Id: Ida2dae84a60742e632dc810ab1b7b80fc354e178
1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the
error function - ERF
2. Added error function based GeLU activation post-ops for the
S32, S16 and BF16 (LPGEMM) and SGEMM APIs.
3. Changes for this include frame- and micro-kernel-level changes in
 addition to adding the macro-based function definitions of the
 ERF function in the math-utils and gelu header files.
AMD-Internal: [CPUPL-3036]
Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be