Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
  in config/zen4/make_defs.mk to enable BLIS to be built with
  LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
one (zen->zen2->zen3->zen4), when they should be independent
of each other. Correct by including config/zen/amd_config.mk
in all zen make_defs.mk files to reinitialize the compiler
flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
this is already specified in amd_config.mk (and should
be the default setting anyway).
* Tidy files to simplify nested if structures and be more
consistent with one another.
AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
Prevented calling the AVX2-based bli_zgemm_ref_k1_nn code on
unsupported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().
Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.
AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
GCC 11 and earlier support AVX512-BF16, but do not support all
of the bf16 intrinsics required.
For the bf16 downscale APIs, when beta scaling is performed, the C
output elements must be upconverted from bf16 type to float type
for the beta scaling operation.
For this bf16-to-float upconversion, _mm512_cvtpbh_ps is used.
This intrinsic, however, is not supported by GCC 11 and earlier
(it is supported from GCC 12 onwards), which leads to compilation
failures with _mm512_cvtpbh_ps not being recognized.
To fix this, we use the following sequence of instructions:
1. Register containing bf16 data:
   __m256bh a1;
2. Convert bf16 to float by widening each element to 32 bits and
   shifting left by 16:
   __m512 float_a1 = (__m512) _mm512_sllv_epi32(
       _mm512_cvtepi16_epi32((__m256i) a1), _mm512_set1_epi32(16));
AMD-Internal: [CPUPL-3454]
Change-Id: Ie4a9f04881c59ced088608633774b27f22b4ab8e
This change contains the following:
1. Downscale optimization fix
a. Similar to downscale optimizations made for s32 and s16 gemm,
the following optimizations are done to improve the downscale
performance for BF16 gemm
b. The store to temporary float buffer can be avoided when k < KC
since intermediate accumulation will not be required for the
pc loop (only 1 iteration). The downscaled values (bf16) are
written directly to the output C matrix.
c. Within the micro-kernel when beta != 0, the bf16 data from the
original C output matrix is loaded to a register, converted to
float and beta scaling is applied on it at register level.
This eliminates the requirement of previous design of copying the
bf16 value to the temporary float buffer inside jc loop.
2. Alpha scaling
a. Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in
bf16 micro-kernels.
b. Alpha scaling is now only done when alpha != 1.
3. K Fringe optimization
a. Previously memcpy was used for K fringe case to load elements
from A matrix in the microkernels
b. Now, masked loads are used to load the A matrix elements
without the need to use memcpy functions
4. N LT-16 fringe optimization
a. Previously memcpy was used for the N LT 16 fringe case in the
microkernels for storing the downscaled and non-downscaled output.
b. Now, masked stores are used to store the downscaled and
non-downscaled outputs of BF16 without the need to use
memcpy functions
5. Framework updates to avoid unnecessary pack buffer allocation
a. The default allocation of the temporary pack buffer is removed
and the pack buffer is now only allocated if k > KC.
AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044
- In C/Z TRSM small, packing in case of unit diagonal
is not handled properly.
- Diagonal elements are still being read even in case of
unit diagonal.
- This causes "Conditional jump or move depends on
uninitialised value" error during valgrind tests.
- To fix this, diagonal elements should not be read
in case of unit diagonal.
AMD-Internal: [CPUPL-3406]
Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e
- A missing break statement resulted in unexpected control flow:
without it, this function does not launch the threads for the API
in question according to the AOCL dynamic logic.
AMD-Internal: [CPUPL-3436]
Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698
-Currently, when any of the downscale APIs is called, a temporary pack
buffer is allocated (with bli_membrk_acquire_m) by each thread. It is
used to persist the intermediate higher-precision output accumulated by
the micro-kernel across the pc loop when the number of pc iterations is
more than 1 (k > KC). bli_membrk_acquire_m is a thread-safe operation
and uses locks (pthread_mutex) to ensure thread-safe checkout of a
memory block from the memory pool.
-However when k < KC, this temporary buffer is not required. But since
this pack buffer is allocated by default in downscale API, the overhead
from locks affects performance when k < KC, m or n is sufficiently small
and the number of threads involved is high. This default allocation is
removed and the pack buffer is now only allocated if k > KC.
AMD-Internal: [CPUPL-3430]
Change-Id: I492586ff4c47bc7480d364efb7af3674e31bd2c1
- AVX2 and AVX512 flags are set up locally for each object library that requires them.
- Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally.
- To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library.
AMD-Internal: [CPUPL-3241]
Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73
The previous commit (30b931ae60) has an incorrect ticket ID.
The correct ticket ID for that commit is:
AMD-Internal:[CPUPL-3328]
Change-Id: If3242714984ae3d3d9bbb0198bda91b4dd9a4bdc
- The code used the whitespace variant of an AVX512 mask instruction,
but some compilers accept the whitespace variant and some don't; to be
safe, the whitespace was removed.
- The whitespace variant "vmovupd (%rax,%r8,1),%zmm8{%k2} {z}" is
replaced with "vmovupd (%rax,%r8,1),%zmm8{%k2}{z}" to resolve the
compilation failure.
- Thanks to Shubham Sharma <shubham.sharma3@amd.com> for identifying
the issue.
AMD-Internal: [CPUPL-1963]
Change-Id: I290589132e8cce25cab0d1e4c195a7dd0a014937
-Micro-kernel: Some AVX512 intrinsics (e.g. _mm512_loadu_epi32) were
introduced in later versions of gcc (>10) in addition to the already
existing masked intrinsics (e.g. _mm512_mask_loadu_epi32). In order to
support compilation using gcc 9.4, either the masked intrinsic or
another gcc 9.4 compatible intrinsic (e.g. _mm512_loadu_si512) needs
to be used in the LPGEMM Zen4 micro-kernels.
-Frame: The BF16 LPGEMM APIs (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32)
need to be disabled if the aocl_gemm (LPGEMM) addon is compiled using
gcc 9.4. BF16 intrinsics are not supported in gcc 9.4, so the
micro-kernels for BF16 LPGEMM are excluded from compilation based on
the GNUC macro.
AMD-Internal: [CPUPL-3396]
Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7
A few tests failed on Windows as some registers were not added to the
clobber list.
Added the registers below to the clobber list:
In function bli_zpackm_zen4_asm_12xk : ZMM12-ZMM15
In function bli_zpackm_zen4_asm_4xk : ZMM4-ZMM7
AMD-Internal: [CPUPL-3253]
Change-Id: I3e42130bf1a3b48717c4b437179ae3f116e5cf1d
- In the bli_x86_asm_macros.h file, the set of vinsertf?x? and
vextractf?x? instructions are facing macro expansion errors due to
ambiguous macro redirection. The lower-case macro definitions of
these instructions are not properly redirected to their corresponding
upper-case macro definitions.
- This error occurs due to ambiguity in the upper-case macro name.
At the place of the lower-case macro definition, the redirection is to
macros of the form VINSERTF?x? and VEXTRACTF?x?, while at the place
of the upper-case macro definition, they are of the form VINSERTF?X?
and VEXTRACTF?X?. The differing case of the 'x' causes the upper-case
macro names to mismatch.
- This patch corrects the issue by changing the lower-case 'x' to
upper-case in the redirection targets. This provides uniformity and
facilitates the expected macro expansion.
AMD-Internal: [CPUPL-3276]
Change-Id: Id1f45f8e4bb083cd4b87632b713ff6baba616ff2
- When the number of threads launched is not equal to the
number of threads requested, the garbage values in the created
buffer will not be overwritten by valid values.
- To handle the above scenario, the created temporary buffer is
initialized with zeroes.
AMD-Internal: [CPUPL-3268]
Change-Id: I439a1da18eb1b380491fea14f42b0ede05ccf5a9
- Previously, this flag was set as a default in the top-level
CMakeLists.txt, which meant it was used to build everything: all files
and all subdirectories, including ref_kernels and the testsuite. All
files were added as target sources for this project and compiled with
the same flags.
- Now, we create object files using the sources in the kernels/
directory and add the AVX2 flag explicitly to those object files. Only
those files get this flag, and it is no longer used to compile
ref_kernels, etc.
- This is a quick solution to enable runs on non-AVX2 machines.
AMD-Internal: [CPUPL-3241]
Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650
- Partial completion of compute was happening since BLIS was unable
to launch the required number of threads. This was because rntm
was returning a thread count greater than the maximum number of
threads that can be launched in the subsequent parallel region.
- Added 'omp_get_num_threads' inside the parallel regions to get the
actual number of threads spawned. The work distribution happens
based on the actual number of threads launched in that region.
AMD-Internal: [CPUPL-3268]
Change-Id: I086ad4b9b644f966b7bab439e43222396f0c2bf0
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.
AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
- In the Zen 4 context, there was a mismatch between the fuse factor
initialized in the block size parameter and the fuse factor of the
corresponding kernel.
AMD-Internal: [SWLCSG-2051]
Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d
1. New LPGEMM types - s8s8s16os16 and s8s8s16os8 - are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files are added and modified for s8s8s16.
4. The s8s8s16 type involves design changes to 2 operations -
   Pack B and Mat Mul.
5. Pack B kernel routines pack the B matrix for the s16 FMA and
   compute the sum of every column of the B matrix, to implement the
   s8s8s16 operation using the s16 FMA instructions.
6. Mat Mul kernel files compute the GEMM output using the s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (the s16
   FMA works with A matrix type uint8 only) by adding an extra 128 to
   every A matrix element.
7. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results:
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of the extra 128 to
   every A matrix element.
8. With this change, two new APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are supported on s8s8os16/os8 also.
AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
-Softmax is often used as the last activation function in a neural
network: softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new API,
aocl_softmax_f32, is introduced as part of aocl_gemm. This API
computes the element-wise softmax of a matrix/vector of floats. It
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
threads based on logic similar to Zen3.
- A Zen4 architecture specific Native-to-SUP check has been added to
redirect a few Native inputs to the SUP path, based on the fact that
in a multi-threaded environment some Native cases perform better as
SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
subsequently.
AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
Details:
- Added Doxyfile, a configuration file in the docs directory, for generating Doxygen documentation from source files.
- Currently only the CBLAS interfaces of the extension APIs (batched gemm and gemmt) are included.
- Support for the BLAS interface is yet to be added.
- To generate Doxygen based documentation for the extension APIs, use the command below:
$ doxygen docs/Doxyfile
AMD-Internal: [CPUPL-3188]
Change-Id: I76e70b08f0114a528e86514bcb01d666acc591e8
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
architectures, but function bli_check_valid_model_id() should
check the provided model_id against the suitable range within
the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
checking of a user specified value of BLIS_ARCH_TYPE against
the enabled configurations is delayed to a separate function,
bli_arch_check_id().
- Default selection based on hardware can be overridden using the
BLIS_MODEL_TYPE environment variable. Valid values are:
Genoa, Bergamo, Genoa-X, Milan, Milan-X
Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
but will result in the default option for that architecture being
selected. This is different to specifying an incorrect value of
BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
the --rename-blis-model-type argument to configure (or cmake
equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
--rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
currently only for AMD cpus. Functions are provided to query
these from other parts of the code, namely:
uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
Details:
- Overriding of blocksizes with AVX2-specific ones (6x8) is done
for gemmt/syrk because a near-to-square shaped kernel performs
better than a skewed/rectangular shaped kernel.
- Overriding is done for the S, D and Z datatypes.
AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
-Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/
gelu_erf) is supported in the post-op struct, along with bias or
downscale. This setup was sufficient when only activation functions
were supported as eltwise post-ops. But with the introduction of the
clip post-op (a type of non-activation eltwise operation), it has
become necessary to extend the post-ops framework to support multiple
eltwise operations, often used in the form of an activation eltwise op
plus non-activation eltwise ops. The aocl post-op struct is modified
and the post-op parser is updated to support this use case.
-The lpgemm_bench is updated to support testing/benchmarking of the
multiple eltwise operations use case. The function for accuracy checking
is modified to support correctness testing irrespective of the order and
count of post-ops. Additionally the help message is updated so as to
better describe the capabilities of lpgemm_bench.
AMD-Internal: [CPUPL-3244]
Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the changes to fix it.
Existing Design:
- The AOCL progress callback pointer is a global pointer which is
shared across all threads.
Existing Design challenges:
- The callback function cannot safely disable the progress mechanism:
another thread may have already checked whether the function pointer
is set, and then re-reads the pointer upon invocation of the callback.
If one thread sets the callback to NULL in this window, the other
thread will attempt to call a null pointer as a function pointer,
leading to a segfault.
New Design:
- Each thread maintains a local copy of the progress pointer.
AMD-Internal: [SWLCSG-1971]
Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
Details:
- Added logic to display CMAKE_BUILD_TYPE while configuring
through cmake gui.
- Added logic to set values for the mutually exclusive variables
BLIS_ENABLE_JRIR_SLAB and BLIS_ENABLE_JRIR_RR.
AMD-Internal: [SWLCSG-2041, SWLCSG-2042]
Change-Id: I81c96a9941418a0810d554ddc89056ca8420b064
Corrections for spelling and other mistakes in code comments
and doc files.
AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
-Similar to downscale optimizations made for u8s8s32 gemm, the following
optimizations are made to improve the downscale performance for u8s8s16
gemm:
a. The store to the temporary s16 buffer can be avoided when k < KC,
since intermediate accumulation will not be required for the pc loop
(only 1 iteration). The downscaled values (s8) are written directly
to the output C matrix.
b. Within the micro-kernel when beta != 0, the s8 data from the original
C output matrix is loaded to a register, converted to s16 and beta
scaling applied on it. The previous design of copying the s8 value to
the s16 temporary buffer inside jc loop and using the same in beta
scaling is removed.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s16
micro-kernels. Alpha scaling is now only done when alpha != 1.
AMD-Internal: [CPUPL-3237]
Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef
- Adding doc regarding option setting for INT64 in README.
- Bugfix on template instantiation on helper function. Updated to use gtint_t instead of int.
AMD-Internal: [CPUPL-2732]
Change-Id: Ia52407a1ef3fdd06e905c2e3d4aa5befb80e82d6
1. Custom Clip is a post-op which is used to clip the
accumulated GEMM output within a certain range.
2. This post-op is implemented for u8s8s32os32/os8 and
s8s8s32os32/os8 LPGEMM types.
3. Changes are done at the microkernel level for these
2 APIs to support Clip Post-Op
AMD-Internal: [CPUPL-3207]
Change-Id: I8b4da5807de6a93711b0ae9343970c55192f75d4
Correct argument alpha in call to ZDSCAL kernel function in serial
code path. This resolves numerous instances of incorrect results
in ACML LAPACK test programs when BLIS_ARCH_TYPE=generic.
AMD-Internal: [CPUPL-3227]
Change-Id: Ibf5ee79392e80c2d93a0d336a7b0e2568e149f94
- In level-1 kernels, with multi-threading enabled, only the partial
job was getting executed.
- The bug was in bli_thread_vector_partition and occurred only
when the minimum work for a thread was >= 1, i.e., when the number of
threads launched is less than the number of elements and the number
of elements is not a multiple of the number of threads launched.
AMD-Internal: [CPUPL-3231]
Change-Id: Ie20abb93468282cd6ac2372267714fb80c26d7cc
- Added AVX512 functions to the BLAS layer of daxpy, dscal and
ddot.
- Added BLAS exceptions for incx <= 0 to DSCALV
- Added BLIS_KERNELS_ZEN4 macro check to guard AVX512 kernels
as they will not be available in other contexts.
AMD-Internal: [CPUPL-2766][CPUPL-2765][CPUPL-2793][CPUPL-2800]
Change-Id: I68860c2ff6b65624907cc1b590173f0e909bd271
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
to make sure only compiled kernels are used.
- In bla_copy and bla_swap, a missing '\' is added.
AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided the total combinations of sizes into 3 categories:
- One dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
-Currently in aocl_gemm, gelu (both tanh and erf based) computation is
only supported as a post-op as part of a low precision gemm API call
(done at the micro-kernel level). However, gelu computation alone,
without gemm, is required in certain cases for users of aocl_gemm.
-In order to support this, two new APIs - aocl_gelu_tanh_f32 and
aocl_gelu_erf_f32 - are introduced as part of aocl_gemm. These APIs
compute element-wise gelu_tanh and gelu_erf respectively of a matrix/
vector of floats. Both APIs invoke ISA specific vectorized micro-
kernels (vectorized only when incx=1), and a cntx based mechanism
(similar to lpgemm_cntx) is used to dispatch to the appropriate kernel.
AMD-Internal: [CPUPL-3218]
Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704
1. Custom Clip is an element-wise post-op which is used to
clip the accumulated GEMM output within a certain range.
2. The Clip Post-Op is used in downscaled and non-downscaled
LPGEMM APIs and SGEMM.
3. Changes are done at frame and microkernel level to implement
this post-op.
4. Different versions are implemented - AVX-512, AVX-2, SSE-2
to enable custom clipping for various LPGEMM types and SGEMM
AMD-Internal: [CPUPL-3207]
Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6
- Set the variables to zero to avoid the compiler warning
(-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
bli_trsm_small_AVX512.c
- Changed the datatype from dim_t to siz_t for i,k,j
in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
avoid the compiler warning (-Waggressive-loop-optimizations)
AMD-Internal: [CPUPL-2870]
Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
Updated the CMakeLists.txt to check whether the specified
libraries are present, aborting the CMake build if they are not.
AMD-Internal: [CPUPL-2732]
Change-Id: I90115217c228430095aa53a82dc26d16935b320f
Threading related changes
--------------------------
- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
logic for a L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
DDOTV. This function contains the AOCL dynamic logic for the
respective kernels.
- Added handling for cases when number of elements (n) is less than
number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
amount of work the calling thread is supposed to perform on a
vector.
Interface changes
-----------------
- In the BLIS implementation layer of DSCALV, ZDSCALV and AXPYV, added
logic to pick the kernel based on architecture ID and removed the
AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
and only the real part of the alpha passed is used inside the kernel.
The change was done to facilitate kernel dispatch based on arch ID.
- Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV.
Without this multithreaded code might crash because of minimum work
calculation.
Misc
-----
- Removed unused variables from ZSCAL2V and AXPYV kernels.
AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
- Added AVX512 based double and float AXPYV which will be used in
Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
- The following sequence is used to get the number of threads
for a given input:
- Divides the total range of inputs into 3 categories (m < n, m > n,
m = n).
- Each range is further divided into 4 sub-categories (K <= 32,
K <= 64, K <= 128, K > 128).
- As per the input range, the number of threads is decided.
AMD-Internal: [CPUPL-2966]
Change-Id: I0b04e9de1615e87acb189b228544afac74664f02
-As part of an earlier optimization, the memcpy function call in k
fringe ((k % 4) != 0 case, to utilize vpdpbusd instruction) and n fringe
(n < 16 - beta scale and C store) were replaced with copy macros
specifically optimized for less than 4 and 16 elements each. However
upon further analysis it was observed that masked load/broadcast and
masked store performed better on average than the copy macros. The copy
macros contained more if conditions, which resulted in more branching
and thus resulting in perf variations. It was also noted that code
generation varied a lot based on the compilers when using the copy
macros due to the extra conditional code.
-As part of this change, the copy macros are completely replaced with
masked load/broadcast/store. Performance was observed to be better and
less prone to variations for the k fringe and n fringe (< 16) cases.
AMD-Internal: [CPUPL-3173]
Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503