- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
A_next prefetch, B_next prefetch and C prefetch.
- B prefetch, B_next prefetch and C prefetch
are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecessary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed a kernel-list string substitution error by adding the function substitute_words to the configure script. Previously, if the string contained both zen and zen2 and zen had to be replaced with another string, zen2 was also incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
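The configure fix listed in the merge above ("Fixed configure script bug") is about whole-word substitution in a kernel list. The idea can be sketched in C (configure itself is a shell script). `replace_word` is a hypothetical helper written for illustration, not the script's substitute_words function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the space-separated list `list` into `out`,
   replacing every token that is exactly equal to `from` with `to`.
   Tokens that merely start with `from` (e.g. "zen2" when from is "zen")
   are left alone. Returns 0 on success, -1 if `out` is too small. */
static int replace_word(const char *list, const char *from, const char *to,
                        char *out, size_t out_size)
{
    size_t used = 0;
    const char *p = list;

    if (out_size == 0) return -1;
    out[0] = '\0';

    while (*p != '\0') {
        /* Skip separators between tokens. */
        while (*p == ' ') p++;
        if (*p == '\0') break;

        /* Length of the current token. */
        size_t len = strcspn(p, " ");
        const char *tok = p;
        size_t tok_len = len;

        /* Substitute only on an exact, whole-token match. */
        if (len == strlen(from) && strncmp(p, from, len) == 0) {
            tok = to;
            tok_len = strlen(to);
        }

        /* One extra byte for the separator unless this is the first token. */
        size_t need = tok_len + (used > 0 ? 1 : 0);
        if (used + need + 1 > out_size) return -1;
        if (used > 0) out[used++] = ' ';
        memcpy(out + used, tok, tok_len);
        used += tok_len;
        out[used] = '\0';

        p += len;
    }
    return 0;
}
```

A plain substring replacement of "zen" would also rewrite the prefix of "zen2"; comparing token lengths before comparing contents avoids that.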
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple ways to prefetch C are supported.
- prefetch_a and prefetch_c are enabled by
default.
- K loop is divided into multiple subloops for
better C prefetch.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
Details:
- Pack and compute extension APIs derive blocksizes (MR, NR, ...) from
the SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
thresholds and modifications to build system.
Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
enabled.
Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.
AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.
AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
1. Two CGEMM function pointers are added for different storage schemes
1. bli_cgemmsup_rv_zen_asm_3x8m
2. bli_cgemmsup_rv_zen_asm_3x8n
2. In previous commit:
(Level-3 triangular routines now use different block sizes and kernels
Commit Id: 79e174ff0a)
1. bli_cntx_set_l3_sup_tri_kers cntx function was created
2. Function holds optimised function pointers for GEMMT/SYRK APIs
3. It avoids overriding default block sizes, which improves
performance
4. This function did not include optimised CGEMM function pointers
leading to regression as reference kernels were invoked
3. With this commit, 2 optimized CGEMM function pointers are added in
bli_cntx_set_l3_sup_tri_kers
1. This fixes the regression as optimized CGEMM functions are invoked
AMD-Internal: [CPUPL-3831] [CPUPL-3830]
Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e
- TRSM and GEMM have different blocksizes in zen4. To accommodate
this, a local copy of cntx was previously created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
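The lookup rules described in the TRSM blocksize commit above can be sketched as follows. Everything here (`toy_cntx`, the enum names, the helper functions) is a hypothetical stand-in for illustration; the real cntx_t and its accessors in bli_cntx.h are considerably richer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for BLIS types and ids. */
typedef void (*toy_ukr_fn)(void);

enum { TOY_GEMM_UKR = 0, TOY_GEMM_FOR_TRSM_UKR = 1, TOY_NUM_UKRS = 2 };
enum { TOY_FAMILY_GEMM = 0, TOY_FAMILY_TRSM = 1 };

typedef struct {
    toy_ukr_fn ukrs[TOY_NUM_UKRS]; /* NULL means "not set" */
    int gemm_mr;                   /* default blocksize */
    int trsm_mr;                   /* stored separately, never overrides gemm_mr */
} toy_cntx;

/* Placeholder kernels so the sketch is self-contained. */
static void toy_gemm_kernel(void) {}
static void toy_gemm_for_trsm_kernel(void) {}

/* TRSM macro kernels use the TRSM-specific GEMM microkernel if it is set,
   and fall back to the regular GEMM microkernel otherwise. */
static toy_ukr_fn toy_gemm_ukr_for_trsm(const toy_cntx *c)
{
    if (c->ukrs[TOY_GEMM_FOR_TRSM_UKR] != NULL)
        return c->ukrs[TOY_GEMM_FOR_TRSM_UKR];
    return c->ukrs[TOY_GEMM_UKR];
}

/* Packing picks its blocksize from the operation family, mirroring the
   "family" argument added to bli_packm_init_pack. */
static int toy_pack_mr(const toy_cntx *c, int family)
{
    return family == TOY_FAMILY_TRSM ? c->trsm_mr : c->gemm_mr;
}
```

Keeping both sets of blocksizes side by side means nothing has to be overridden and restored around a TRSM call, which is the motivation the commit gives for dropping the local cntx copy.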
Details:
- Eliminated the need for override function in SUP for GEMMT/SYRK.
- New set of block sizes, kernels and kernel preferences
are added to cntx data structure for level-3 triangular routines.
- Added supporting functions to set and get the above parameters from cntx.
- Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
In case they are not set, use the default block sizes/kernels of
Level-3 SUP.
AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
in config/zen4/make_defs.mk to enable BLIS to be built with
LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
one (zen->zen2->zen3->zen4), when they should be independent
of each other. Correct by including config/zen/amd_config.mk
in all zen make_defs.mk files to reinitialize the compiler
flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
this is already specified in amd_config.mk (and should
be the default setting anyway).
* Tidy files to simplify nested if structures and be more
consistent with one another.
AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
- In Zen 4 context, there was a mismatch between the fuse factor
initialized in the block size parameter and fuse factor of the
corresponding kernel initialized.
AMD-Internal: [SWLCSG-2051]
Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
threads based on logic similar to Zen3.
- Zen4 architecture-specific Native-to-SUP check has been added to
redirect a few Native inputs to the SUP path, because in a
multi-threaded environment some Native cases perform better as SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
subsequently.
AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
architectures, but function bli_check_valid_model_id() should
check the provided model_id against the suitable range within
the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
checking of a user specified value of BLIS_ARCH_TYPE against
the enabled configurations is delayed to a separate function,
bli_arch_check_id().
- Default selection based on hardware can be overridden using the
BLIS_MODEL_TYPE environment variable. Valid values are:
Genoa, Bergamo, Genoa-X, Milan, Milan-X
Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
but will result in the default option for that architecture being
selected. This is different to specifying an incorrect value of
BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
the --rename-blis-model-type argument to configure (or cmake
equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
--rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
currently only for AMD cpus. Functions are provided to query
these from other parts of the code, namely:
uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
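The BLIS_MODEL_TYPE parsing rules in the commit above (case-insensitive names, -X also accepted as _X or X, unknown values falling back to the default) can be sketched as below. The enum and function names are hypothetical and are not the library's model_t or its parsing code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration; TOY_MODEL_DEFAULT stands for "use the
   default option for the architecture". */
typedef enum {
    TOY_MODEL_DEFAULT = 0,
    TOY_MODEL_GENOA,
    TOY_MODEL_BERGAMO,
    TOY_MODEL_GENOA_X,
    TOY_MODEL_MILAN,
    TOY_MODEL_MILAN_X
} toy_model_t;

static toy_model_t toy_parse_model_type(const char *value)
{
    /* All accepted spellings, in lower case. */
    static const struct { const char *name; toy_model_t model; } table[] = {
        { "genoa",   TOY_MODEL_GENOA   },
        { "bergamo", TOY_MODEL_BERGAMO },
        { "genoa-x", TOY_MODEL_GENOA_X },
        { "genoa_x", TOY_MODEL_GENOA_X },
        { "genoax",  TOY_MODEL_GENOA_X },
        { "milan",   TOY_MODEL_MILAN   },
        { "milan-x", TOY_MODEL_MILAN_X },
        { "milan_x", TOY_MODEL_MILAN_X },
        { "milanx",  TOY_MODEL_MILAN_X },
    };
    char lower[16];
    size_t len;

    /* Unset or over-long values are not errors: use the default. */
    if (value == NULL) return TOY_MODEL_DEFAULT;
    len = strlen(value);
    if (len >= sizeof lower) return TOY_MODEL_DEFAULT;

    /* Lower-case copy, including the terminating NUL. */
    for (size_t i = 0; i <= len; i++)
        lower[i] = (char)tolower((unsigned char)value[i]);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(lower, table[i].name) == 0)
            return table[i].model;

    /* Unknown names also fall back to the default. */
    return TOY_MODEL_DEFAULT;
}
```

A caller would typically pass the result of getenv("BLIS_MODEL_TYPE") (from stdlib.h), which is NULL when the variable is unset.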
Details:
- Overriding of blocksizes with AVX2-specific ones (6x8) is done
for gemmt/syrk because a near-square kernel performs
better than a skewed/rectangular kernel.
- Overriding is done for S, D and Z datatypes.
AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
Details:
- Added a new function for choosing between SUP and
native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided total combinations of sizes into 3 categories:
- one dimension is small
- Two dimensions are small
- All dimensions are small
- Added different threshold conditions for each of the
categories.
AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
- Added AVX512 based double and float AXPYV which will be used in
Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
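The early-return checks added to the SAXPY BLAS layer in the commit above can be sketched with a scalar reference loop. `toy_saxpy` is a hypothetical function, not the library's interface; it returns 1 if the update was performed and 0 if it returned early, purely to make the behaviour observable.

```c
/* Hypothetical scalar reference for y := alpha * x + y. */
static int toy_saxpy(long n, float alpha,
                     const float *x, long incx,
                     float *y, long incy)
{
    /* Nothing to compute: empty vector, or alpha == 0 leaves y unchanged. */
    if (n <= 0 || alpha == 0.0f) return 0;

    /* BLAS convention: a negative increment walks the vector backwards,
       starting from its last logical element. */
    long ix = incx < 0 ? (1 - n) * incx : 0;
    long iy = incy < 0 ? (1 - n) * incy : 0;

    for (long i = 0; i < n; i++) {
        y[iy] += alpha * x[ix];
        ix += incx;
        iy += incy;
    }
    return 1;
}
```

Handling these cases before dispatch means the vectorized kernel never has to special-case them.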
- Added AVX512 based double and float DOTV which will be used in
Zen4 context.
- Added n <= 0 check to the BLAS layer of SDOTV.
- Modified BLAS framework of float DOTV to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2800]
Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702
- Added AVX512 based double and float SCALV which will be used in
Zen4 context.
- Added incx <= 0 check and alpha == 1 check to the BLAS layer of
SSCAL.
- Modified BLAS framework of float SCAL to remove flag check and
pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
BLIS_KERNELS_ZEN4 macro.
AMD-Internal: [CPUPL-2766],[CPUPL-2765]
Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc
- Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for
SGEMM. This can be updated once GEMMT specific optimization are added
for AVX-512.
- Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override
blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for
SGEMM operation.
AMD-Internal: [CPUPL-3060]
Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a
- Implemented 12x4m column preferential SUP kernels(main and fringe
cases). The main kernel dimension is 12x4, and the associated fringe
kernel dimensions are : 12x3m, 12x2m, 12x1m
8x4, 8x3, 8x2, 8x1
4x4, 4x3, 4x2, 4x1
2x4, 2x3, 2x2, 2x1.
- Included in-register transposition support for C, thus extending
the supported storage schemes to CCC, CCR, RCC and RCR inside the
milli-kernel.
- Integrated conditional packing of A onto the SUP front end for
dcomplex datatype. This redirects RRC and CRC storage schemes
onto the preceding set of SUP kernels through storage scheme
transformation(RCC and CCC respectively).
- Updated the zen4 context file with the new set of SUP kernels, to
get enabled appropriately. Furthermore, the context file was updated
with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.
- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
k loop in all the kernels.
- Work is in progress to include conjugate support and input spectrum
extension for the AVX-512 SUP kernels. The current thresholds in the zen4
context are the same as the zen3 thresholds for ZGEMM SUP.
AMD-Internal: [CPUPL-3122]
Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
-The n fringe micro kernels use only a few zmm registers for computing
the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24
used in 6x64). This leaves many registers unused; putting them to work
allows a larger MR dimension and thus better reuse of the
registers loaded with B. Based on this concept, the existing n fringe
kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). Note
that the maximum number of registers is not used, since that results in
cache inefficient code due to the increase in MR and thus more
broadcasts required from the unpacked A matrix.
-Compiler flag updates for AOCC build to generate loops with 64 byte
alignment. This has been observed to improve performance slightly when
k dimension is small.
AMD-Internal: [CPUPL-3173]
Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca
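The register arithmetic behind the n fringe kernel change above can be worked through with a small helper. It assumes 16 32-bit lanes per 512-bit zmm register, which matches the "6x16 uses 6 zmm registers" figure in the commit; `toy_accum_zmm_count` is a hypothetical name.

```c
/* Number of zmm registers needed to hold an MR x NR output tile,
   assuming 32-bit elements (16 lanes per 512-bit register). */
static int toy_accum_zmm_count(int mr, int nr)
{
    const int lanes = 16;
    return mr * ((nr + lanes - 1) / lanes);
}
```

With this count, 6x64 needs 24 registers and 6x16 only 6, while the enlarged fringe tiles 12x16 and 9x32 need 12 and 18. Both stay well below the 32 zmm registers that AVX-512 provides, consistent with the note that the maximum is deliberately not used.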
- Kernel block size is 12x4
- Updated the zen4 config to enable these kernels in zen4 path.
- Tuned MC,KC,NC for better performance for m/n/k size > 500
- Updated CMakeLists.txt with ZGEMM kernels for windows build.
Kernel supports:
1. Preload and prebroadcast of A and B
2. Prefetch of C matrix
3. K loop is subdivided into multiple loops to maintain distance between C prefetches.
4. Special case when alpha/beta imag component is zero
5. Row/Col/General stride of Matrix C
AMD-Internal: [CPUPL-2998]
Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440
AVX512 optimised kernel for the double datatype supports
row-major and column-major matrices.
The packing kernel is a column-major implementation:
if the matrix is row major, we need to transpose the block before storing it;
if the matrix is column major, we store it directly.
AMD-Internal: [CPUPL-2966]
Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9
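What the packing kernel above produces can be illustrated with a scalar reference. This is only a sketch of the data layout, assuming a panel of mr rows and k columns described by a row stride rs and a column stride cs; the actual kernel uses AVX512 loads and in-register transposes, and `toy_pack_panel` is a hypothetical name.

```c
/* Hypothetical scalar reference: pack an mr x k panel of A into p so that
   each column of mr elements is contiguous (column-major packed panel).
   rs and cs are the row and column strides of A in elements. */
static void toy_pack_panel(int mr, int k,
                           const double *a, int rs, int cs,
                           double *p)
{
    for (int j = 0; j < k; j++)
        for (int i = 0; i < mr; i++)
            p[j * mr + i] = a[i * rs + j * cs];
}
```

For a column-major source (rs == 1) the inner loop reads contiguous memory, so a vector kernel can store what it loads. For a row-major source (cs == 1) the same loop strides through memory, which is why the vector kernel transposes each block before storing it.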
AVX512 packing kernel supports:
1. Dcomplex datatype
2. Row and column major matrix
AVX512 packing kernel does not support:
1. General stride matrix
2. Fringe cases (only multiples of 4 or 12 are supported)
3. Conjugation
scal2m will be used for the above unsupported functionality
AVX512 packing kernel is column preferred kernel
If matrix is row major, we need to transpose block before storing it.
If matrix is column major, we directly store it
AMD-Internal: [CPUPL-3088]
Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737
- Main kernel is of size 24x8 and the associated fringe kernels
added are
- 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m
- 24x8, 24x7, 24x6, 24x5, 24x4, 24x3, 24x2, 24x1
- 16x8, 16x7, 16x6, 16x5, 16x4, 16x3, 16x2, 16x1
- 8x8, 8x7, 8x6, 8x5, 8x4, 8x3, 8x2, 8x1
- For fringe kernels, 24x? kernel handles 16 < m_remainder < 24
16x? kernel handles 8 < m_remainder <= 16
8x? kernel handles 0 < m_remainder <= 8
- Added a function 'bli_zen4_override_gemm_blkszs' to override
blocksizes and kernels to be used for SUP for supported storage
schemes.
- Updated the zen4 config to enable these kernels in zen4 path.
- Thresholds are yet to be derived.
- Updated CMakeLists.txt with DGEMM SUP kernels for windows build.
Kernel-specific details:
- K-loop is unrolled by 8 times to facilitate prefetch of B.
- For every load of one column of A, the corresponding column in
next panel of A is prefetched with T1 hint.
- One column of C is prefetched with T0 hint per iteration of LOOP2.
- TAIL_NITER is derived to be 3.
- For every unroll of k-loop, one row of B is prefetched with T0 hint.
- C-prefetching for row-storage is yet to be added.
- B-prefetching for col-storage is yet to be added.
- Support for C transpose is yet to be added.
AMD-Internal: [CPUPL-2755], [CPUPL-2409]
Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48
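The m_remainder ranges listed in the 24x8 DGEMM SUP commit above map onto fringe kernel heights as sketched below. The function names are hypothetical; the sketch only encodes the ranges stated in the commit and assumes the remainder is taken modulo the main kernel height of 24.

```c
/* Rows left over after all full 24-row blocks have been processed. */
static int toy_m_remainder(int m)
{
    return m % 24;
}

/* Height of the fringe kernel that handles a given remainder:
   24x? for 16 < r < 24, 16x? for 8 < r <= 16, 8x? for 0 < r <= 8.
   Returns 0 when there is no fringe work. */
static int toy_fringe_kernel_rows(int m_remainder)
{
    if (m_remainder > 16) return 24;
    if (m_remainder > 8)  return 16;
    if (m_remainder > 0)  return 8;
    return 0;
}
```

For example, m = 41 leaves a remainder of 17, which is handled by a 24x? fringe kernel.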
- Added DGEMM and DTRSM row preferred micro kernels.
- DTRSM left lower and left upper micro kernels are added.
- DGEMM kernel is optimized for both row stored C and col
stored C.
AMD-Internal: [CPUPL-2745]
Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.
AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero
or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file to the CMake list.
AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
- Currently the pointer received as function argument is
used for packing which causes only a partial copy of
input buffer to output buffer due to strange optimizations
by compiler.
- To fix this, instead of using a normal pointer for output
buffer, we define a "restrict" local pointer variable.
- "restrict" keyword tells the compiler that the pointer is
the only way to access the object pointed by the pointer.
- By defining "restrict" local pointer pointing to output
buffer, the mysterious problem of incomplete copy has
been solved.
Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646
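The pattern described in the commit above, a "restrict" local pointer aliasing the output argument, looks roughly like this. The function is a hypothetical stand-in for the packing routine and does not reproduce the compiler behaviour that motivated the fix.

```c
#include <stddef.h>

/* Hypothetical copy routine showing the pattern only. */
static void toy_pack_copy(const float *in, float *out_arg, size_t n)
{
    /* Local restrict-qualified pointer to the output buffer. This is a
       promise to the compiler that, within this scope, the output is
       accessed only through `out`. */
    float *restrict out = out_arg;

    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}
```

Note that restrict is a promise made by the programmer, not something the compiler checks: if the input and output buffers overlap, the behaviour is undefined.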
- Added kernels for all rv and rd variants.
- Main kernel is of size 6x64, and the associated fringe kernels
added are
- 4x64, 2x64, 1x64
- 6x32, 4x32, 2x32, 1x32
- 6x16, 4x16, 2x16, 1x16
- Updated the zen4 config to enable these kernels in zen4 path.
- Added C-prefetching to 6x? row-stored main kernels.
- C-prefetching for column storage yet to be added.
- K-loop unrolling for fringe kernels yet to be added.
AMD-Internal: [CPUPL-3002]
Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
is no need to perform computation hence return early.
- When alpha is zero expert interface of ZSETV is invoked. In this case,
all the elements of the input vector are set 0.
- Invocation of the expert interface means that a NULL pointer can be passed
to the function in place of the context. The expert interface of ZSETV will
query the context and get the appropriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
BLAS macro interface only for single complex type and single complex,
float mixed type
AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
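The special cases described for ZSCALV above (empty vector, alpha equal to 1, alpha equal to 0) can be sketched with a scalar reference. The type and function names are hypothetical, a positive incx is assumed, and the zero case is written inline here rather than calling a separate SETV routine as the commit does.

```c
/* Hypothetical double-complex type with interleaved real/imag parts. */
typedef struct { double real, imag; } toy_dcomplex;

/* Hypothetical scalar reference for x := alpha * x (incx > 0 assumed). */
static void toy_zscalv(long n, toy_dcomplex alpha, toy_dcomplex *x, long incx)
{
    /* Empty vector: nothing to do. */
    if (n <= 0) return;

    /* alpha == 1: scaling is the identity, so skip the computation. */
    if (alpha.real == 1.0 && alpha.imag == 0.0) return;

    /* alpha == 0: set every element to zero instead of multiplying. */
    if (alpha.real == 0.0 && alpha.imag == 0.0) {
        for (long i = 0; i < n; i++) {
            x[i * incx].real = 0.0;
            x[i * incx].imag = 0.0;
        }
        return;
    }

    /* General case: complex multiply of each element by alpha. */
    for (long i = 0; i < n; i++) {
        double xr = x[i * incx].real;
        double xi = x[i * incx].imag;
        x[i * incx].real = alpha.real * xr - alpha.imag * xi;
        x[i * incx].imag = alpha.real * xi + alpha.imag * xr;
    }
}
```

Setting elements to zero directly, rather than multiplying by zero, also overwrites any NaN or Inf values in x, which is what "all the elements of the input vector are set 0" implies.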
-Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics
code) when compiled using gcc. The presence of -fschedule-insns +
-fschedule-insns2 + -ftree-pre in O2 compiler optimization flags
results in the code being optimized to reduce data stalls, and results
in the usage of stack to store intermediate C register output. Disabling
-ftree-pre in gcc fixes the issue, even in the presence of the other
two flags.
AMD-Internal: [CPUPL-2971]
Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463
Details:
- Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path.
Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
implementing these optimizations.
- In the initial version of the 32x6 DGEMM kernel, elements of packed B were broadcast by
loading two elements into an xmm register, broadcasting from xmm into zmm, and then using
vpermilpd(xmm) to get the next element. This logic is replaced with a direct broadcast from
memory: since the elements of Bpack are stored contiguously, the first broadcast fetches
the cacheline and subsequent broadcasts are faster. We use two registers for broadcast
and interleave broadcast operations with FMAs to hide memory latencies.
- Native dTRSM uses the 16x14 dgemm kernel, so the default blkszs (MR, NR, ...) must be overridden
when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
for the double datatype as well, in the functions bli_trsm_front() and bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR", and in bli_cntx_init_zen4() replaced the
dgemm kernel for TRSM with the 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]
Change-Id: I576282382504b72072a6db068eabd164c8943627
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments
AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
-Implemented (r)ow preferential (d)ot product milli-kernels
(m and n variants) for dcomplex datatype along SUP path.
-These computational kernels extend the support for handling RRC and
CRC storage schemes along the SUP path. In case of BLAS api call,
it corresponds to the input cases with transa equal to T and
transb equal to N.
-In case of the B matrix being packed(conditionally), the inputs are
redirected to the existing (r)ow preferential (v)ector load optimized
kernels due to better performance.
-Added macro for vhsubpd assembly instruction, to support the arithmetic
for complex datatype in its interleaved storage.
AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in
the netlib LAPACK tests, specifically in:
./xlintstd < ../dtest.in > dtest.out
in the TESTING/LIN directory. Given time constraints, i.e. the need to
finalize code for AOCL 4.0 release, disable calls to AVX512 kernel
(i.e. always use the AVX2 kernel) for now, and aim to correct
bli_damaxv_zen_int_avx512 for AOCL 4.1.
AMD-Internal: [CPUPL-2590]
Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.
AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.
AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
- Updated zen4 configuration to enable AVX512 flags for the
reference kernels
- Reference and vector kernels will use the same compiler flags
AMD-Internal: [CPUPL-2533]
Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
files
AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
-Updated optimal threads in zgemm sup path for skinny matrices.
-Fine tuned the threshold values for small and sup paths
to improve overall zgemm.
-Zgemm small is selected for inputs with transb as N.
-Redirection of input among small, sup and native path
was fine tuned.
AMD-Internal : [CPUPL-1900]
Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f
- Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 Native SGEMM kernel.
AMD-Internal: [CPUPL-2385]
Change-Id: I1feae5ac79e960c6b26df24756d460243820b797
- Updated zen4 configuration to add -march=znver4 flag in the
compiler options if the gcc version is above or equal to 12
AMD-Internal: [CPUPL-1937]
Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f