This reverts commit a028108cbb.
Reason for revert: With libgomp, scalability issues were observed with a
higher number of threads, leading to the use of fewer
threads. However, with different OpenMP libraries
like libomp, this scalability issue was not observed,
and using fewer threads resulted in performance loss.
The AOCL dynamic logic has been updated to select a
higher number of threads, considering the iomp OpenMP
library.
Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
incurred when the single-threaded path is optimally performant.
AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
Change usage of global_rntm and tl_rntm to elimate need
for mutex operations when accessing global_rntm. Usage of
these data structures is now as follows:
* global_rntm is set once during bli_init_apis and includes
all getenv calls to check BLIS threading and error printing
environment variables. global_rntm is then read-only.
* tl_rntm is intialized once from global_rntm on each
application thread. Any calls to BLIS set threading/ways
APIs will update tl_rntm for that application thread only
(Previously they updated global_rntm for all application threads).
* Re-initialize info_value in tl_rntm in every call to bli_init APIs.
* In bli_rntm_init_from_global() we initialize the local (per API
call) rntm as a copy of tl_rntm and then update threading values
in bli_thread_update_rntm_from_env() to reflect the current status
of OpenMP runtime ICVs.
AMD-Internal: [CPUPL-6168][SWLCSG-3143]
Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6
bli_nthreads_optimum is exported and called directly by AOCL libFLAME,
however it was only defined if building a multithreaded BLIS library
with AOCL_DYNAMIC enabled. Change to always define this function. If
BLIS is serial or if AOCL_DYNAMIC is disabled, this function returns
without modifying the supplied rntm.
Change-Id: Ie65690e9e6ec2a8ea77b3778f96676a68e6260be
- Reverted the change done for tuning ddotv API. When number of threads
is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads
are not calculated and as a result number of threads value is -1.
OpenMP threads are launched with -1 value. This results in crash.
This bug is fixed by correctly calculating number of threads.
AMD-Internal: [SWLCSG-3028][CPUPL-5689]
Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a
Different Zen processors may have a 512-bit, 256-bit or 128-bit
FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows
a selection of FP512 or FP256 width in BIOS settings. Add cpuid
code to detect the width and store an indication of it in the
global variable bli_fp_datapath. This should be accessed internally
via the function bli_cpuid_query_fp_datapath(). This functionality
is currently only enabled on x86_64 platforms and only currently
reports a value for AMD CPUs.
Also add Zen3 as a fallback path for any unknown AMD processors if
AVX512 is not supported or has been disabled.
AMD-Internal: [CPUPL-4415]
Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
in frame/3/bli_l3_sup.c, currently assumes that non-x86
platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
builds.
- Add guards around omp pragma to avoid possible gcc
compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.
AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- Create seperate AOCL Dynamic values for
multithreading dcopy API for zen1, zen2 and zen3
AMD-Internal: [CPUPL-5238]
Change-Id: I42f56393716edeeace8bfe71d7adab0ba7325b47
- Removed some of the unrolling factors that affected the
performance of AVX2 DAXPYV kernel. In addition to improving
the current performance on sizes compatible to single-threaded
runs, this will now perform better for tiny sizes as well
since the overhead to reach the computation is less.
- Updated the vector partitioning logic, by using
bli_thread_range_sub( ... ), which ensures that there is no
false sharing among multiple threads.
- Updated the AOCL-DYNAMIC logic for the API, to include thresholds
or zen4 and zen5 micro-architectures.
AMD-Internal: [CPUPL-5514]
Change-Id: Iee9edddac685334213cd6694421ab3df3547e930
- Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512"
kernel.
- Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512"
and "bli_scopyv_zen4_asm_avx512" kernels.
- Replaced existing load balancing algorithm for dcopy API with
"bli_thread_range_sub" algorithm.
- Included AOCL-dynamic values for optimial number of threads
for zen5 architecture.
AMD-Internal: [CPUPL-5238]
Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957
- Replaced 'bli_zaxpyv_zen_int5' kernel with optimised
'bli_zaxpyv_zen_int_avx512' kernel for zen4 and
zen5 config.
- Implemented multithreading support and AOCL-dynamic
for ZAXPY API.
- Utilized 'bli_thread_range_sub' function to achieve
better work distribution and avoid false sharing.
AMD-Internal: [CPUPL-5250]
Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b
* commit 'cfa3db3f':
Fixed bug in mixed-dt gemm introduced in e9da642.
Removed support for 3m, 4m induced methods.
Updated do_sde.sh to get SDE from GitHub.
Disable SDE testing of old AMD microarchitectures.
Fixed substitution bug in configure.
Allow use of 1m with mixing of row/col-pref ukrs.
AMD-Internal: [CPUPL-2698]
Change-Id: I961f0066243cf26aeb2e174e388b470133cc4a5f
* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
- Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512
computational kernel for DNRM2 API.
- Updated the header to include this kernel signature, as well
as the framework layer to use this function in case of ZEN4
and ZEN5 configurations.
- Updated the tipping points for ideal thread setting in DNRM2
for ZEN5 micro-architecture. These thresholds are specific
to the library's linkage to LLVM's OpenMP or GNU's OpenMp.
- Further abstracted the AOCL-DYNAMIC logic to separate functions
for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2).
- Further updated the ?NRM2 framework to accommodate the necessary
changes to invoke the newer AOCL-DYNAMIC functions and the AVX512
kernel, when needed.
- Added micro-kernel and memory tests for this kernel in GTestsuite,
to validate accuracy and out-of-bounds read and write.
AMD-Internal: [CPUPL-5265]
Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049
- Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs.
- Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it.
AMD-Internal: [CPUPL-4877]
Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba
Changes to how AOCL_ENABLE_INSTRUCTIONS handles requests
for different ISAs (i.e. BLIS sub-configurations):
- Add missing SSE and AVX options. These will all chose the
generic option in amdzen builds.
- For unsupported ISAs (e.g. AVX512 on Milan), select the
hardware's default sub-configuration instead of trying
to step down through alternative choices.
- For invalid options, or options not implemented in the BLIS
build (e.g. skx in amdzen build), select the hardware's
default sub-configuration instead of aborting.
Currently BLIS_ARCH_TYPE behaviour is not affected by these
changes.
AMD-Internal: [CPUPL-5078]
Change-Id: Idbd00d2806b1679889a9249878c51981c8d23b3f
- Updated nt_ideal to 12 for n_elem <= 2500000 and nt_ideal to 16 for
n_elem <= 4000000
AMD-Internal: [CPUPL-4408]
Change-Id: I97c143ab0d9b97e797358af93181c71d948757cc
- Fixed bug in DAXPYF MT kernel when incx != inca.
- Added AOCL Dynamic function for 1f kernels.
- Moved all DOTXF and AXPYF kernels into one file.
AMD-Internal: [CPUPL-4880]
Change-Id: I7d9f44625bc42fad4a9e5b218ecc382efdf22cbe
AOCL libFLAME optimizations directly call some internal
BLIS symbols. Export them to enable this to work with
the BLIS shared library.
AMD-Internal: [CPUPL-5044]
Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d
- Added AVX512 kernel for ZDOTV.
- Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support.
AMD-Internal: [CPUPL-5011]
Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8
Existing Design:
- GEMM AVX2 kernel performs computation and updates temporary C buffer
- Portion of temporary C buffer is copied to output C buffer
based on UPLO parameter
- For diagonal blocks, using GEMM kernels is not efficient
New Design: Implemented in current patch when UPLO='L'
- GEMMT kernel used for computation, temporary buffer is not required.
- Only required elements are computed using mask load store for all
fringe cases
- Exception: AVX2 code path is used when storage format is RRC, CRR, CRC
- AOCL-Dynamic is added based on dimension
- Check for AVX platform is added in SUP interface, It returns to
native implementation if hardware doesnot support AVX platform
- SUP ref_var2m is expanded for dcomplex datatype to avoid condition
check which exists for double datatype
AMD_Internal: [CPUPL-5006]
Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286
- Kernel dimensions are 4x4.
- Two kernels are implemented, Right Upper and
Right lower.
- In case of Left variants of TRSM, transpose is
induced so that Right variant kernels can be used.
- No packing is performed in these kernels.
- Changes are made in the threshold to pick ZTRSM small
code path.
- BLIS_INLINE is removed from signature of
"TRSMSMALL_KER_PROT".
- These kernels do not support "ENABLE_TRSM_PREINVERSION".
- Newly added kernels do not support conjugate
transpose.
- Added multithreading to ZTRSM small code path.
AMD-Internal: [CPUPL-4324]
Change-Id: I683b1d5239593e54f433e7f27497d72dfbd9141c
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
factor 8 and 32.
- Multithreading is also added to the DAXPYf
kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
in ZEN4
AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
- Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_
using respective AVX512 intrinsics including masked
load and store operations.
- Implemented AVX512 kernels for scopy_, dcopy_ and
zcopy_ using assembly language to prevent loss of
performance during the translation of intrinsics.
- Updated the dcopy_blis_impl( ... ) and
zcopy_blis_impl( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Implemented OpenMP parallelization for dcopyv_ and
zcopyv_ APIs, while scopyv_ and ccopyv_ only support
single thread.
AMD-Internal: [CPUPL-4854]
Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e
Correct the order of tests in bli_cpuid.c to test all known zen
AVX512 platforms before considering fallback tests on AVX512
support. This avoids builds with "configure auto" or
"cmake -DBLIS_CONFIG_FAMILY=auto" incorrectly selecting zen5
sub-configuration on zen4 systems.
AMD-Internal: [CPUPL-4966]
Change-Id: I8706382e2df7c9ae4bb456e3a7f465053e15beea
- Comparision using bli_eqsc is slower than direct comparison.
- Changed comparision logic for 1x1 matrix
from bli_sqsc to direct comparision.
AMD-Internal: [CPUPL-4324]
Change-Id: Ifb2d0ad7a97c8bf33b66d624a7ecc53e38c1c803
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
Implement initial support for Zen5 systems:
- Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B
instructions.
- Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5
will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but
different hardware model will still be detected.
AMD-Internal: [CPUPL-3518]
Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600
User control over code path using AOCL_ENABLE_INSTRUCTIONS
or BLIS_ARCH_TYPE only makes sense for fat binary builds.
Thus this functionality is now disabled by default for
single architecture builds. User can still override the default
selections by using configure options --enable-blis-arch-type
or --disable-blis-arch-type.
Other changes:
- include x86_64 family as using zen codepaths in cmake build system.
- Update help and error messages to include AOCL_ENABLE_INSTRUCTIONS.
AMD-Internal: [CPUPL-4202]
Change-Id: I7aa5fcf89df8675bcc12d81f81781de647e0fcf8
* commit '5013a6cb':
More edits and fixes to docs/FAQ.md.
Fixed newly broken link to CREDITS in FAQ.md.
More minor fixes to FAQ.md and Sandboxes.md.
Updates to FAQ.md, Sandboxes.md, and README.md.
Safelist 'master', 'dev', 'amd' branches.
Re-enable and fix fb93d24.
Reverted fb93d24.
Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
Removed last vestige of #define BLIS_NUM_ARCHS.
Added new packm var3 to 'gemmlike'.
Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
Fix more copy-paste errors in the haswell gemmsup code.
Do a fast test on OSX. [ci skip]
Fix AArch64 tests and consolidate some other tests.
Use C++ cross-compiler for ARM tests.
Attempt to fix cxx-test for OOT builds.
Updated travis-ci.org link in README.md to .com.
Disabled (at least temporarily) commit 8e0c425.
Define BLIS_OS_NONE when using --disable-system.
Updated stale calls to malloc_intl() in gemmlike.
Blacklist clang10/gcc9 and older for 'armsve'.
Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
Moved lang defs from _macro_def.h to _lang_defs.h.
Minor tweaks to gemmlike sandbox.
Added local _check() code to gemmlike sandbox.
README.md citation updates (e.g. BLIS7 bibtex).
Tweaks to gemmlike to facilitate 3rd party mods.
Whitespace tweaks.
Add row- and column-strides for A/B in obj_ukr_fn_t.
Clean up some warnings that show up on clang/OSX.
Remove schema field on obj_t (redundant) and add new API functions.
Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
Disabled sanity check in bli_pool_finalize().
Implement proposed new function pointer fields for obj_t.
AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
- This commit helps improving performance for very small input
by reducing framework check and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single threaded computation
for such sizes.
- It invokes bli_dgemm_tiny_6x8_kernel for ZEN, ZEN2, ZEN3 and ZEN4
code path. Except for the case AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such a small inputs are
routed to bli_dgemm_tiny_24x8_kernel avx512 kernel.
AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
- AOCL dynamic logic that determines the number of threads to be
launched has been modified.
AMD-Internal: [CPUPL-3956]
Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as it is accessed in
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider NC range for reorder.
3. It has to work with NC' < NC, and the offset is calulated
using prev NC panels spanning k dim + cur NC panel spaning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative
to BLIS_ARCH_TYPE. The details are:
1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both
supported, with BLIS_ARCH_TYPE taking precedence if both are set.
2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4"
code paths respectively in AMD focused builds, or for "skx" and
"haswell" respectively in Intel focused builds. These names are
not case-sensitive.
3. BLIS_ARCH_TYPE specifies the code path to use. If this is
unsupported, e.g. zen4 code path on a Milan or earlier system,
that code path is still executed, likely resulting in an illegal
instruction error.
4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on
the system (for AVX2 and AVX512), and try a "lower" ISA option if
the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic.
5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set.
AMD-Internal: [CPUPL-4105]
Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
is controlled by setting the environment variable
BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
assuming xerbla was not set to stop on error. Example call is
info = bli_info_get_info_value();
The default behaviour remains to print but don't stop on error,
i.e. the equivalent to
export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0
Implementation details:
- Values of the environment variables are stored and retrieved
from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
to 0 during initialization for all calls and updated by xerbla
if an error has occurred.
- Call to bli_init_auto before calling PASTEBLACHK macro (which
calls xerbla) will reinitialize info_value to 0 via call to
bli_thread_update_rntm_from_env
AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme isn't matching, and the respective matrix being loaded isn't
packed either, on-the-go packing has been enabled for such cases to
pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
and compute APIs are forced to take n_threads=1.
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
Ensure functions bli_cpuid_query_id() and
bli_cpuid_query_model_id() are defined for all
architectures in bli_cpuid.c
AMD-Internal: [CPUPL-3838]
Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- TRSM and GEMM has different blocksizes in zen4, in order
to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
- Existing logic is not picking the ideal number
of threads for some problem sizes.
- Problem size and their corresponding ideal number
of threads are retuned for daxpy in aocl dynamic.
AMD-Internal: [CPUPL-3484]
Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7
Details:
- Eliminated the need for override function in SUP for GEMMT/SYRK.
- New set of block sizes, kernels and kernel preferences
are added to cntx data structure for level-3 triangular routines.
- Added supporting functions to set and get the above parameters from cntx.
- Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
In case they are not set, use the default block sizes/kernels of
Level-3 SUP.
AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
- Missing break statement will result in unexpected control flow.
This function will not launch the threads for the API in question
according to the AOCL dynamic logic without the break statement.
AMD-Internal: [CPUPL-3436]
Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698
Some text files were missing a newline at the end of the file.
One has been added.
Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104
AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
architectures, but function bli_check_valid_model_id() should
check the provided model_id against the suitable range within
the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
checking of a user specified value of BLIS_ARCH_TYPE against
the enabled configurations is delayed to a separate function,
bli_arch_check_id().
- Default selection based on hardware can be overridden using the
BLIS_MODEL_TYPE environment variable. Valid values are:
Genoa, Bergamo, Genoa-X, Milan, Milan-X
Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
but will result in the default option for that architecture being
selected. This is different to specifying an incorrect value of
BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
the --rename-blis-model-type argument to configure (or cmake
equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
--rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
currently only for AMD cpus. Functions are provided to query
these from other parts of the code, namely:
uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702