Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
initialized as any unsigned integer, including 0. However, a length of
0 could be problematic given that malloc(0) is undefined and therefore
variable across implementations. As a safety measure, we check for
block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
Change-Id: I1e885d887aaba5e73df091ef52e6c327fd6418de
Details:
- The current mechanism for growing a pool_t doubles the length of the
block_ptrs array every time the array length needs to be increased
due to new blocks being added. However, that logic did not take in
account the new total number of blocks, and the fact that the caller
may be requesting more blocks that would fit even after doubling the
current length of block_ptrs. The code comments now contain two
illustrating examples that show why, even after doubling, we must
always have at least enough room to fit all of the old blocks plus
the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
from growing any pool_t that is initialized with a block_ptrs length
of 0. (Previously, the memory pool for packed buffers of C was
initialized with a block_ptrs length of 0, but because it is unused
this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
Change-Id: Ie4963c56e03cbc197d26e29f2def6494f0a6046d
- Fine-tuned the thread allocation logic for parallelizing DGEMMT for the cases where n <= 220. This results in performance improvement in multi-threaded DGEMMT for small values of n.
AMD-Internal: [CPUPL-2215]
Change-Id: I2654bc64d2dc43c2db911e0c9175755be3aa8ba5
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
from TRSM context it will invoke AVX2 GEMM kernels instead
of the default AVX-512 GEMM kernels.
- The default context has the block sizes for AVX512 GEMM
kernels, however, TRSM uses AVX2 GEMM kernels and they
need different block sizes.
- Added new API bli_zen4_override_trsm_blkszs(). It overrides
default block sizes in context with block sizes needed for
AVX2 GEMM kernels.
- Added new API bli_zen4_restore_default_blkszs(). It restores
The block sizes to there default values (as needed by default
AVX512 GEMM kernels).
- Updated bli_trsm_front() to override the block sizes in the
context needed by TRSM + AVX2 GEMM kernels and restore them
to the default values at the end of this function. It is done
in bli_trsm_front() so that we override the context before
creating different threads.
AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.
AMD-Internal: [CPUPL-2102]
Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
- auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set
irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at
runtime. Currently the auto_factor is enabled if num_threads > 0 and not
reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm.
This results in gemm/gemmt SUP path applying 2x2 factorization of
num_threads, and thereby modifying the preset factorization. This issue
is not observed in native path since factorization happens without
checking auto_factor value.
- Setting omp threads to n_threads using omp_set_num_threads after the
global_rntm n_threads update in bli_thread_set_num_threads. This ensures
that in bli_rntm_init_from_global, omp_get_max_threads returns the same
value as set previously.
AMD-Internal: [CPUPL-2137]
Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426
Description:
1. Tuned number of threads to achive better performance
for dtrmm
AMD-Internal: [ CPUPL-2100 ]
Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593
Description:
1. Decision logic to choose optimal number of threads for
given input dgemm dimensions under aocl dynamic feature
were retuned based on latest code.
2. Updated code in few file to avoid compilation warnings.
3. Added a min check for nt in bli_sgemv_var1_smart_threading
function
AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
- Cache aware factorization.
Experiments shows that ic,jc factorization based on m,n gives better
results compared to mu,nu on a generic data set in SUP path. Also
slight adjustments in the factorizations w.r.t matrix data loads can
help in improving perf further.
- Moving native path inputs to SUP path.
Experiments shows that in multi-threaded scenarios if the per thread
data falls under SUP thresholds, taking SUP path instead of native path
results in improved performance. This is the case even if the original
matrix dimensions falls in native path. This is not applicable if A
matrix transpose is required.
- Enabling B matrix packing in SUP path.
Performance improvement is observed when B matrix is packed in cases
where gemm takes SUP path instead of native path based on per thread
matrix dimensions.
AMD-Internal: [CPUPL-659]
Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914
1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.
AMD-Internal: [CPUPL-2103]
Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled Overflow and Underflow Vulnerabilites in
ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
sizes 64<= m,n <= 256, by keeping the number of
threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
used for all variants.
AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
1. Removed small gemm call from native path to avoid Single threaded
calls as a part of MultiThreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.
Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
Details:
- During runtime, Application can set the desired number of threads using
standard OpenMP API omp_set_num_threads(nt).
- BLIS Library uses standard OpenMP API omp_get_max_threads() internally,
to fetch the latest value set by the application.
- This value will be used to decide the number of threads in the subsequent
BLAS calls.
- At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable
will be given precedence, over the OpenMP standard API omp_set_num_threads(nt)
and OMP_NUM_THREADS environment variable.
- Order of precedence followed during BLIS Initialization is as follows
1. Valid value of BLIS_NUM_THREADS
2. omp_set_num_threads(nt)
3. valid value of OMP_NUM_THREADS
4. Number of cores
- After BLIS initialization, if the Application issues omp_set_num_threads(nt)
during runtime, number of threads set during BLIS Initialization,
is overridden by the latest value set by the Application.
- Existing precedence of BLIS_*_NT environment variables and the decision of
optimal number of threads over the number of threads derived from the
above process remains as it is.
AMD-Internal: [CPUPL-2076]
Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8
Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding #include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is #included before bli_type_defs.h).
This is why the #include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
Change-Id: Ie7efdea366481ce25075cb2459bdbcfd52309717
We added new API to check if the CPU architecture has support
for AVX instruction. This API was calling CPUID instruction
every time it is invoked. However, since this information does
not change at runtime, it is sufficient to determine it once
and use the cached results for subsequent calls. This optimization
is needed to improve performance for small size matrix vector
operations.
AMD-Internal: [CPUPL-2009]
Change-Id: If6697e1da6dd6b7f28fbfed45215ea3fdd569c5f
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Number of threads are reduced to 1 when the dimensions
are very low.
- Removed uninitialized xmm compilation warning in trsm small
Change-Id: I23262fb82729af5b98ded5d36f5eed45d5255d5b
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
-- Reverted changes made to include lp/ilp info in binary name
This reverts commit c5e6f885f0.
-- Included BLAS int size in 'make showconfig'
-- Renamed amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
This sanity check (checking top_index != 0) which has been disabled
earlier in bli_pool_reinit() by commenting out bli_abort() was unnecessarily
computing top_index - this whole statement is commented out.
Change-Id: If296754ca8cba3a69d023d4a7ec891f1cbce1d6a
bli_pack_set_pack_b() API wrongly inits pack_a element of rntm object
instead of pack_b. Fixed this bug.
Change-Id: I267493ab3ff0bade478d1157799a6fa5b51c7970
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still lot of places were ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of exiting amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
- The function "bli_determine_blocksize_b_sub" uses a modulo operation
using b_alg. In cases where trsm blocksizes are not set, using this
function with trsm blocksizes being zero will lead to FPE.
- Added a check to avoid calling the above function with zeroes as
parameters.
Change-Id: I770aaa13125b55320a68ff9fc3da782111e0978a
Details:
- The field "trsm_blkszs" in cntx enables us to set and use blocksizes
that are specific to trsm only instead of using commom L3 blocksizes.
- While querying blocksizes for TRSM, Added a check to see if
trsm-specific blocksizes are set, if they are not set, we continue
trsm execution by using common Level-3 blocksizes.
- TRSM-specific blocksizes( if required ) needs to be set in
cntx_init function of respective configuration.
Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
adding threshold functions under a macro enables these functions for
all the configs under a family. This can be avoided by adding function
pointers to cntx which can be queried from cntx during run-time
based on the config chosen.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
-- In existing design memory pool supports buffers of only one
size, This size is determined at compile time to support buffer
needed for biggest block size and data type. However, the size
calculation is not generic as it considers sizes only for GEMM.
Also it assumes that the buffer will not be used for any other
purpose than packing operations.
-- If the new buffer is requested whose size is bigger than existing
size, The pool is re-initialized to contain buffers of new size,
however, this is done only if the pool is empty. If the pool is not
empty the execution is aborted. This is undesirable as in
mulithreaded scenarios it is possible that different threads needs
buffers of different sizes at the same time (i.e. while other thread
is still in middle of the operation).
-- This commit removes the restriction of single size buffer, when
new buffer is requested whose size if bigger than exiting one, no
re-init is done. New buffer of required size is allocated, added
to the pool and returned to the user.
Change-Id: I20acdb60eb06ab2e53366d51713aa83c4b2df0da
Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable
calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10).
Improved smart threading logic for dgemm,
Additional conditions at the blas interface added to invoke bli_dgemm_small.
Removed N > 3 condition from bli_dgemm_small.
Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
Details:
- Added a check condition in GEMMT native path to choose
update_triang routines based on whether the kernels are
row-preferred or column preferred.
- Moved the zen-specific SUP thresholds under BLIS_CONFIG_EPYC
macro. For non-zen architectures, it falls back to L3 SUP
thresholds.
- Modified SUP code path to always choose 2m variant. 1n variant
is not implemented for GEMMT.
Change-Id: Ifdd55815c588f645e337de80b5b9d1864f6b5dd3
Details:
1. Added aocl-dynamic for dtrsm native path
When (m,n)<512 better performance observed for nthreads=4
2. Updated trsm_small threshold such that when (m+n)<320
trsm_small is doing better than native irrespective of
number of threads
Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
Details:
- when AOCL dynamic is enabled, the decision to choose ST Vs MT
to solve SYRK is taken based on dimensions of matrices.
- Decisions to choose optimum number of threads will be updated in
the subsequent commits.
- Only local copy of rntm is modified by AOCL Dynamic feature.
global_rntm data structure remains unchanged in order to keep
track of original number of threads set by application.
- Added an early-exit condition in bli_nthreads_optimum when nt =1
or nt=-1. This ensures that AOCL dynamic feature is not used when
threading is set using BLIS_IC_NT or BLIS_JC_NT.
Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0
Details:
- Added decision logic to choose between SUP and native implementations
of SYRK for zen2 architectures.
- For architectures other than zen2 it will be redirected to gemm
threshold function.
Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS
or BLIS_NUM_THREADS' or bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
global_rntm data structure remains unchanged in order to keep track of
original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
number of threads recommended by the application.
AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
Details:
- Added bli_syrksup function that internally uses gemmt implementation.
- Modified OAPI of syrk to call SUP before proceeding to the
conventional implementation.
- Copied gemmsup threshold function for syrk temporarily. Thresholds are
yet to be derived for syrk.
Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
(in terms of strerror_s) from bli_thread.h to bli_env.c. It was
likely left behind in bli_thread.h in a previous commit, when code
that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
find any other instance of strerror_r being used in BLIS, so I moved
the #define directly to bli_env.c rather than place it in bli_env.h.)
The code that uses strerror_r is currently disabled, though, so this
commit should have no affect on BLIS.
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
1.Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2.change done in bli_zgemmsup_rv_zen_asm_3x4n.
3.change done in bli_zgemmsup_rv_zen_asm_3x4m.
4.change done in bli_zgemm_haswell_asm_3x4.
Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there is no configure option(s) to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool that determines whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via #454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH is set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertantly being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
Details:
- Implemented support for the user manually overriding the automatic
subconfiguration selection that happens at runtime. This override
can be requested by setting the BLIS_ARCH_TYPE environment variable.
The variable must be set to the arch_t id (as enumerated in
bli_type_defs.h) corresponding to the desired subconfiguration. If a
value outside this enumerated range is given, BLIS will abort with an
error message. If the value is in the valid range but corresponds to a
subconfiguration that was not activated at configure-time/compile-time,
BLIS will abort with a (different) error message. Thanks to decandia50
for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id to return the address of an
internal data structure within the gks. If this address is NULL, then
it indicates that the subconfig corresponding to the arch_t id passed
into the function was not compiled into BLIS. This function is used
in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
is returned for the latter of the two abort scenarios mentioned above,
along with a corresponding error message and a function to perform
the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
auto-detect.x executable during configure-time. This cpp branch is
similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
going forward. Also added a convenient debug switch that outputs the
compilation command for the auto-detect.x executable and exits.