- Warning is raised for the implicit declaration of bli_gemm_md_is_ccr()
when BLIS is configured with --disable-mixed-dt flag.
- Encapsulated the usage of bli_gemm_md_is_ccr( ... ) inside the
BLIS_ENABLE_GEMM_MD macro.
AMD-Internal: [CPUPL-4630]
Change-Id: Icc59b1bcd3a21492daaaf6bcec80a5bf67012ace
Non-zen configurations will use frame/compat/bla_gemm.c rather than
frame/compat/bla_gemm_amd.c. In the former, change dzgemm definition
to have dzgemm_blis_impl and optional dzgemm_ wrapper, as in the
AMD version.
AMD-Internal: [CPUPL-4082]
Change-Id: I66caff56e033bda8bb4ff2d60a16f7e52af122ea
Implement initial support for Zen5 systems:
- Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B
instructions.
- Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5
will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but
different hardware model will still be detected.
AMD-Internal: [CPUPL-3518]
Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600
Non aligned rntm_t struct can potentially have its first/last cache
line shared with other objects in memory. This could affect performance
depending on how much the shared cache lines are used. rntm_t struct is
aligned to 64 bytes to workaround this issue.
Change-Id: Id0956fca771be062ada9f81e8cd75ac1f290fd8e
- The bli_snormfv_unb_var1( ... ) and bli_cnormfv_unb_var1( ... )
functions posed an uninitialized pointer read coverity issue,
due to the local rntm_t object being declared as part of the
function scope, but initialized only on a need basis(i.e, when
attempting to pack x vector if incx != 1).
- The fix was to have the declaration and initialization inside
the case where incx != 1, thereby making the scope of the rntm_t
and mem_t objects more stringent.
- This required an additional condition to call the kernel in case
of unit stride.
AMD-Internal: [CPUPL-4278]
Change-Id: I763b1d4920532557749d8943f12b6df626aa5372
- Changed the threshold for using ZTRSM small
code path when multithreading is enabled.
- Very skinny matrices are not taken into
consideration in existing threshold tuning.
AMD-Internal: [CPUPL-4267]
Change-Id: I4294ec58a8535af7a9d618ae8f0d86407b66f341
Include bli_config.h before bli_system.h in
./frame/compat/cblas/src/cblas.h so that BLIS_ENABLE_SYSTEM is
defined correctly before it is needed. This copies the change
to ./frame/include/blis.h made in 1f527a93b9 (via merge
c6f3340125). Also standardize some comments and formatting
between blis.h and cblas.h
AMD-Internal: [CPUPL-4251]
Change-Id: Ie5cab646367f15003c25fa126344b02640d9106e
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
User control over code path using AOCL_ENABLE_INSTRUCTIONS
or BLIS_ARCH_TYPE only makes sense for fat binary builds.
Thus this functionality is now disabled by default for
single architecture builds. User can still override the default
selections by using configure options --enable-blis-arch-type
or --disable-blis-arch-type.
Other changes:
- include x86_64 family as using zen codepaths in cmake build system.
- Update help and error messages to include AOCL_ENABLE_INSTRUCTIONS.
AMD-Internal: [CPUPL-4202]
Change-Id: I7aa5fcf89df8675bcc12d81f81781de647e0fcf8
- For the inputs where either m or n is 1, based on right or
left side, it invokes c/z scalv kernel and post that it scales the
matrix post checking whether the input is blis conjugate transpose
or not.
- Previously the check condition was case sensitive *diaga = 'n',
and as a result, it is always executing the "else" code-part.
- Fixed the condition check.
AMD-Internal: [CPUPL-4204]
Change-Id: Iae2514c742ab17ac6c6e43036da095a74ad131c5
Changes in commit 64a1f786d5 (via merge c6f3340125) included in
./frame/include/bli_type_defs.h a prototype that uses the C
restrict keyword. When using C++ we need to provide a definition
for this C language keyword. This is done in bli_lang_defs.h which
was included in blis.h but not in cblas.h.
AMD-Internal: [CPUPL-4188]
Change-Id: I75d5f32599d18794331ff452e562eb42afb5ae93
* commit '5013a6cb':
More edits and fixes to docs/FAQ.md.
Fixed newly broken link to CREDITS in FAQ.md.
More minor fixes to FAQ.md and Sandboxes.md.
Updates to FAQ.md, Sandboxes.md, and README.md.
Safelist 'master', 'dev', 'amd' branches.
Re-enable and fix fb93d24.
Reverted fb93d24.
Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
Removed last vestige of #define BLIS_NUM_ARCHS.
Added new packm var3 to 'gemmlike'.
Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
Fix more copy-paste errors in the haswell gemmsup code.
Do a fast test on OSX. [ci skip]
Fix AArch64 tests and consolidate some other tests.
Use C++ cross-compiler for ARM tests.
Attempt to fix cxx-test for OOT builds.
Updated travis-ci.org link in README.md to .com.
Disabled (at least temporarily) commit 8e0c425.
Define BLIS_OS_NONE when using --disable-system.
Updated stale calls to malloc_intl() in gemmlike.
Blacklist clang10/gcc9 and older for 'armsve'.
Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
Moved lang defs from _macro_def.h to _lang_defs.h.
Minor tweaks to gemmlike sandbox.
Added local _check() code to gemmlike sandbox.
README.md citation updates (e.g. BLIS7 bibtex).
Tweaks to gemmlike to facilitate 3rd party mods.
Whitespace tweaks.
Add row- and column-strides for A/B in obj_ukr_fn_t.
Clean up some warnings that show up on clang/OSX.
Remove schema field on obj_t (redundant) and add new API functions.
Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
Disabled sanity check in bli_pool_finalize().
Implement proposed new function pointer fields for obj_t.
AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
Added all fringe kernels with mask load store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format
- New edge kernels that uses masked load-store
instructions for handling corner cases.
- Mask load-store instruction macros are added.
vmaskmovps, VMASKMOVPS for masked load-store.
- It improves performance by reducing branching overhead
and by being more cache friendly.
- Mask load-store is added only for row storage format
AMD-Internal: [CPUPL-4041]
Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
- Added a conditional check to see if the vectorized kernels
for DNRM2_ and DZNRM2_ can be called directly, without
incurring any framework overhead.
- The condition to satisfy this fast-path is for the size to be
such that the ideal threads required is 1, with the vector having
unit stride( so that packing at the framework-level can be avoided ).
AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
1. Added input parameter checking for the extension APIs
1. gemm_pack_get_size API
2. gemm_pack API
2. Additionally added early returns for these APIs when
m or n dimensions are 0.
3. Routines for input parameter check for all the 3
BLAS extension APIs - gemm_pack_get_size, gemm_pack and
gemm_compute are defined in:
frame/compat/check/bla_gemm_pack_compute_check.h
4. Added AOCL DTL TRACE for all the functions of
1. gemm_pack_get_size
2. gemm_pack
3. gemm_compute
AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
- This commit helps improving performance for very small input
by reducing framework check and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single threaded computation
for such sizes.
- It invokes bli_dgemm_tiny_6x8_kernel for ZEN, ZEN2, ZEN3 and ZEN4
code path. Except for the case AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such a small inputs are
routed to bli_dgemm_tiny_24x8_kernel avx512 kernel.
AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
- According to BLAS Standards, SCALV should return when incx .le. 0.
- To make SCALV compliant to this, added an early return inside the BLAS
layer, for the cases where incx <= 0.
- Also, added early return for the case where alpha is a unit scalar.
AMD-Internal: [CPUPL-3562]
Change-Id: Id474fdd6ed9232226f5c5381d0398f43384e4a49
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
a layer higher.
- Added a scalar loop to handle compute in case of non-unit strides.
This loop ensures functionality in case packing fails at the
framework level.
AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
- When configured for haswell config "Warning unused variable 'zero'"
was throwed during compilation.
- Removed zero variable which is not being used
AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
- This commit implements avx512 dgemm kernel for k=1 cases.
which gets called for zen4 codepath.
- Added architecture check for k=1 kernel in dgemm code path
to pick correct kernel based on cpu arhcitecture since now
blis is having avx2 and avx512 dgemm kernels for k=1 case.
- Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was
being called irrespective of architecture type.
- Added architecture check before calling the kernel for case where
k=1, so only for respective architectures this kernel is invoked.
AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
- AOCL dynamic logic that determines the number of threads to be
launched has been modified.
AMD-Internal: [CPUPL-3956]
Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc
- BLAS compute checks updated to properly check for rs_c and cs_c.
- Updated BLAS compute checks to skip validity check if m==1 or n==1.
For the same reason, added a check just before to validate rs_c and
cs_c are greater than or equal to 1.
- Added tiny size tests to gtestsuite as a sanity check.
- Also updated the Invalid Input Tests to test for the updated checks.
AMD-Internal: [CPUPL-4140]
Change-Id: I984339ec7909778b58409ffcdbeed4ee33f28cfb
- Symbols for gemm_pack_get_size were not being exported properly when
BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
compute set of BLAS Extension APIs:
1. ?gemm_pack_get_size
2. ?gemm_pack
3. ?gemm_compute
AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as it is accessed in
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider NC range for reorder.
3. It has to work with NC' < NC, and the offset is calulated
using prev NC panels spanning k dim + cur NC panel spaning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
framework queries the architecture ID and calls the
vectorized kernel based on the architecture support.
- In case of not having the architecture support, we use
the default path based on the sumsqv method.
AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
present in the bli_gemv_unf_var2_amd.c file, as part of the
bli_sgemv_unf_var2( ... ) function. This was changed to
bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
from 6 to 5 ), in accordance to the function pointer present
in the zen3 and zen4 context files.
- Changed the accumulator type to double from float, inside the
fringe loop for unit-strides(vectorized path) and non-unit strides
(scalar code).
AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative
to BLIS_ARCH_TYPE. The details are:
1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both
supported, with BLIS_ARCH_TYPE taking precedence if both are set.
2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4"
code paths respectively in AMD focused builds, or for "skx" and
"haswell" respectively in Intel focused builds. These names are
not case-sensitive.
3. BLIS_ARCH_TYPE specifies the code path to use. If this is
unsupported, e.g. zen4 code path on a Milan or earlier system,
that code path is still executed, likely resulting in an illegal
instruction error.
4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on
the system (for AVX2 and AVX512), and try a "lower" ISA option if
the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic.
5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set.
AMD-Internal: [CPUPL-4105]
Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315
- Register ZMM16 to ZMM31 are zeroed after L3 api calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.
AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
- For k > KC, C matrix is getting scaled by beta on each
iteration. It should be scaled only once. Fixed the scaling
of C matrix by beta in K loop.
- Corrected A and B matrix buffer offsets, for cases where k > KC.
AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
is controlled by setting the environment variable
BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
assuming xerbla was not set to stop on error. Example call is
info = bli_info_get_info_value();
The default behaviour remains to print but don't stop on error,
i.e. the equivalent to
export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0
Implementation details:
- Values of the environment variables are stored and retrieved
from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
to 0 during initialization for all calls and updated by xerbla
if an error has occurred.
- Call to bli_init_auto before calling PASTEBLACHK macro (which
calls xerbla) will reinitialize info_value to 0 via call to
bli_thread_update_rntm_from_env
AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
- Added support for 2 new APIs:
1. sgemm_compute()
2. dgemm_compute()
These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
packed matrix identifier) and performs the GEMM operation:
C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
scheme isn't matching, and the respective matrix being loaded isn't
packed either, on-the-go packing has been enabled for such cases to
pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
it is the responsibility of the user to pack only one matrix with
alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
and compute APIs are forced to take n_threads=1.
AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
- Updated the bli_dnormfv_unb_var1( ... ) and
bli_znormfv_unb_var1( ... ) function to support
multithreaded calls to the respective computational
kernels, if and when the OpenMP support is enabled.
- Added the logic to distribute the job among the threads such
that only one thread has to deal with fringe case(if required).
The remaining threads will execute only the AVX-2 code section
of the computational kernel.
- Added reduction logic post parallel region, to handle overflow
and/or underflow conditions as per the mandate. The reduction
for both the APIs involve calling the vectorized kernel of
dnormfv operation.
- Added changes to the kernel to have the scaling factors and
thresholds prebroadcasted onto the registers, instead of
broadcasting every time on a need basis.
- Non-unit stride cases are packed to be redirected to the
vectorized implementation. In case the packing fails, the
input is handled by the fringe case loop in the kernel.
- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
cases of size = 2 ( and ) size = 1 or non-unit strides respectively.
AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
- This commit focused on enhancing the performance of dgemm
for matrices for very small dimenstions.
- blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. As SUP framework code path
requires the creation and initilization of blis objects,
accessing all the needed meta-information from objects, querying contexts
which adds performance penaulty while computing for matrices with very
small dimensions.
- To avoid such performance penaulty blis_dgemm_tiny function implements
a lightweight support code so that it can re-use dgemm SUP kernels such a way
that it directly operates on input buffers. It avoids framework overhead of
creating and intializing blis objects, context intialization, accessing other
large framework data structures.
- blis_dgemm_tiny function checks for threshold condition to match before
picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked
for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500.
While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is
invoked.
-blis_dgemm_tiny function supports single threaded computation as of now.
AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
- Added an explicit function definition for ZGEMV var 1. This
removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
instead of querying it.
- With this change fringe loop is vectorized using SSE
instructions.
AMD-Internal:[CPUPL-3997]
Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
- The bli_snormfv_unb_var1( ... ) function returns early in
case of n = 1, and uses the blis macro bli_fabs( ... ) to
set the norm to the absolute value of the element.
- This macro inverts the sign bit even if the element is 0.0.
A check is added to re-invert the sign bit in this case, so
that the norm is set to 0.0 instead of -0.0.
- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
when n = 1.
AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
enabled.
Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.
AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
1. 4 new APIs are added to support packed compute GEMM operations
1. dgemm_pack_get_size
2. sgemm_pack_get_size
3. dgemm_pack
4. sgemm_pack
2. Pack_get_size API
1. Returns size in bytes required for packing of input
2. Requires identifier to identify the input matrix to
be packed
3. Additionally requires 3 integer parameters for input
dimensions
3. Packed buffer is allocated using the pack size computed
4. Pack API:
1. Performs full matrix packing of the input
2. Additionally, performs the alpha scaling
3. Packed buffer created contains the full packed matrix
5. The GEMM compute calls are required to be operated on the
packed buffer with alpha = 1 since alpha scaling is already
done by the Pack API
6. GEMM Pack API eliminate the cost of packing the input matrixes
by avoiding on the go pack in the GEMM 5 loop.
Packing of input matrixes are done when there is resue of
matrixes across different GEMM calls.
AMD-Internal: [CPUPL-3560]
Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3
Correct compiler warnings when building with configure --int-size=32
- bla_imatcopy.c: Cast ints to longs to match %ld format
specification in error printf statement and change this to fprintf
to stderr. Also copy this additional fprintf statement to other
variants of this function.
- bli_type_defs.h: siz_t should always be the same size as a pointer.
This corrects an issue in bli_malloc.c when casting from a pointer
to a siz_t integer value.
AMD-Internal: [CPUPL-3519]
Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab
bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to
frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore
bli_gemmt_sup_var1n2m.c as of commit 10ca8710f0 as variant
for non-AMD codepath builds.
AMD-Internal: [CPUPL-3838]
Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f
Several improvements to BLIS DTL functionality
- For APIs that report performance statistics, test for time=0.0
before dividing by time when calculating GFLOPS.
- Call AOCL_DTL_TRACE_EXIT in the parameter checking functions
inlined from ./frame/compat/check/bla_*_check.h
- Correct flop count for complex routines.
AMD-Internal: [CPUPL-3736]
Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768
Ensure functions bli_cpuid_query_id() and
bli_cpuid_query_model_id() are defined for all
architectures in bli_cpuid.c
AMD-Internal: [CPUPL-3838]
Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- TRSM and GEMM has different blocksizes in zen4, in order
to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a