Memory allocated for pointer chars_for_dt was not freed at the
end of function in testsuite.
Freeing up of the buffer fixed the issue.
AMD-Internal: [CPUPL-3932]
Change-Id: I432c3ff95d289159f02a871b6d4fff5ab252ea9e
Updating cmake files to place include folder under
blis directory in new cmake system on windows.
AMD-Internal: [CPUPL-2748]
Change-Id: I650cca95193f7c89b39648ac1bda1fa1093b1560
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
* commit 'b683d01b':
Use extra #undef when including ba/ex API headers.
Minor preprocessor/header cleanup.
Fixed typo in cpp guard in bli_util_ft.h.
Defined eqsc, eqv, eqm to test object equality.
Defined setijv, getijv to set/get vector elements.
Minor API breakage in bli_pack API.
Add err_t* "return" parameter to malloc functions.
Always stay initialized after BLAS compat calls.
Renamed membrk files/vars/functions to pba.
Switch allocator mutexes to static initialization.
AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
- TRSM and GEMM has different blocksizes in zen4, in order
to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
are removed. Instead of overriding the default blocksizes,
TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
and GEMM pack buffers have to be packed with default blocksizes.
To check if we are packing for TRSM, "family" argument is added
in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
it is not set then BLIS_GEMM_UKR has to be used. This functionality
has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
signature of bli_packm_init_pack.
AMD-Internal: [CPUPL-3781]
Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
Description: We have seen the library dependency issue when we are
linking the libomp.lib or libiomp5md.lib while building the library
for static multithreaded scenario. So we are removing the linking of
openmp library for static multithreaded blis library build. So that
user can link any openmp library(libomp.lib or libiomp5md.lib) while
building their applications by linking static multithreaded blis library.
AMD-Internal: [SWLCSG-2196]
Change-Id: I96722f3587ee555af12de664957c211c56fcf03d
- In zen4 arch TRSM and GEMM have different blocksizes.
TRSM call will update blockize in global cntx object
which is incorrect for GEMM, when GEMM and TRSM are
called in parallel.
- Hence using a local copy of cntx which holds blocksizes
would help.
AMD-Internal: [CPUPL-3019]
Change-Id: I5f0f5675b3917d2a11d582ac626ca5d8f4752c53
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
files
AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
- Enabled AVX2 TRSM + GEMM kernel path, when GEMM is called
from TRSM context it will invoke AVX2 GEMM kernels instead
of the default AVX-512 GEMM kernels.
- The default context has the block sizes for AVX512 GEMM
kernels, however, TRSM uses AVX2 GEMM kernels and they
need different block sizes.
- Added new API bli_zen4_override_trsm_blkszs(). It overrides
default block sizes in context with block sizes needed for
AVX2 GEMM kernels.
- Added new API bli_zen4_restore_default_blkszs(). It restores
The block sizes to there default values (as needed by default
AVX512 GEMM kernels).
- Updated bli_trsm_front() to override the block sizes in the
context needed by TRSM + AVX2 GEMM kernels and restore them
to the default values at the end of this function. It is done
in bli_trsm_front() so that we override the context before
creating different threads.
AMD-Internal: [CPUPL-2225]
Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55
1. Removed the libomp.lib hardcoded from cmake scripts and made it user configurable. By default libomp.lib is used as an omp library.
2. Added the STATIC_LIBRARY_OPTIONS property in set_target_properties cmake command to link omp library to build static-mt blis library.
3. Updated the blis_ref_kernel_mirror.py to give support for zen4 architecture.
AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
Updated the windows build system to link the user given openmp
library using -DOpenMP_libomp_LIBRARY=<Desired lib name> option
using command line or through cmake-gui application to build
blis library and its test applications. If user not given any
openmp library then by default openmp library will be
C:/Program Files/LLVM/lib/libomp.lib.
Change-Id: I07542c79454496f88e65e26327ad76a7f49c7a8c
1. Removed the libomp.lib hardcoded from cmake scripts and made it user configurable. By default libomp.lib is used as an omp library.
2. Added the STATIC_LIBRARY_OPTIONS property in set_target_properties cmake command to link omp library to build static-mt blis library.
3. Updated the blis_ref_kernel_mirror.py to give support for zen4 architecture.
AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
1. Added support in cmake scripts for linking libomp for blis multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's.
4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS.
5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file.
AMD Internal : [CPUPL-1630]
Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
- Verified merge of all gemmt related files
- Corrected testsuite/src/test_gemmt.c
AMD-Internal: [CPUPL-1561]
Change-Id: I5fe03b8e3754e4ed96c927ef7570be6f9d4f528b
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
- Added an err_t* parameter to memory allocation functions including
bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
already use the return value to return the allocated memory address,
they can't communicate errors to the caller through the return value.
This commit does not employ any error checking within these functions
or their callers, but this sets up BLIS for a more comprehensive
commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
bli_type_defs.h. This was done so that what remains of bli_malloc.h
can be included after the definition of the err_t enum. (This ordering
was needed because bli_malloc.h now contains function prototypes that
use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
bli_param_macro_defs.h. These functions provide easy checks for error
codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
the API for bli_malloc_user(), which is an exported symbol in the
shared library. However, it's quite possible that the only application
that calls bli_malloc_user()--indeed, the reason it is was marked for
symbol exporting to begin with--is the BLIS testsuite. And if that's
the case, this breakage won't affect anyone. Nonetheless, the "major"
part of the so_version file has been updated accordingly to 4.0.0.
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
1. SquarePacked algorithm focuses on efficient zgemm/dgemm implementation for square matrix sizes (m=k=n)
2. Variation of 3m algorithm (3m_sqp) is implemented to allow single load and store of C matrix in kernel.
3. Currently the method supports only m multiple of 8. Residues cases to be implemented later.
4. dgemm Real kernel (dgemm_sqp) implementation without alpha, beta multiple is done,
since real alpha and beta scaling are in 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
Details:
- Exit early from libblis_test_amaxv_check() when the vector dimension
(length) of x is 0. This allows the module to run when the testsuite
driver passes in a problem size of 0. Thanks to Meghana Vankadari for
alerting us to this issue via #459.
- Note: All other testsuite modules appear to work with problem sizes
of 0, except for the microkernel modules. I chose not to "fix" those
modules because a failure (or segmentation fault, as happens in this
case) is actually meaningful in that it alerts the developer that some
microkernels cannot be used with k = 0. Specifically, the 'haswell'
kernel set contains microkernels that preload elements of B. Those
microkernels would need to be restructured to avoid preloading in
order to support usage when k = 0.
Details:
- Fixed a bug in the gemmt testsuite module that only manifested when
testing of gemmt is enabled but testing of gemv is disabled. The bug
was due to a copy-paste error dating back to the introduction of gemmt
in 88ad841.
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH is set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertantly being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
by a factor of five in order to avoid residuals that unnecessarily
fall in the MARGINAL range. This commit should fix#455. Thanks to
@nagsingh for reporting this issue.
The testsuite coveres all combinations of upper, lower, transpose and API formats.
AMD Internal: [CPUPL-1021]
Change-Id: I2a1d79eba1dcaf4217fd9c2c346bd6173b80a782
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
This library ported on Windows 10 using CMake scripts and Visual Studio 2019 with clang compiler
AMD internal:[CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]
Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for nowis the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c function, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
testsuite. The tests were failing because the computation (bli_gemv())
that performs the numerical check was not able to properly travserse
the matrix operands bx1 and b11 that are views into the micropanel of
B, which has duplicated/broadcast elements under the power9 subconfig.
(For example, a micropanel of B with duplication factor of 2 needs to
use a column stride of 2; previously, the column stride was being
interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
static functions in bli_obj_macro_defs.h. (Previously, only the
function bli_obj_set_strides() was defined. Amazing to think that we
got this far without these former functions.)
- Updated/expounded upon comments.
Details:
- Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
gemmtrsm) that only manifested once we began running with parameters
that mimic those of power9. The problem was rooted in the way those
modules were creating objects (and thus allocating memory) for the
micropanel operands to the microkernel being tested. Since power9
duplicates/broadcasts elements of B in memory, we needed an easy way
of asking for more than one storage element per logical element in
the matrix. I incorrectly expressed this as:
bli_obj_create( datatype, k, n, ldbp, 1, &bp );
The problem here is that bli_obj_create() is exceedingly efficient
at calculating the size it passes to malloc() and doesn't allocate a
full leading dimension's worth of elements for the last column (or
row, in this example). This would normally not bother anyone since
you're not supposed to access that memory anyway. But here, my
attempted "hack" for getting extra elements was insufficient, and
needed to be changed to:
bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );
That is, the extra elements needed to be baked into the dimensions of
the matrix object in order to have the intended effect on the number
of elements actually allocated. Thanks to Jeff Hammond for reporting
this bug.
- Fixed a typically harmless memory leak in the aforementioned test
modules (the objects for the packed micropanels were not being freed).
- Updated/expanded a common comment across all three ukr test modules.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.