Release Notes
Note: For some releases, credit for individuals' contributions is shown in parentheses.
Contents
- Changes in 0.8.1
- Changes in 0.8.0
- Changes in 0.7.0
- Changes in 0.6.1
- Changes in 0.6.0
- Changes in 0.5.2
- Changes in 0.5.1
- Changes in 0.5.0
- Changes in 0.4.1
- Changes in 0.4.0
- Changes in 0.3.2
- Changes in 0.3.1
- Changes in 0.3.0
- Changes in 0.2.2
- Changes in 0.2.1
- Changes in 0.2.0
- Changes in 0.1.8
- Changes in 0.1.7
- Changes in 0.1.6
- Changes in 0.1.5
- Changes in 0.1.4
- Changes in 0.1.3
- Changes in 0.1.2
- Changes in 0.1.1
- Changes in 0.1.0
- Changes in 0.0.9
- Changes in 0.0.8
- Changes in 0.0.7
- Changes in 0.0.6
- Changes in 0.0.5
- Changes in 0.0.4
- Changes in 0.0.3
- Changes in 0.0.2
- Changes in 0.0.1
Changes in 0.8.1
March 22, 2021
Improvements present in 0.8.1:
Framework:
- Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro `BLIS_NT_MAX_PRIME`, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro `BLIS_ENABLE_AUTO_PRIME_NUM_THREADS` in the appropriate configuration family's `bli_family_*.h`. (Jeff Diamond)
- Changed default value of `BLIS_THREAD_RATIO_M` from 2 to 1, which leads to slightly different automatic thread factorizations.
- Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
- Relocated the general stride handling for `gemmsup`. This fixed an issue whereby `gemm` would fail to trigger the conventional code path for cases that use general stride even after `gemmsup` rejected the problem. (RuQing Xu)
- Fixed an incorrect function signature (and prototype) of `bli_?gemmt()`. (RuQing Xu)
- Redefined `BLIS_NUM_ARCHS` to be part of the `arch_t` enum, which means it will be updated automatically when defining future subconfigs.
- Minor code consolidation in all level-3 `_front()` functions.
- Reorganized Windows cpp branch of `bli_pthreads.c`.
- Implemented `bli_pthread_self()` and `_equals()`, but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
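The prime-number thread reduction in the first Framework item can be sketched as follows. This is a minimal illustration, assuming that a prime request above the threshold is lowered by exactly one; the notes do not state the amount. `NT_MAX_PRIME`, `is_prime()`, and `reduce_prime_nt()` are hypothetical names, not BLIS symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-in for BLIS_NT_MAX_PRIME (documented default: 11). */
#define NT_MAX_PRIME 11

/* Trial-division primality test; adequate for realistic thread counts. */
bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Assumption: a prime count above the threshold is reduced by one, which
   gives an even number that factors into a 2D thread grid more usefully. */
long reduce_prime_nt(long nt)
{
    if (nt > NT_MAX_PRIME && is_prime(nt)) return nt - 1;
    return nt;
}
```

Under these assumptions a request for 13 threads would run with 12, while 11 (not above the threshold) and 12 (not prime) are left unchanged.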
Kernels:
- Added low-precision POWER10 `gemm` kernels via a `power10` sandbox. This sandbox also provides an API for implementations that use these kernels. See the `sandbox/power10/POWER10.md` document for more info. (Nicholai Tukanov)
- Added assembly `packm` kernels for the `haswell` kernel set and registered to `haswell`, `zen`, and `zen2` subconfigs accordingly. The `s`, `c`, and `z` kernels were modeled on the `d` kernel, which was contributed by AMD.
- Reduced KC in the `skx` subconfig from 384 to 256. (Tze Meng Low)
- Fixed bugs in two `haswell` `dgemmsup` kernels, which involved extraneous assembly instructions left over from when the kernels were first written. (Kiran Varaganti, Bhaskar Nallani)
- Minor updates to all of the `gemmtrsm` kernels to allow division by diagonal elements rather than scaling by pre-inverted elements. This change was applied to `haswell` and `penryn` kernel sets as well as reference kernels, 1m kernels, and the pre-broadcast B (bb) format kernels used by the `power9` subconfig. (Bhaskar Nallani)
- Fixed incorrect return type on `bli_diag_offset_with_trans()`. (Devin Matthews)
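To illustrate the difference between dividing by diagonal elements and scaling by pre-inverted ones, here is a small forward-substitution sketch. It is a plain reference loop, not a BLIS `gemmtrsm` kernel; the function name, the row-major layout, and the `preinverted` flag are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Solve L * x = b in place (x overwrites b) for an n-by-n lower-triangular
   matrix L stored row-major. If `preinverted` is true, the diagonal of `l`
   is assumed to already hold the reciprocals 1/l_ii, so the solve multiplies;
   otherwise it divides by the stored diagonal element. */
void trsv_lower_ref(size_t n, const double *l, double *b, bool preinverted)
{
    for (size_t i = 0; i < n; ++i) {
        double acc = b[i];
        for (size_t j = 0; j < i; ++j)
            acc -= l[i * n + j] * b[j];
        double d = l[i * n + i];
        b[i] = preinverted ? acc * d : acc / d;
    }
}
```

Multiplication is typically cheaper than division, but multiplying by a rounded reciprocal can give slightly different results than a true division, which is one plausible reason to make the behavior selectable.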
Build system:
- Output a pkgconfig file so that CMake users that use BLIS can find and incorporate BLIS build products. (Ajay Panyala)
- Fixed an issue in the configure script's kernel-to-config map that caused `skx` kernel flags to be used when compiling kernels from the `zen` kernel set. This issue wasn't really fixed, but rather tweaked in such a way that it happens to now work. A more proper fix would require a serious rethinking of the configuration system. (Devin Matthews)
- Fixed the shared library build rule in top-level Makefile. The previous rule was incorrectly only linking prerequisites that were newer than the target (`$?`) rather than correctly linking all prerequisites (`$^`). (Devin Matthews)
- Fixed `cc_vendor` for crosstool-ng toolchains. (Isuru Fernando)
- Allow disabling of `trsm` diagonal pre-inversion at compile time via `--disable-trsm-preinversion`.
Testing:
- Fixed obscure testsuite bug for the `gemmt` test module that relates to its dependency on `gemv`.
- Allow the `amaxv` testsuite module to run with a dimension of 0. (Meghana Vankadari)
Documentation:
- Documented auto-reduction for prime numbers of threads in `docs/Multithreading.md`.
- Fixed a missing `trans_t` argument in the API documentation for `her2k`/`syr2k` in `docs/BLISTypedAPI.md`. (RuQing Xu)
- Removed an extra call to `free()` in the level-1v typed API example code. (Ilknur Mustafazade)
Changes in 0.8.0
November 19, 2020
Improvements present in 0.8.0:
Framework:
- Implemented support for the level-3 operation `gemmt`, which performs a `gemm` on only the lower or only the upper triangle of a square matrix C. For now, only the conventional/large code path (and not the sup code path) is provided. This support also includes `gemmt` APIs in the BLAS and CBLAS compatibility layers. (AMD)
- Added a C++ template header, `blis.hh`, containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header, `cblas.hh`. These headers are installed only when running the `install` target with `INSTALL_HH` set to `yes`. (AMD)
- Disallow `randv`, `randm`, `randnv`, and `randnm` from producing vectors and matrices with 1-norms of zero.
- Changed the behavior of user-initialized `rntm_t` objects so that packing of A and B is disabled by default. (Kiran Varaganti)
- Transitioned to using `bool` keyword instead of the previous integer-based `bool_t` typedef. (RuQing Xu)
- Updated all inline function definitions to use the cpp macro `BLIS_INLINE` instead of the `static` keyword. (Giorgos Margaritis, Devin Matthews)
- Relocated `#include "cpuid.h"` directive from `bli_cpuid.h` to `bli_cpuid.c` so that applications can `#include` both `blis.h` and `cpuid.h`. (Bhaskar Nallani, Devin Matthews)
- Defined `xerbla_array_()` to complement the netlib routine `xerbla_array()`. (Isuru Fernando)
- Replaced the previously broken `ref99` sandbox with a simpler, functioning alternative. (Francisco Igual)
- Fixed a harmless bug whereby `herk` was calling `trmm`-related code for determining the blocksize of KC in the 4th loop.
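As a rough picture of what `gemmt` computes, the following naive loop updates only the lower triangle of a square matrix C with `beta*C + alpha*A*B`. It is an unoptimized sketch under assumed conventions (row-major storage, no transposition, lower triangle only); `gemmt_lower_ref()` is a hypothetical name and does not reflect the BLIS, BLAS, or CBLAS `gemmt` signatures.

```c
#include <stddef.h>

/* C := beta*C + alpha*A*B, restricted to entries with i >= j.
   A is n-by-k, B is k-by-n, C is n-by-n; all are row-major.
   Entries of C strictly above the diagonal are never read or written. */
void gemmt_lower_ref(size_t n, size_t k, double alpha, const double *a,
                     const double *b, double beta, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * n + j];
            c[i * n + j] = beta * c[i * n + j] + alpha * ab;
        }
    }
}
```

Skipping the other triangle saves roughly half of the arithmetic of a full `gemm` when the caller knows only one triangle of the result is needed.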
Kernels:
- Implemented a full set of `sgemmsup` assembly millikernels and microkernels for `haswell` kernel set.
- Implemented POWER10 `sgemm` and `dgemm` microkernels. (Nicholai Tukanov)
- Added two kernels (`dgemm` and `dpackm`) that employ ARM SVE vector extensions. (Guodong Xu)
- Implemented explicit beta = 0 handling in the `sgemm` microkernel in `bli_gemm_armv7a_int_d4x4.c`. This omission was causing testsuite failures in the new `gemmt` testsuite module for `cortexa15` builds given that the `gemmt` correctness check relies on `gemm` with beta = 0.
- Updated `void*` function arguments in reference `packm` kernels to use the native pointer type, and fixed a related dormant type bug in `bli_kernels_knl.h`.
- Fixed missing `restrict` qualifier in `sgemm` microkernel prototype for `knl` kernel set header.
- Added some missing n = 6 edge cases to `dgemmsup` kernels.
- Fixed an erroneously disabled edge case optimization in `gemmsup` variant code.
- Various bugfixes and cleanups to `dgemmsup` kernels.
Build system:
- Implemented runtime subconfiguration selection override via `BLIS_ARCH_TYPE`. (decandia50)
- Output the python found during `configure` into the `PYTHON` variable set in `build/config.mk`. (AMD)
- Added configure support for Intel oneAPI via the `CC` environment variable. (Ajay Panyala, Devin Matthews)
- Use `-O2` for all framework code, potentially avoiding intermittent issues with `f2c`'ed packed and banded code. (Devin Matthews)
- Tweaked `zen2` subconfiguration's cache blocksizes and registered full suite of `sgemm` and `dgemm` millikernels.
- Use the `-fomit-frame-pointer` compiler optimization option in the `haswell` and `skx` subconfigurations. (Jeff Diamond, Devin Matthews)
- Tweaked Makefiles in `test`, `test/3`, and `test/sup` so that running any of the usual targets without having first built BLIS results in a helpful error message.
- Add support for `--complex-return=[gnu|intel]` to `configure`, which allows the user to toggle between the GNU and Intel return value conventions for functions such as `cdotc`, `cdotu`, `zdotc`, and `zdotu`.
- Updates to `cortexa9`, `cortexa53` compilation flags. (Dave Love)
Testing:
- Added a `gemmt` module to the testsuite and a standalone test driver to the `test` directory, both of which exercise the new `gemmt` functionality. (AMD)
- Support creating matrices with small or large leading dimensions in `test/sup` test drivers.
- Support executing `test/sup` drivers with unpacked or packed matrices.
- Added optional `numactl` usage to `test/3/runme.sh`.
- Updated and/or consolidated octave scripts in `test/3` and `test/sup`.
- Increased `dotxaxpyf` testsuite thresholds to avoid false `MARGINAL` results during normal execution. (nagsingh)
Documentation:
- Added Epyc 7742 Zen2 ("Rome") performance results (single- and multithreaded) to `Performance.md` and `PerformanceSmall.md`. (Jeff Diamond)
- Documented `gemmt` APIs in `BLISObjectAPI.md` and `BLISTypedAPI.md`. (AMD)
- Documented commonly-used object mutator functions in `BLISObjectAPI.md`. (Jeff Diamond)
- Relocated the operation indices of `BLISObjectAPI.md` and `BLISTypedAPI.md` to appear immediately after their respective tables of contents. (Jeff Diamond)
- Added missing perl prerequisite to `BuildSystem.md`. (pkubaj, Dilyn Corner)
- Fixed missing `conjy` parameter in `BLISTypedAPI.md` documentation for `her2` and `syr2`. (Robert van de Geijn)
- Fixed incorrect link to `shiftd` in `BLISTypedAPI.md`. (Jeff Diamond)
- Mention example code at the top of `BLISObjectAPI.md` and `BLISTypedAPI.md`.
- Minor updates to `README.md`, `FAQ.md`, `Multithreading.md`, and `Sandboxes.md` documents.
Changes in 0.7.0
April 7, 2020
Improvements present in 0.7.0:
Framework:
- Implemented support for multithreading within the sup (skinny/small/unpacked) framework, which previously was single-threaded only. Note that this feature works harmoniously with the selective packing introduced into the sup framework in 0.6.1. (AMD)
- Renamed `bli_thread_obarrier()` and `bli_thread_obroadcast()` functions to drop the 'o', which was left over from when `thrcomm_t` objects tracked both "inner" and "outer" communicators.
- Fixed an obscure `int`-to-`packbuf_t` type conversion error that only affects certain C++ compilers (including g++) when compiling application code that includes the BLIS header file `blis.h`. (Ajay Panyala)
- Added a missing early `return` statement in `bli_thread_partition_2x2()`, which provides a slight optimization. (Kiran Varaganti)
Kernels:
- Fixed the semantics of the `bli_amaxv()` kernels ('s' and 'd') within the `zen` kernel set. Previously, the kernels (incorrectly) returned the index of the last element whose absolute value was largest (in the event there were multiple of equal value); now, they (correctly) return the index of the first of such elements. The kernels also now return the index of the first NaN, if one is encountered. (Mat Cross, Devin Matthews)
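The corrected `amaxv` semantics (first index among ties, first NaN wins) can be expressed as a short reference loop. This is only a sketch of the described behavior, not the `zen` kernel; `amaxv_ref()` is a hypothetical name, and returning 0 for an empty vector is an assumption.

```c
#include <math.h>
#include <stddef.h>

/* Return the index of the first element with the largest absolute value.
   If any NaN is encountered, return the index of the first NaN instead.
   For n == 0 this sketch returns 0 (an assumption, not documented here). */
size_t amaxv_ref(size_t n, const double *x)
{
    size_t best = 0;
    double best_abs = -1.0;
    for (size_t i = 0; i < n; ++i) {
        if (isnan(x[i])) return i;
        double ax = x[i] < 0.0 ? -x[i] : x[i];
        /* Strict '>' keeps the earliest index when values tie. */
        if (ax > best_abs) { best_abs = ax; best = i; }
    }
    return best;
}
```

Using `>=` in place of `>` would reproduce the old, incorrect behavior of returning the last of several equal maxima.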
Build system:
- Warn the user at configure-time when hardware auto-detection returns the `generic` subconfiguration since this is probably not what they were expecting. (Devin Matthews)
- Removed unnecessary sorting (and duplicate removal) on `LDFLAGS` in `common.mk`. (Isuru Fernando)
- Specify the full path to the location of the dynamic library on OSX so that other dynamic libraries that depend on BLIS know where to find the library. (Satish Balay, Jed Brown)
Testing:
- Updated and reorganized test drivers in `test/sup` so that they work for either single-threaded or multithreaded purposes. (AMD)
- Updated/optimized octave scripts in `test/sup` for use with octave 5.2.0.
- Minor updates/tweaks to `test/1m4m`.
Documentation:
- Updated existing single-threaded sup performance graphs with new data and added multithreaded sup graphs to `docs/PerformanceSmall.md`.
- Added mention of Gentoo support under the external packages section of the `README.md`.
- Tweaks to `docs/Multithreading.md` that clarify that setting any `BLIS_*_NT` variable to 1 will be considered manual specification for the purposes of determining whether to auto-factorize via `BLIS_NUM_THREADS`. (AMD)
Changes in 0.6.1
January 14, 2020
Improvements present in 0.6.1:
Framework:
- Added support for pre-broadcast when packing B. This causes elements of B to be repeated (broadcast) in the packed copy of B so that subsequent vector loads will result in the element already being pre-broadcast into the vector register.
- Added support for selective packing to `gemmsup` (controlled via environment variables and/or the `rntm_t` object). (AMD)
- Fixed a bug in `sdsdot_sub()` that redundantly added the "alpha" scalar and a separate bug in the order of typecasting intermediate products in `sdsdot_()`. (Simon Lukas Märtens, Devin Matthews)
- Fixed an obscure bug in `bli_acquire_mpart_mdim()`/`bli_acquire_mpart_ndim()`. (Minh Quan Ho)
- Fixed a subtle and complicated bug that only manifested via the BLAS test drivers in the `generic` subconfiguration, and possibly any other subconfiguration that did not register complex-domain `gemm` ukernels, or registered ONLY real-domain ukernels as row-preferential. (Dave Love)
- Always use `sumsqv` to compute `normfv` instead of the "dot product trick" that was previously employed for performance reasons. (Roman Yurchak, Devin Matthews, and Isuru Fernando)
- Fixed bug in `thrinfo_t` debugging/printing code.
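The preference for `sumsqv` over a plain dot product when computing a vector norm is easier to see with a scaled sum-of-squares loop. The sketch below follows the classic scaled accumulation used by LAPACK-style `lassq` routines; it is an assumption that BLIS's `sumsqv` works along these lines, and `sumsqv_ref()` is a hypothetical name.

```c
#include <stddef.h>

/* Accumulate the sum of squares of x in scaled form, maintaining
       (*scale)^2 * (*sumsq) == sum of x[i]^2
   so that no intermediate value overflows even when x[i]^2 would.
   The caller initializes *scale = 0.0 and *sumsq = 1.0; the norm is
   then (*scale) * sqrt(*sumsq). */
void sumsqv_ref(size_t n, const double *x, double *scale, double *sumsq)
{
    for (size_t i = 0; i < n; ++i) {
        double ax = x[i] < 0.0 ? -x[i] : x[i];
        if (ax == 0.0) continue;
        if (ax > *scale) {
            /* Rescale the running sum to the new, larger scale. */
            double r = *scale / ax;
            *sumsq = 1.0 + *sumsq * r * r;
            *scale = ax;
        } else {
            double r = ax / *scale;
            *sumsq += r * r;
        }
    }
}
```

A dot product `x^T x` squares each element directly, so it overflows to infinity or underflows to zero for vectors with very large or very small entries, even when the norm itself is representable.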
Kernels:
- Implemented and registered an optimized `dgemm` microkernel for the `power9` kernel set. (Nicholai Tukanov)
- Pacify a `restrict` warning in the `gemmtrsm4m1` reference ukernel. (Dave Love, Devin Matthews)
Build system:
- Fixed parsing in `vpu_count()` on some SkylakeX workstations. (Dave Love)
- Reimplemented `bli_cpuid_query()` for ARM to use `stdio`-based functions instead of `popen()`. (Dave Love)
- Use `-march=znver1` for clang on `zen2` subconfig.
- Updated `-march` flags for `sandybridge`, `haswell` subconfigurations to use newer syntax (e.g. `haswell` instead of `core-avx2` and `sandybridge` instead of `corei7-avx`).
- Correctly use `-qopenmp-simd` for reference kernels when compiling with icc. (Victor Eijkhout)
- Added `-march` support for select gcc version ranges where flag syntax changes or new flags are added. The ranges we identify are: versions older than 4.9.0; versions older than 6.1.0 (but newer than 4.9.0); versions older than 9.1.0 (but newer than 6.1.0).
- Use `-funsafe-math-optimizations` and `-ffp-contract=fast` for all reference kernels when using gcc or clang.
- Updated MC cache blocksizes used by `haswell` subconfig.
- Updated NC cache blocksizes used by `zen` subconfig.
- Fixed a typo in the context registration of the `cortexa53` subconfiguration in `bli_gks.c`. (Francisco Igual)
- Output a more informative error when the user manually targets a subconfiguration that configure places in the configuration blacklist. (Tze Meng Low)
- Set execute bits of shared library at install-time. (Adam J. Stewart)
- Added missing thread-related symbols for export to shared libraries. (Kyungmin Lee)
- Removed (finally) the `attic/windows` directory since we offer Windows DLL support via AppVeyor's build artifacts, and thus that directory was only likely confusing people.
Testing:
- Fixed latent testsuite microkernel module bug for `power9` subconfig. (Jeff Hammond)
- Added `test/1m4m` driver directory for test drivers related to the 1m paper.
- Added libxsmm support to `test/sup` drivers. (Robert van de Geijn)
- Updated `.travis.yml` and `do_sde.sh` to automatically accept SDE license and download SDE directly from Intel. (Devin Matthews, Jeff Hammond)
- Updated standalone test drivers to iterate backwards through the specified problem space. This often helps avoid the situation whereby the CPU doesn't immediately throttle up to its maximum clock frequency, which can produce strange discontinuities (sharply rising "cliffs") in performance graphs.
- Pacify an unused variable warning in `blastest/f2c/lread.c`. (Jeff Hammond)
- Various other minor fixes/tweaks to test drivers.
Documentation:
- Added libxsmm results to `docs/PerformanceSmall.md`.
- Added BLASFEO results to `docs/PerformanceSmall.md`.
- Added the page size and location of the performance drivers to `docs/Performance.md` and `docs/PerformanceSmall.md`. (Dave Love)
- Added notes to `docs/Multithreading.md` regarding the nuances of setting multithreading parameters the manual way vs. the automatic way. (Jérémie du Boisberranger)
- Added a section on reproduction to `docs/Performance.md` and `docs/PerformanceSmall.md`. (Dave Love)
- Documented Eigen `-march=native` hack in `docs/Performance.md` and `docs/PerformanceSmall.md`. (Sameer Agarwal)
- Inserted multithreading links and disclaimers to `BuildSystem.md`. (Jeff Diamond)
- Fixed typo in description for `bli_?axpy2v()` in `docs/BLISTypedAPI.md`. (Shmuel Levine)
- Added "How to Download BLIS" section to `README.md`. (Jeff Diamond)
- Various other minor documentation fixes.
Changes in 0.6.0
June 3, 2019
Improvements present in 0.6.0:
Framework:
- Implemented small/skinny/unpacked (sup) framework for accelerated level-3 performance when at least one matrix dimension is small (or very small). For now, only `dgemm` is optimized, and this new implementation currently only targets Intel Haswell through Coffee Lake, and AMD Zen-based Ryzen/Epyc. (The existing kernels should extend without significant modification to Zen2-based Ryzen/Epyc once they are available.) Also, multithreaded parallelism is not yet implemented, though application-level threading should be fine. (AMD)
- Changed function pointer usages of `void*` to new, typedef'ed type `void_fp`.
- Allow compile-time disabling of BLAS prototypes in BLIS, in case the application already has access to prototypes.
- In `bli_system.h`, define `_POSIX_C_SOURCE` to `200809L` if the macro is not already defined. This ensures that things such as pthreads are properly defined by an application that has `#include "blis.h"` but omits the definition of `_POSIX_C_SOURCE` from the command-line compiler options. (Christos Psarras)
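The move from `void*` to a dedicated function-pointer type matters because ISO C does not guarantee that a function pointer survives a round trip through `void*`, whereas any function pointer may be converted to another function-pointer type and back. The sketch below assumes `void_fp` has the shape `void (*)(void)`; the notes name the type but do not show its definition, and the helper names are hypothetical.

```c
/* Assumed definition; the release notes only give the name void_fp. */
typedef void (*void_fp)(void);

/* The concrete function type used in this example. */
typedef double (*binop_fp)(double, double);

double add_doubles(double a, double b) { return a + b; }

/* Store a typed function pointer in the generic function-pointer type. */
void_fp erase_binop(binop_fp f) { return (void_fp)f; }

/* Cast back to the true type before calling; calling through the wrong
   type would be undefined behavior. */
double call_binop(void_fp f, double a, double b)
{
    return ((binop_fp)f)(a, b);
}
```

A table of kernels with differing signatures can hold `void_fp` entries this way, as long as each entry is cast back to its real type at the call site.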
Kernels:
- None.
Build system:
- Updated the way configure and the top-level Makefile handle installation prefixes (`prefix`, `exec_prefix`, `libdir`, `includedir`, `sharedir`) to better conform with GNU conventions.
- Improved clang version detection. (Isuru Fernando)
- Use pthreads on MinGW and Cygwin. (Isuru Fernando)
Testing:
- Added Eigen support to test drivers in `test/3`.
- Fix inadvertently hidden `xerbla_()` in blastest drivers when building only shared libraries. (Isuru Fernando, M. Zhou)
Documentation:
- Added `docs/PerformanceSmall.md` to showcase new BLIS small/skinny `dgemm` performance on Kaby Lake and Epyc.
- Added Eigen results (3.3.90) to performance graphs showcased in `docs/Performance.md`.
- Added BLIS thread factorization info to `docs/Performance.md`.
Changes in 0.5.2
March 19, 2019
Improvements present in 0.5.2:
Framework:
- Added support for IC loop parallelism to the `trsm` operation.
- Implemented a pool-based small block allocator (sba) and a corresponding `configure` option (enabled by default), which minimizes the number of calls to `malloc()` and `free()` for the purposes of allocating small blocks (on the order of 100 bytes). These small blocks are used by internal data structures, and the repeated allocation and freeing of these structures could, perhaps, cause memory fragmentation issues in certain application circumstances. This was never reproduced and observed, however, and remains entirely theoretical. Still, the sba should be no slower, and perhaps a little faster, than repeatedly calling `malloc()` and `free()` for these internal data structures. Also, the sba was designed to be thread-safe. (AMD)
- Refined and extended the output enabled by `--enable-mem-tracing`, which allows a developer to follow memory allocation and release performed by BLIS.
- Initialize error messages at compile-time rather than at runtime. (Minh Quan Ho)
- Fixed a potential situation whereby the multithreading parameters in a `rntm_t` object that is passed into an expert interface are ignored.
- Prevent a redefinition of `ftnlen` in the `f2c_types.h` in blastest. (Jeff Diamond)
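The idea behind a pool-based small block allocator can be shown with a tiny fixed-size block cache. This is a simplified sketch and not the BLIS sba: all names and the `POOL_CAP` limit are hypothetical, and the locking that a thread-safe version would need is omitted for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAP 64  /* hypothetical maximum number of cached blocks */

typedef struct {
    size_t block_size;             /* every block in the pool has this size */
    size_t n_free;                 /* number of blocks currently cached */
    void  *free_blocks[POOL_CAP];  /* stack of released blocks */
} pool_t;

void pool_init(pool_t *p, size_t block_size)
{
    p->block_size = block_size;
    p->n_free = 0;
}

/* Reuse a cached block if one is available; otherwise call malloc(). */
void *pool_acquire(pool_t *p)
{
    if (p->n_free > 0) return p->free_blocks[--p->n_free];
    return malloc(p->block_size);
}

/* Cache the block for later reuse instead of freeing it, unless full. */
void pool_release(pool_t *p, void *block)
{
    if (p->n_free < POOL_CAP) p->free_blocks[p->n_free++] = block;
    else free(block);
}

/* Return all cached blocks to the system. */
void pool_finalize(pool_t *p)
{
    while (p->n_free > 0) free(p->free_blocks[--p->n_free]);
}
```

After a warm-up phase, repeated acquire/release cycles touch only the cached stack and never reach `malloc()` or `free()`, which is the effect the release note describes.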
Kernels:
- Adjusted the cache blocksizes in the `zen` sub-configuration for `float`, `scomplex`, and `dcomplex` datatypes. The previous values, taken directly from the `haswell` subconfig, were merely meant to be reasonable placeholders until more suitable values were determined, as had already taken place for the `double` datatype. (AMD)
- Rewrote reference kernels in terms of simplified indexing annotated by the `#pragma omp simd` directive, which a compiler can use to vectorize certain constant-bounded loops. The `#pragma` is disabled via a preprocessor macro layer if the compiler is found by `configure` to not support `-fopenmp-simd`. (Devin Matthews, Jeff Hammond)
Build system:
- Added symbol-export annotation macros to all of the function prototypes and global variable declarations for public symbols, and created a new `configure` option, `--export-shared=[public|all]`, that controls which symbols--only those that are meant to be public, or all symbols--are exported to the shared library. (Isuru Fernando)
- Standardized to using `-O3` in various subconfigs, and also `-funsafe-math-optimizations` for reference kernels. (Dave Love, Jeff Hammond)
- Disabled TBM, XOP, LWP instructions in all AMD subconfigs. (Devin Matthews)
- Fixed issues that prevented using BLIS on GNU Hurd. (M. Zhou)
- Relaxed python3 requirements to allow python 3.4 or later. Previously, python 3.5 or later was required if python3 was being used. (Dave Love)
- Added `thunderx2` sub-configuration. (Devangi Parikh)
- Added `power9` sub-configuration. For now, this subconfig only uses reference kernels. (Nicholai Tukanov)
- Fixed an issue with `configure` failing on OSes--including certain flavors of BSD--that contain a slash '/' character in the output of `uname -s`. (Isuru Fernando, M. Zhou)
Testing:
- Renamed `test/3m4m` directory to `test/3`.
- Lots of updates and improvements to Makefiles, shell scripts, and matlab scripts in `test/3`.
Documentation:
- Added a new `docs/Performance.md` document that showcases single-threaded, single-socket, and dual-socket performance results of `single`, `double`, `scomplex`, and `dcomplex` level-3 operations in BLIS, OpenBLAS, and MKL/ARMPL for Haswell, SkylakeX, ThunderX2, and Epyc hardware architectures. (Note: Other implementations such as Eigen and ATLAS may be added to these graphs in the future.)
- Updated `README.md` to include new language on external packages. (Dave Love)
- Updated `docs/Multithreading.md` to be more explicit about the fact that multithreading is disabled by default at configure-time, and the fact that BLIS will run single-threaded at runtime by default if no multithreaded specification is given. (M. Zhou)
Changes in 0.5.1
December 18, 2018
Improvements present in 0.5.1:
Framework:
- Added mixed-precision support to the 1m method implementation.
- Track internal scalar datatypes in the `obj_t` info bitfield. This allows slightly better handling of scalars during mixed-datatype `gemm` computation.
- Fixed a bug that allowed execution of 1m with mixed-precision `gemm`, despite such usage not yet being officially supported. (Devangi Parikh)
- Added missing internal calls to `bli_init_once()` in `bli_thread_set_num_threads()` and `bli_thread_set_ways()`. (Ali Emre Gülcü)
Kernels:
- Redefined `packm` kernels to handle edge cases and zero-filling, and updated their APIs accordingly. This was needed in order to fully support the use of non-default/non-reference packm kernels. (Devin Matthews)
Build system:
- Disallow explicit requests to use 64-bit integers in the BLAS API while simultaneously using 32-bit integers in the BLIS API. (Jeff Hammond, Devin Matthews)
- Fixed an msys2/Windows build failure. (Isuru Fernando, Costas Yamin)
- Fixed a MinGW build failure. (Isuru Fernando)
- Disabled `arm32`, `arm64` configuration families since we don't yet have logic to choose the correct context at runtime.
Testing:
- Make sure the testsuite fails for `NaN`, `Inf` in input operands. (Devin Matthews)
- Added `hemm` driver to `test/3m4m`.
- Minor updates to `test/mixeddt` drivers, matlab scripts.
- Added additional matlab plotting scripts to `test/3m4m`.
Documentation:
- Updated `docs/Multithreading.md` to include discussion of setting affinity via OpenMP.
- Updated `docs/Testsuite.md` to include discussion of mixed-datatype settings.
- Updated `docs/MixedDatatypes.md` to include a brief section on running the testsuite to exercise mixed-datatype functionality, and other minor updates.
- Fixed broken links in `docs/KernelsHowTo.md`. (Richard Goldschmidt)
- Spelling fixes in FAQ. (Rhys Ulerich)
- Updated 3-clause license comment blocks to refer generically to copyright holders rather than just the original copyright holder, UT-Austin.
Changes in 0.5.0
October 25, 2018
Improvements present in 0.5.0:
Framework:
- Implemented support for matrix operands of mixed datatypes (domains and precisions) within the `gemm` operation.
- Added configure-time option to use slab or round-robin partitioning within JR and IR loops of most level-3 operations' macrokernels.
- Allow parallelism in the JC loop for `trsm_l`, which previously was unnecessarily disabled. (Field Van Zee, Devangi Parikh)
- Added Fortran-77/90-compatible APIs for some thread-related functions. (Kay Dewhurst)
- Defined a new level-1d operation `shiftd`, which adds a scalar value to every element along an arbitrary diagonal of a matrix.
- Patched an issue (#267) that may arise when linking against OpenMP-configured BLIS from which parallelism is requested at runtime and a level-3 operation (e.g. `gemm`) is called from within an OpenMP parallel region of an application where OpenMP nested parallelism is disabled. (Devin Matthews)
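The `shiftd` operation described in the Framework list is simple enough to show as a reference loop. The sketch assumes row-major storage and the convention that a diagonal offset of 0 is the main diagonal, positive offsets lie above it, and negative offsets below it; `shiftd_ref()` is a hypothetical name and not the BLIS typed or object API.

```c
#include <stddef.h>

/* Add alpha to every element (i, j) of an m-by-n row-major matrix for
   which j - i == diagoff. Offsets that fall outside the matrix simply
   touch no elements. */
void shiftd_ref(long diagoff, size_t m, size_t n, double alpha, double *a)
{
    for (size_t i = 0; i < m; ++i) {
        long j = (long)i + diagoff;
        if (j >= 0 && (size_t)j < n)
            a[i * n + (size_t)j] += alpha;
    }
}
```

With `diagoff = 0` and a square matrix this reduces to the familiar `A + alpha*I` shift.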
Kernels:
- Imported SkylakeX `dgemm` microkernel from `skx-redux` branch, which contains optimizations (mostly better prefetching on C) over the previous implementation. (Devin Matthews)
- Renamed/relocated level-3 `zen` microkernels to the `haswell` kernel set. Please see a recent message to blis-devel for more information on this rename [1].
- BG/Q kernel fixes. (Ye Luo)
Build system:
- Added support for building Windows DLLs via AppVeyor [2], complete with a built-in implementation of pthreads for Windows, as well as an implementation of the `pthread_barrier_*()` APIs for use on OS X. (Isuru Fernando, Devin Matthews, Mathieu Poumeyrol, Matthew Honnibal)
- Defined a `cortexa53` sub-configuration, which is similar to `cortexa57` except that it uses slightly different compiler flags. (Mathieu Poumeyrol)
- Added python version checking to `configure` script.
- Added a script to automate the regeneration of the symbols list file (now located in `build/libblis-symbols.def`).
- Various tweaks in preparation for BLIS's inclusion within Debian. (M. Zhou)
- Various fixes and cleanups.
Testing:
- Added tests for `cortexa15` and `cortexa57` in Travis CI. (Mathieu Poumeyrol)
- Added tests for mixed-datatype `gemm` and the simulation of application-level threading (salt) in Travis CI.
- Add statistics-collecting `irun.py` script.
- Include various threading parameters in the initial comment block of testsuite output.
- Various fixes and cleanups.
Documentation:
- Added `MixedDatatypes.md` documentation for mixed-datatype `gemm`.
- Added example code demonstrating use of mixed-datatype `gemm` (object API only).
- Added description of `shiftd` to `BLISTypedAPI.md` and `BLISObjectAPI.md`.
- Added "Known issues" sections to `Multithreading.md` and `Sandboxes.md`.
- Updated `FAQ.md`.
- Various other documentation updates.
[1] https://groups.google.com/forum/?fromgroups#!topic/blis-devel/pytWRjIzxVY

[2] https://ci.appveyor.com/project/shpc/blis/
Changes in 0.4.1
August 30, 2018
Improvements present in 0.4.1:
Framework:
- Improved thread safety by homogenizing all critical sections to unconditionally use pthread mutexes. (AMD)
- Fixed `bli_finalize()`, which had become uncallable due to sharing `pthread_once_t` objects between the initialization and finalization steps. This manifested as a rather large memory leak (many megabytes) if/when the application manually finalized BLIS in the middle of its execution. (Devangi Parikh, Field Van Zee)
- Fixed a minor memory leak in the global kernel structure. (Devangi Parikh, Field Van Zee)
- Replaced extensive use of function "chooser" macros in object API functions with use of a new set of functions using the suffix `_qfp()` ("query function pointer"). These functions can be used to query function pointers for most families of typed functions.
- Fixed an obscure integer size bug due to improper use of integer literal constants with `va_arg()`. This oddly manifested as LP64 systems using the general stride output case of microkernels even when the output matrix storage matched that of the microkernel output preference. (Devangi Parikh, Field Van Zee)
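The class of bug behind the `va_arg()` fix is worth a small illustration. An unsuffixed integer literal such as `1` has type `int`, so passing it to a variadic function that reads a 64-bit integer with `va_arg()` is undefined behavior; on LP64 systems the upper bits of the value read may be garbage. The function below is a generic sketch with a hypothetical name, not the BLIS code that was fixed.

```c
#include <stdarg.h>
#include <stdint.h>

/* Sum `count` variadic arguments, each of which is read as int64_t.
   Every caller must therefore pass values whose type is exactly
   int64_t (after default argument promotions). */
int64_t sum_i64(int count, ...)
{
    va_list ap;
    int64_t total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, int64_t);
    va_end(ap);
    return total;
}
```

A call such as `sum_i64(2, 1, 2)` passes two `int` values and is therefore incorrect, whereas `sum_i64(2, (int64_t)1, (int64_t)2)` matches the type that `va_arg()` expects.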
Kernels:
- Fixed compilation of `armv7a` kernels. (Mathieu Poumeyrol)
Build system:
- Generate makefile fragments within the `obj` directory rather than in `config`, `kernels`, `ref_kernels`, and `frame`. This allows a user to perform an out-of-tree build even if the BLIS source distribution is read-only. (Devin Matthews)
- Allow a dependent sub-project such as example code or the testsuite to compile and link against an installation of BLIS rather than implicitly searching for a local (uninstalled) copy. (Victor Eijkhout, Field Van Zee)
- Fixed a link error that manifested after building only a shared library (e.g. `--disable-static`) and then trying to build a dependent sub-project such as example code or the testsuite. (Sajid Ali)
- Changed `test` make target of top-level `Makefile` to behave more like `check` by printing a color-coded characterization of the test results.
- Fixed the `-p` option to `configure`, which had likely been broken since May 7, 2018. The `--prefix` option was unaffected. (Dave Love)
- Running `configure` no longer requires a C++ compiler given that a C++ compiler was only ever envisioned for optional use in the sandbox. (Devangi Parikh, Field Van Zee)
Testing:
- Added the ability to "simulate" multiple application-level threads in the testsuite by executing the individual experiments with multiple threads. This should make it easier to test for thread-safety in the future. (AMD)
- Removed borderline useless wall clock time from test drivers' output.
Documentation:
- Updated typed and object API documents to include language on rntm_t parameters in the expert interfaces.
- Updates to README.md, including language on sandboxes.
- Added table of make targets to BuildSystem.md.
- Added missing language to ConfigurationHowTo.md on updating the architecture string array in bli_arch.c. (Devangi Parikh, Field Van Zee)
Changes in 0.4.0
July 27, 2018
Framework:
- Added support for "sandboxes" for employing alternative gemm implementations. A ready-to-use reference C99 sandbox provides developers with a starting point for experimentation.
- Separated expert, non-expert typed APIs (levels 1v, 1d, 1f, 1m, 2, and 3, and utility functions).
- Defined new rntm_t structure and API to provide a uniform way of storing user-level threading information (equivalent of BLIS_NUM_THREADS and BLIS_*_NT environment variables), and also conveying that information to expert APIs. (Matthew Honnibal, Nathaniel Smith)
- Renamed various obj_t accessor macros, converted to static functions, and inserted explicit typecasting to facilitate #including blis.h from a C++ application. (Jacob Gorm Hansen)
- Cache and reuse arch_t architecture query result at runtime. (Devin Matthews)
- Implemented object-based functions bli_projm()/_projv(), which project objects from one domain to another (within the same precision), and bli_castm()/_castv(), which typecast objects from one datatype to another.
- Implemented object-based functions bli_setrm()/_setrv(), bli_setim()/_setiv(), which allow the caller to broadcast a scalar to all real elements or all imaginary elements within an object.
- Enforce consistent datatypes in most object APIs.
- For native execution, initialize a context's virtual microkernel slots to the function pointers of native microkernels. This simplifies query routines and paves the way for more generalized use of virtual microkernels beyond those for induced methods.
- Various bugfixes. (Devangi Parikh)
Kernels:
- Re-expressed x86_64 microkernels in terms of assembly language macros, which support lower- and upper-case, AT&T and Intel syntax. (Devin Matthews)
- Various bugfixes. (Robin Christ, Francisco Igual, Devangi Parikh, qnerd)
Build system:
- Added support for --libdir, --includedir configure options. (Nico Schlömer)
- Adopted Linux-like shared library versioning and enabled building shared libraries by default.
- Improved shared library handling on OS X. (Alex Arslan)
- Added configure support for preset CFLAGS, LDFLAGS. (Dave Love)
- Improvements to version file handling.
- Implemented configure option hack for circumventing small/limited values of ARG_MAX.
- Reorganized cc, cc_vendor detection responsibilities from Makefile to configure. (Alex Arslan)
- Cross-compilation fixes.
- Preliminary Windows ABI support using clang, appveyor. (Isuru Fernando)
- Better support for typical development environment on OpenBSD, FreeBSD. (Alex Arslan)
- Bumped shared library soname version number to 1.0.0.
- Various build system fixes and cleanups. (Mathieu Poumeyrol, Nico Schlömer, Tony Skjellum)
Testing:
- Rewrote Travis CI testing config file and supporting logic to use Intel's SDE emulator. This allows multiple x86_64 microarchitectures to be tested regardless of what hardware Travis happens to be using at the time. (Devin Matthews)
- Added docs/studies hardware-specific test driver directory to track individual performance studies. (Devangi Parikh)
- Streamlined testsuite/input.operations file format.
Documentation:
- Relocated all wiki documents to a docs directory and adjusted all links, and README.md, accordingly.
- Added a CONTRIBUTING.md file to top-level directory.
- Added docs/CodingConventions.md.
- Added docs/Sandboxes.md.
- Added docs/BLISObjectAPI.md.
- Renamed and updated docs/BLISTypedAPI.md.
- Updated docs/KernelsHowTo.md.
- Updated docs/BuildSystem.md. (Stefanos Mavros)
- Updated docs/Multithreading.md.
- Updated indentation in docs/ConfigurationHowTo.md for easier reading.
- Added example code for the BLIS typed API in examples/tapi.
- Expanded existing example code for the object API in examples/oapi.
- Added links to RHEL/Fedora and Debian packages to README.md.
- Various cleanups. (Tony Skjellum, Dave Love, Nico Schlömer)
Changes in 0.3.2
April 28, 2018
- Added setijm, getijm operations for updating and querying individual matrix elements via the object API.
- Added examples/oapi directory containing a code-based tutorial on using the object-based API in BLIS.
- Track separate reference kernel CFLAGS for each sub-configuration.
- Added support for blacklisting sub-configurations based on the assembler/binutils.
- Added 64-bit support to BLAS test drivers.
- Various bugfixes.
Changes in 0.3.1
April 4, 2018
- Enable use of new zen kernels in haswell sub-configuration.
- Added row-storage optimizations to zen dotxf kernels (now also used by haswell).
- Integrated an f2c'ed version of the BLAS test drivers from netlib LAPACK into BLIS build system (e.g. make testblas, make checkblas). See the Testsuite document for more info. Also scheduled these BLAS drivers to execute regularly via Travis CI.
- Added a new make check target that executes a fast version of the BLIS testsuite as well as the BLAS test drivers (primarily targeting package maintainers).
- Allow individual operation overriding in the BLIS testsuite. (This makes it easy to quickly test one or two operations of interest.)
- Added build system support for libmemkind. If present, hbw_malloc() is used as the default value for BLIS_MALLOC_POOL instead of malloc(). It can be disabled via --disable-memkind.
- Tweaks and fixes to BLAS compatibility layer, courtesy of the new BLAS test drivers.
- Output the active sub-configuration in testsuite output header.
- Allow arbitrary nesting of "umbrella" configuration families in config_registry, allowing us to define x86_64 in terms of amd64 and intel64.
- Added skx and knl to intel64 (and by proxy, x86_64) configuration families.
- Implemented basic support for ARM hardware detection (via /proc/cpuinfo).
- Various bugfixes.
Changes in 0.3.0
February 23, 2018
This version contains significant improvements from 0.2.2. Major changes include:
- Real and complex domain (s,d,c,z) assembly-based gemm microkernels for AMD's Zen microarchitecture. (AMD, Field Van Zee)
- Real domain (s,d) assembly-based gemmtrsm_l and gemmtrsm_u microkernels for Zen. (AMD, Field Van Zee)
- Real domain (s,d) intrinsics-based amaxv, axpyv, dotv, dotxv, scalv, axpyf, and dotxf kernels for Zen. (AMD, Field Van Zee)
- Generalized the configuration system to allow multi-configuration builds targeting configuration "families". A single sub-configuration is chosen at runtime via some heuristic, such as querying CPUID (e.g. runtime hardware detection). This change was extensive and required a reorganization of the build system, configuration semantics, reference kernels, a new naming scheme for native kernels, and a rewrite of the global kernel structure (gks). Please see the rewritten Configuration Guide for details.
- Implemented runtime hardware detection for x86_64 hardware.
- Reimplemented configure-time hardware detection in terms of new runtime hardware detection code, which queries for CPU features rather than individual models.
- Implemented library self-initialization by rewriting bli_init() in terms of pthread_once() and inserting invocations to bli_init() in key places throughout BLIS. The expectation is that through normal use of any BLIS API (BLAS, typed BLIS, or object-based BLIS), the user no longer needs to explicitly initialize the library, and that bli_finalize() should never be called by the user unless he is absolutely sure he no longer needs BLIS functionality. Related to this: global scalar constants (BLIS_ONE, BLIS_ZERO, etc.) are now statically initialized and thus ready to use immediately. Collectively, these changes provide improved thread safety at the application level.
- Compile with and install a single monolithic (flattened) blis.h header to (1) speed up compilation and (2) reduce the number of build product files.
- Added a sub-API for setting multithreading environment variables at runtime. For a few examples, please see the Multithreading guide.
- Reimplemented OpenMP/pthread barriers in terms of GNU atomic built-ins.
- Other small changes and fixes.
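The self-initialization pattern described in this release (an init routine built on pthread_once() and invoked from API entry points) can be sketched as follows. This is a minimal illustration with hypothetical names, not the BLIS implementation; it only shows that the setup body runs once no matter how many API calls are made. Depending on the platform, linking may require -pthread.

```c
#include <pthread.h>

static pthread_once_t lib_once_sketch = PTHREAD_ONCE_INIT;
static int lib_init_count_sketch = 0;   /* counts how often setup ran */

/* One-time setup body; pthread_once() guarantees it runs exactly once. */
static void lib_init_body_sketch(void)
{
    ++lib_init_count_sketch;
}

/* Safe to call from any thread, any number of times. */
void lib_init_sketch(void)
{
    pthread_once(&lib_once_sketch, lib_init_body_sketch);
}

/* A hypothetical API entry point that self-initializes before working. */
int lib_scale_sketch(int x)
{
    lib_init_sketch();
    return 2 * x;
}

/* Reports how many times the setup body has executed. */
int lib_init_count(void)
{
    return lib_init_count_sketch;
}
```

With this arrangement the caller never has to call the init routine explicitly, which matches the expectation stated above for normal use of the library.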
Changes in 0.2.2
May 2, 2017
- Implemented the 1m method for inducing complex matrix multiplication. (Please see ACM TOMS publication "Implementing high-performance complex matrix multiplication via the 1m method" for more details.)
- Switched to simpler trsm_r implementation.
- Relaxed constraints that MC % NR = 0 and NC % MR = 0, as this was only needed for the more sophisticated trsm_r implementation.
- Automatic loop thread assignment. (Devin Matthews)
- Updates to .travis.yml configuration file. (Devin Matthews)
- Updates to non-default haswell microkernels.
- Match storage format of the temporary micro-tiles in macrokernels to that of the microkernel storage preference for edge cases.
- Added support for Intel's Knight's Landing. (Devin Matthews)
- Added more flexible options to specify multithreading via the configure script. (Devin Matthews)
- OS X compatibility fixes. (Devin Matthews)
- Other small changes and fixes.
Also, thanks to Elmar Peise, Krzysztof Drewniak, and Francisco Igual for their contributions in reporting/fixing certain bugs that were addressed in this version.
Changes in 0.2.1
October 5, 2016
- Implemented distributed thrinfo_t structure management. (Ricardo Magana)
- Redesigned BLIS's level-3 algorithmic control tree structure. (suggested by Tyler Smith)
- Consolidated gemm, herk, and trmm blocked variants into one set of three bidirectional variants.
- Integrated a new "memory broker" (membrk_t) abstraction in place of the previous memory allocator, which allows one set of pools per broker (or, in other words, per memory space). (Ricardo Magana)
- Reorganized multithreading APIs, including more consistent namespace prefixes: bli_thrinfo_*(), bli_thrcomm_*(), etc.
- Added randnm, randnv operations, which produce random powers of two in a narrow range, and integrated a corresponding option into the testsuite. (suggested by AMD)
- Reclassified amaxv as a level-1v operation and kernel.
- Added complex gemm microkernels for haswell, which have register allocations consistent with the existing 6x16 sgemm and 6x8 dgemm microkernels.
- Adjusted existing microkernels to work properly when BLIS is configured to use 32-bit integers. (Devin Matthews)
- Relaxed alignment constraints in sandybridge and haswell microkernels. (Devin Matthews)
- Define CBLAS API with f77_int instead of int, which means the BLAS compatibility integer size is inherited by the CBLAS compatibility layer. (Devin Matthews)
- Added an alignment switch to the testsuite to globally enable/disable starting address and leading dimension alignment. (suggested by Devin Matthews)
- Various enhancements to configure script. (Devin Matthews)
- Avoid compiling BLAS/CBLAS compatibility layer when it is disabled via configure. (suggested by Devin Matthews)
- Disabled compilation of object-based blocked partitioning code for level-2 operations, as it was already functionally disabled.
- Fixes and tweaks to POSIX thread support. (Tyler Smith, Jeff Hammond)
- Other small changes and fixes.
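The idea behind the randnm and randnv operations listed above, generating random powers of two in a narrow range, can be sketched as below. The exponent range [0, 6], the use of rand(), and all names are assumptions for illustration and are not taken from BLIS. A plausible motivation is that such values multiply and add with little or no rounding error, which makes test residuals easier to interpret.

```c
#include <stdlib.h>

/* Returns +/- 2^(-e) with e drawn from [0, 6]. The exponent range and the
   use of rand() are assumptions for illustration, not BLIS's choices. */
double rand_pow2_sketch(void)
{
    int    e = rand() % 7;                 /* exponent in 0..6 */
    double v = 1.0;

    for (int i = 0; i < e; ++i)
        v *= 0.5;                          /* exact: halving a power of two */

    return (rand() % 2) ? v : -v;          /* random sign */
}

/* Checks that |v| is a power of two in [2^-6, 1] by doubling it at most
   six times and testing whether it lands exactly on 1.0. */
int is_narrow_pow2_sketch(double v)
{
    double a = v < 0.0 ? -v : v;

    for (int i = 0; i < 6 && a < 1.0; ++i)
        a *= 2.0;

    return a == 1.0;
}
```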
Changes in 0.2.0
April 11, 2016
Most of BLIS 0.2.0's changes are contained within a single commit, 537a1f4 (aka "the big commit"). An executive summary of the most consequential of these changes follows:
- BLIS has been retrofitted with a new data structure, known as a "context," affecting virtually every internal API for every computational operation, as well as many supporting, non-computational functions that must access information within the context.
- In addition to appearing within these internal APIs, the context--specifically, a pointer to a cntx_t--is now present within all user-level datatype-aware APIs, e.g. bli_zgemm(), appearing as the last argument.
- User-level object APIs, e.g. bli_gemm(), were unaffected and continue to be "context-free." However, these APIs were duplicated so that corresponding "context-aware" APIs now also exist, differentiated with an _ex suffix (for "expert").
- Contexts are initialized very soon after a computational function is called (if one was not passed in by the caller) and are passed all the way down the function stack, even into the kernels, and thus allow the code at any level to query information about the runtime instantiation of the current operation being executed, such as kernel addresses, microkernel storage preferences, and cache/register blocksizes.
- Contexts are thread-friendly. For example, consider the situation where a developer wishes two or more threads to execute simultaneously with somewhat different runtime parameters. Contexts also inherently promote thread-safety, such as in the event that the original source of the information stored in the context changes at run-time (see next two bullets).
- BLIS now consolidates virtually all kernel/hardware information in a new "global kernel structure" (gks) API. This new API will allow the caller to initialize a context in a thread-safe manner according to the currently active kernel configuration. For now, the currently active configuration cannot be changed once the library is built. However, in the future, this API will be expanded to allow run-time management of kernels and related parameters.
- The most obvious application of this new infrastructure is the run-time detection of hardware (and the implied selection of appropriate kernels). With contexts, kernels may even be "hot swapped" within the gks, and once execution begins on a level-3 operation, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If a different application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel info was loaded into its context before computation began, and also because the blocks it checked out from the memory pools will be unaffected by the newer threads' reinitialization of the allocator.
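The relationship between a context and the global kernel structure described in the bullets above can be illustrated with a small sketch. The struct layout, field names, default values, and function names here are hypothetical stand-ins rather than the actual cntx_t or gks API. The sketch shows only the snapshot behavior: a context initialized before a swap keeps its original blocksizes.

```c
#include <stddef.h>

/* Hypothetical stand-ins for cntx_t and the gks; field names are assumed. */
typedef struct
{
    size_t mr, nr;       /* register blocksizes */
    size_t mc, kc, nc;   /* cache blocksizes */
    int    row_pref;     /* nonzero if the microkernel prefers row storage */
} cntx_sketch_t;

/* The "global kernel structure": the currently active configuration.
   The numbers are placeholders, not tuned values. */
static cntx_sketch_t gks_sketch = { 6, 8, 72, 256, 4080, 1 };

/* Snapshot the active configuration into a caller-owned context.
   (A real implementation would synchronize this copy; the sketch does not.) */
void cntx_init_sketch(cntx_sketch_t *cntx)
{
    *cntx = gks_sketch;
}

/* Simulates swapping in kernels with different register blocksizes. */
void gks_swap_sketch(size_t mr, size_t nr)
{
    gks_sketch.mr = mr;
    gks_sketch.nr = nr;
}

/* Code at any depth queries the context, never the global state. */
size_t cntx_mr_sketch(const cntx_sketch_t *cntx) { return cntx->mr; }
size_t cntx_nr_sketch(const cntx_sketch_t *cntx) { return cntx->nr; }
```

Because each operation reads only from its own context, an operation that started before the swap finishes with the values it began with, which is the deterministic behavior the last bullet describes.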
This version contains other changes that were committed prior to 537a1f4:
- Inline assembly FMA4 microkernels for AMD bulldozer. (Etienne Sauvage)
- A more feature-rich configure script and build system. Certain long-style options are now accepted, including convenient command-line switches for things like enabling debugging symbols. Important definitions were also consolidated into a new makefile fragment, common.mk, which can be included by the BLIS build system as well as quasi-independent build systems, such as the BLIS test suite. (Devin Matthews)
- Updated and improved armv8 microkernels. (Francisco Igual)
- Define bli_clock() in terms of clock_gettime() instead of gettimeofday(), which has been languishing on my to-do list for years, literally. (Devin Matthews)
- Minor but extensive modifications to parts of the BLAS compatibility layer to avoid potential namespace conflicts with external user code when blis.h is included. (Devin Matthews)
- Fixed a missing BLIS integer type definition (BLIS_BLAS2BLIS_INT_TYPE_SIZE) when CBLAS was enabled. Thanks to Tony Kelman for reporting this bug.
- Merged packm_blk_var2() into packm_blk_var1(). The former's functionality is used by induced methods for complex level-3 operations. (Field Van Zee)
- Subtle changes to treatment of row and column strides in bli_obj.c that pertain to somewhat unusual use cases, in an effort to support certain situations that arise in the context of tensor computations. (Devin Matthews)
- Fixed an unimplemented beta == 0 case in the penryn (formerly "dunnington") sgemm microkernel. (Field Van Zee)
- Enhancements to the internal memory allocator in anticipation of the context retrofit. (Field Van Zee)
- Implemented so-called "quadratic" matrix partitioning for thread-level parallelism, whereby threads compute thread index ranges to produce partitions of roughly equal area (and thus computation), subject to the (register) blocksize multiple, even when given a structured rectangular subpartition with an arbitrary diagonal offset. Thanks to Devangi Parikh for reporting bugs related to this feature. (Field Van Zee)
- Enabled use of Travis CI for automatic testing of github commits and pull requests. (Xianyi Zhang)
- New README.md, written in github markdown. (Field Van Zee)
- Many other minor bug fixes.
Special thanks go to Lee Killough for suggesting the use of a "context" data structure in discussions that transpired years ago, during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.
Changes in 0.1.8
July 29, 2015
This release contains only two commits, but they are non-trivial: we now have configuration support for AMD Excavator (Carrizo) and microkernels for Intel Haswell/Broadwell.
Changes in 0.1.7
June 19, 2015
- Replaced the static memory allocator used to manage internal packing buffers with one that dynamically allocates memory, on-demand, and then recycles the allocated blocks in a software cache, or "pool". This significantly simplifies the memory-related configuration parameter set, and it completely eliminates the need to specify a maximum number of threads.
- Implemented default values for all macro constants previously found in bli_config.h. The default values are now set in frame/include/bli_config_macro_defs.h. Any value #defined in bli_config.h will override these defaults.
- Initial support for configure-time detection of hardware. By specifying the auto configuration at configure-time, the configure script chooses a configuration for you. If an optimized configuration does not exist, the reference implementation serves as a fallback.
- Completely reorganized implementations for complex induced methods and added support for new algorithms.
- Added optimized microkernels for AMD Piledriver family of hardware.
- Several bugfixes to multithreaded execution.
- Various other minor tweaks, code reorganizations, and bugfixes.
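The pooled allocator described in the first bullet of this release (allocate on demand, then recycle blocks in a software cache) can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names and a fixed cache capacity chosen arbitrarily; the actual BLIS allocator is not reproduced here.

```c
#include <stdlib.h>

#define POOL_CAP_SKETCH 8   /* max cached blocks; an arbitrary choice */

typedef struct
{
    size_t block_size;                 /* all blocks have this size */
    void  *cached[POOL_CAP_SKETCH];    /* recycled blocks, used as a stack */
    int    n_cached;
} pool_sketch_t;

void pool_init_sketch(pool_sketch_t *p, size_t block_size)
{
    p->block_size = block_size;
    p->n_cached   = 0;
}

/* Reuse a cached block if one exists; otherwise allocate on demand. */
void *pool_checkout_sketch(pool_sketch_t *p)
{
    if (p->n_cached > 0)
        return p->cached[--p->n_cached];
    return malloc(p->block_size);
}

/* Return a block to the cache, or free it if the cache is full. */
void pool_checkin_sketch(pool_sketch_t *p, void *block)
{
    if (p->n_cached < POOL_CAP_SKETCH)
        p->cached[p->n_cached++] = block;
    else
        free(block);
}

/* Release every cached block, e.g. at finalization. */
void pool_finalize_sketch(pool_sketch_t *p)
{
    while (p->n_cached > 0)
        free(p->cached[--p->n_cached]);
}
```

Since blocks are created only when a checkout finds the cache empty, nothing in this design needs to know the thread count in advance; a multithreaded version would additionally guard checkout and checkin with a lock.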
Changes in 0.1.6
October 23, 2014
- New complex domain AVX microkernels are now available and used by default by the sandybridge configuration.
- Added new high-level 4m and 3m implementations presently known as "4mh" and "3mh".
- Cleaned up 4m/3m front-end layering and added routines to enable, disable, and query which implementation will be called for a given level-3 operation. The test suite now prints this information in its pre-test summary. 4m (not 4mh) is still the default when complex microkernels are not present.
- Consolidated control tree code and usage so that all level-3 multiplication operations use the same gemm_t structure, leaving only trsm to have a custom tree structure and associated code.
- Re-implemented micropanel alignment, which was removed in commit c2b2ab6 earlier this year.
- Relaxed the long-standing constraint that KC be a multiple of MR and NR by allowing the developer to specify target values and then adjusting them up to the next multiple of MR or NR, as needed by the affected operations (hemm, symm, trmm, trsm).
- Added a new "row preference" flag that the developer can use to signal to the framework that a microkernel prefers to output micro-tiles of C that are row-stored (rather than column-stored). Column storage preference is still the default.
- Changed semantics of blocksize extensions to instead be "maximum" blocksizes (and thus emphasizing the "extended" values rather than the difference).
- Various other minor tweaks, code reorganizations, and bugfixes.
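The KC adjustment mentioned in this release (taking a target value and rounding it up to the next multiple of MR or NR) amounts to a small integer computation. The sketch below uses hypothetical function names, and the flag that selects MR versus NR is an assumption about how an operation might express its requirement.

```c
#include <stddef.h>

/* Rounds target up to the next multiple of mult (mult must be nonzero). */
size_t round_up_sketch(size_t target, size_t mult)
{
    size_t rem = target % mult;
    return rem == 0 ? target : target + (mult - rem);
}

/* Adjusts a target KC for an operation that needs KC to be a multiple of
   MR (use_mr != 0) or of NR (use_mr == 0). */
size_t kc_adjust_sketch(size_t kc_target, size_t mr, size_t nr, int use_mr)
{
    return round_up_sketch(kc_target, use_mr ? mr : nr);
}
```

For example, a target of 250 becomes 252 when MR is 6 and 256 when NR is 8, while a target that is already a multiple is left unchanged.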
Thanks go to those whose contributions, feedback, and bug reports led to these improvements--in particular, Tony Kelman, Kevin Locke, Devin Matthews, Tyler Smith, and perhaps others whose feedback I've lost track of.
Changes in 0.1.5
August 4, 2014
- Added a CBLAS compatibility layer, which can be enabled at configure-time via BLIS_ENABLE_CBLAS in bli_config.h. Enabling the CBLAS layer implicitly forces the BLAS compatibility layer to also be enabled. Once enabled, the application may access CBLAS prototypes via blis.h or cblas.h.
- Fixed a packing bug for cases when MR or NR (or both) are 1.
- Redefined bit field macros in bli_type_defs.h with bitshift operator to ease future rearranging, expanding, or adding of info bits.
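The benefit of defining bit field macros with the bitshift operator can be shown with a small sketch. The field names, widths, and positions below are hypothetical and do not reflect the actual layout in bli_type_defs.h. Because every mask is derived from a named shift, moving or widening a field means editing one constant rather than recomputing hexadecimal masks by hand.

```c
#include <stdint.h>

/* Hypothetical layout: field positions are named once, as shifts, and
   every mask is derived from its shift and width. */
#define DOMAIN_SHIFT_SKETCH  0
#define PREC_SHIFT_SKETCH    1
#define TRANS_SHIFT_SKETCH   3

#define DOMAIN_BITS_SKETCH   ( UINT32_C(0x1) << DOMAIN_SHIFT_SKETCH )
#define PREC_BITS_SKETCH     ( UINT32_C(0x3) << PREC_SHIFT_SKETCH )
#define TRANS_BITS_SKETCH    ( UINT32_C(0x1) << TRANS_SHIFT_SKETCH )

/* Stores a 2-bit precision code into info, leaving other fields intact. */
uint32_t info_set_prec_sketch(uint32_t info, uint32_t prec)
{
    return (info & ~PREC_BITS_SKETCH)
         | ((prec << PREC_SHIFT_SKETCH) & PREC_BITS_SKETCH);
}

/* Extracts the 2-bit precision code from info. */
uint32_t info_get_prec_sketch(uint32_t info)
{
    return (info & PREC_BITS_SKETCH) >> PREC_SHIFT_SKETCH;
}
```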
Changes in 0.1.4
July 27, 2014
- Added shared library support to build system.
- Preliminary parallelization of trsm (Tyler Smith).
- Added generic _void() microkernel wrappers so that users (or developers) can call the microkernel without knowing the implementation/developer-specific function names, which are specified at configure-time.
- Added bli_info_*() API for querying general information about BLIS, including blocksizes.
- Reimplemented initialization/finalization for thread safety.
- Fixed a possible Inf/NaN issue in several level-3 operations when beta is zero.
- Minor fixes to BLAS compatibility layer.
- Added initial support for Emscripten (Marat Dukhan).
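The Inf/NaN issue with a zero beta noted in the list above is a common hazard in BLAS-like updates of the form C := beta * C + alpha * A * B. The notes do not describe the actual fix, so the sketch below shows only the usual remedy, under the assumption that the problem was of this kind: when beta is zero, C is overwritten instead of scaled, because 0 * NaN and 0 * Inf are both NaN. The function name is hypothetical.

```c
/* Computes c := beta * c + alpha * ab for n elements. When beta is zero,
   c is overwritten rather than scaled, so stale Inf/NaN values in c
   (for example, from uninitialized memory) cannot leak into the result. */
void scal_update_sketch(int n, double alpha, const double *ab,
                        double beta, double *c)
{
    for (int i = 0; i < n; ++i)
    {
        if (beta == 0.0)
            c[i] = alpha * ab[i];              /* ignore old c entirely */
        else
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```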
Changes in 0.1.3
June 23, 2014
This is a relatively minor release. The changes can be summarized as:
- Added experimental support for PNaCL (Marat Dukhan).
- Fixed aligned memory allocation on Windows (Tony Kelman).
- Fixed missing version string in build products when downloading tarballs/zip files (Field Van Zee). Thanks to Victor Eijkhout for pointing out this bug.
Changes in 0.1.2
June 2, 2014
Tyler has been hard at work developing and refining extensions to BLIS that provide multithreading support (currently via OpenMP, though POSIX threads may be supported in the future). These extensions enable multithreading within all level-3 operations except for trsm. We are pleased to announce that these code changes are now part of BLIS.
Changes in 0.1.1
February 25, 2014
I. I am excited to announce that BLIS now provides high-performance complex domain support to ALL level-3 operations when ONLY the same-precision real domain equivalent gemm microkernel is present and optimized. In other words, BLIS's productivity lever just got twice as strong: optimize the dgemm microkernel, and you will get double-precision complex versions of all level-3 operations, for free. Same for sgemm microkernel and single-precision complex.
II. We also now offer complex domain support based on the 3m method, but this support is ONLY accessible via separate interfaces. This separation is a safety feature, since the 3m method's numerical properties are inherently less robust. Furthermore, we think the 3m method, as implemented, is somewhat performance-limited on systems with L1 caches that have less than 8-way associativity.
We plan on writing a paper on (I) and (II), so if you are curious how exactly we accomplish this, please be patient and wait for the paper. :)
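Until that paper is available, the scalar identities that such methods build on can at least be illustrated. The sketch below is not the BLIS implementation, which operates on matrices and real-domain gemm microkernels; it only contrasts a four-multiplication complex product with the three-multiplication (3m-style) variant. That (I) corresponds to the four-multiplication form is an assumption based on the "4m" naming that appears in the 0.1.6 notes. All names are hypothetical.

```c
/* A complex number stored as separate real and imaginary parts. */
typedef struct { double re, im; } cplx_sketch_t;

/* Four real multiplications: (a+bi)(c+di) = (ac - bd) + (ad + bc)i. */
cplx_sketch_t mul_4m_sketch(cplx_sketch_t x, cplx_sketch_t y)
{
    cplx_sketch_t r;
    r.re = x.re * y.re - x.im * y.im;
    r.im = x.re * y.im + x.im * y.re;
    return r;
}

/* Three real multiplications: the imaginary part is recovered as
   (a+b)(c+d) - ac - bd, at the cost of extra additions and possible
   cancellation error. */
cplx_sketch_t mul_3m_sketch(cplx_sketch_t x, cplx_sketch_t y)
{
    double t1 = x.re * y.re;
    double t2 = x.im * y.im;
    double t3 = (x.re + x.im) * (y.re + y.im);
    cplx_sketch_t r;
    r.re = t1 - t2;
    r.im = t3 - t1 - t2;
    return r;
}
```

The subtraction t3 - t1 - t2 is where the three-multiplication form can lose accuracy, which is consistent with the caution above about its numerical properties.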
III. The third, user-oriented change facilitates a much more developer-friendly configuration system. This "change" actually represents a family of smaller changes. What follows is a list of those changes taken from the git log:
- We now have standard names for reference kernels (levels-1v, -1f and 3) in the form of macro constants. Examples: BLIS_SAXPYV_KERNEL_REF, BLIS_DDOTXF_KERNEL_REF, BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel with a common base name; [sdcz] datatype flavors of each kernel or microkernel (level-1v, -1f, or 3) may now be named independently. This means you can now, if you wish, encode the datatype-specific register blocksizes in the name of the microkernel functions.
- Any datatype instance of any kernel (1v, 1f, or 3) that is left undefined in bli_kernel.h will default to the corresponding reference implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined, it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple datatype chars to match the number of types the kernel WOULD take in a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with your level-1v/-1f kernels. The framework now provides a _kernel() function, as in bli_axpyv_kernel(), which serves as the obj_t wrapper for whatever kernels are specified (or defaulted to) via bli_kernel.h.
- Developers no longer need to prototype their kernels, and thus no longer need to include any prototyping headers from within bli_kernel.h. The framework now generates kernel prototypes, with the proper type signature, based on the kernel names defined (or defaulted to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm microkernel is left undefined by bli_kernel.h, but its same-precision real domain equivalent IS defined, BLIS will enable the automatic complex domain feature described above in (I) for the datatype x implementations of all level-3 operations, using only the corresponding real domain gemm microkernel. If the complex gemm microkernel for x IS defined, then all complex level-3 operations will be defined in terms of that microkernel.
The net effect of (III) is that your bli_kernel.h files can be MUCH simpler and less cluttered. (Extreme example: the reference configuration's bli_kernel.h is now completely empty!) I have updated all configurations and kernels that are currently part of BLIS by stripping out unnecessary/outdated definitions and migrating existing definitions to their new names. (If you ever need to reference the complete list of options and macros, please refer to the bli_kernel.h inside the template configuration.) Please set aside some time to test and, if necessary, tweak the configurations which you originally developed and submitted. I may have broken some of them. If so, please accept my apologies and contact me for assistance. I will work with you to get them functional again.
The changes mentioned in (I), (II), and (III), along with all other changes since 0.1.0, are included in BLIS 0.1.1 (fde5f1fd).
I know these changes may be a little disruptive to some, but I think that most developers will find the new complex functionality very useful, and the new configuration system much easier to use.
Changes in 0.1.0
November 9, 2013
- Added sgemm microkernel for dunnington.
- Added dgemm microkernels and configurations for sandybridge, bgq, mic, power7, piledriver, loongson3a, which were used to gather performance data in our second ACM TOMS paper. Many thanks to Francisco Igual, Tyler Smith, Mike Kistler, and Xianyi Zhang for developing, testing, and contributing these kernels.
- Migrated to signed integer for dim_t, inc_t (to facilitate calling BLIS from Fortran).
- Added "template" configuration and kernel set for developers to use as a starting point when developing new kernels from scratch.
- Improvements to test suite, including section overrides and standalone level-1f/level-3 kernel modules.
- Improvements to Windows build system (though it may still not yet be functional out-of-the-box). Thanks to Martin Schatz for his help here.
- Removed support for element "duplication" in level-3 macrokernels.
- Several bug fixes to BLAS compatibility layer. Thanks to Vladimir Sukharev for his numerous bug reports wrt the LAPACK test suite.
- Various other minor bugfixes.
Changes in 0.0.9
July 18, 2013
- A few algorithmic optimizations and bug fixes to trmm and trsm.
- Parameter checking in the compatibility layer that mimics netlib BLAS.
- Default use of stdint.h types (int64_t, uint64_t by default).
- Optional (and very much untested) C99 built-in complex type/arithmetic support.
Note that bli_config.h has changed since 0.0.8. Added configuration macros are:
#define BLIS_ENABLE_C99_COMPLEX
#define BLIS_ENABLE_BLAS2BLIS_INT64
#define PASTEF770(name) // ...
The first macro enables C99 built-in complex types. The second causes a Fortran integer to be defined as an int64_t (rather than int32_t). The third is a macro to name-mangle a full routine name for Fortran (ie: add an underscore) and should be obtained from config/reference/bli_config.h.
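For readers unfamiliar with Fortran name mangling, the effect of the second and third macros can be sketched as follows. The macro body shown is an assumption (a single appended underscore, as the description above suggests) and the typedef and function names are hypothetical; the authoritative definitions are the ones in config/reference/bli_config.h.

```c
#include <stdint.h>

/* Assumed definition: append one underscore, a convention used by many
   Fortran compilers. The real definition should be taken from
   config/reference/bli_config.h. */
#define PASTEF770_SKETCH(name) name ## _

/* Sketch of how an integer-size switch could select the Fortran integer
   type; the typedef name is hypothetical. */
#ifdef BLIS_ENABLE_BLAS2BLIS_INT64
typedef int64_t f77_int_sketch;
#else
typedef int32_t f77_int_sketch;
#endif

/* Expands to a function named answer_sketch_, which is the symbol a
   trailing-underscore Fortran compiler would emit for answer_sketch. */
f77_int_sketch PASTEF770_SKETCH(answer_sketch)(void)
{
    return 42;
}
```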
Changes in 0.0.8
June 12, 2013
This version includes several kernel optimizations and bug fixes.
While neither bli_config.h nor bli_kernel.h has changed formats since 0.0.7, make_defs.mk has changed, so please update your copy of this file when you git-pull. Specifically, we now define a new CFLAGS_KERNELS variable that allows one to use different compiler flags when compiling kernels. It works like this: At compile time, make will use CFLAGS_KERNELS to compile any source code that resides in any directory that begins with the name kernels. My recommendation is to simply apply this naming convention to the symbolic link to your kernels directory that resides in your configuration directory. Thanks to Tyler for suggesting this change.
Changes in 0.0.7
April 30, 2013
This version incorporates many small fixes and feature enhancements made during our SC13 collaboration.
Changes in 0.0.6
April 13, 2013
Several changes regarding memory alignment were made since 0.0.5, including modifications to bli_config.h. Also, this update fixes a few bugs.
Changes in 0.0.5
March 24, 2013
The most obvious change in this version is the migration to the bli function (and source code filename) prefix, from the old bl2 prefix, as well as a rename of the main BLIS header (blis2.h -> blis.h). The test suite seems to indicate that the change was successful.
A few other much more minor changes were made, one pertaining to a renamed constant in the _config.h file.
Changes in 0.0.4
March 15, 2013
The changes included in 0.0.4 mostly relate to the contiguous (static) memory allocator. The previous implementation was intended as a temporary solution that would work for benchmarking purposes, until enough other priorities had been tended to that I could go back and do it right.
I began with the assumption that the benefit of packing matrices into contiguous memory is non-negligible and worth the effort. Furthermore, we assume that:
- the only portable way to acquire contiguous memory is to reserve a region of static memory and manage it ourselves;
- the cache blocksizes used for one level-3 operation will be the same as those used for another level-3 operation, since all of them boil down to some form of matrix-matrix multiplication;
- only three types of contiguous memory will ever be needed (for level-3 operations): a block of matrix A, a panel of matrix B, or a panel of matrix C--and the last case is not commonly used;
- when a block or panel is to be acquired from the allocator, the caller knows which of the three types of memory is needed.
Given these assumptions, I was able to come up with an implementation that is simple, easy to understand, and thread-safe (provided you add OpenMP directives to protect the critical sections, which are clearly marked with comments). It can also both allocate and release in O(1) time. And of course, page-alignment is taken care of behind the scenes. So while it is not a generalized solution by any means, I think it will work very well for our purposes.
Also, note that based on the level of the overall matrix multiplication algorithm at which you parallelize, the minimum number of blocks/panels of each type of contiguous memory will vary. For example, if you want all of your threads to work on different iterations of a single rank-k update (via block-panel multiply), the threads share the packed panel of B, but each one needs memory to hold its own packed block of A. Thus, the memory allocator needs to be initialized so that it contains enough memory for at least one panel of B and at least t blocks of A, where t is the number of threads. All of this can be adjusted at configure-time in bl2_config.h.
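The allocator described in this section can be approximated with a statically reserved region and a stack of free block indices, which gives O(1) acquire and release. The sketch below is a simplified illustration under several assumptions: it manages only one type of memory (the text describes three), it omits page alignment, it has no critical-section protection, and all names and sizes are hypothetical.

```c
#include <stddef.h>

#define N_BLOCKS_SKETCH    4      /* e.g. one block of A per thread, t = 4 */
#define BLOCK_BYTES_SKETCH 4096   /* arbitrary block size for the sketch */

/* Statically reserved region plus a stack of free block indices. */
static unsigned char region_sketch[N_BLOCKS_SKETCH][BLOCK_BYTES_SKETCH];
static int free_stack_sketch[N_BLOCKS_SKETCH];
static int n_free_sketch = -1;    /* -1 means "not yet initialized" */

static void static_pool_init_sketch(void)
{
    for (int i = 0; i < N_BLOCKS_SKETCH; ++i)
        free_stack_sketch[i] = i;
    n_free_sketch = N_BLOCKS_SKETCH;
}

/* O(1) acquire: pop an index. Returns NULL when all blocks are in use.
   A multithreaded caller would wrap this in a critical section. */
void *static_pool_acquire_sketch(void)
{
    if (n_free_sketch < 0) static_pool_init_sketch();
    if (n_free_sketch == 0) return NULL;
    return region_sketch[free_stack_sketch[--n_free_sketch]];
}

/* O(1) release: recover the block index from the address and push it. */
void static_pool_release_sketch(void *block)
{
    ptrdiff_t idx = ((unsigned char *)block - &region_sketch[0][0])
                  / BLOCK_BYTES_SKETCH;
    free_stack_sketch[n_free_sketch++] = (int)idx;
}
```

With N_BLOCKS_SKETCH set to the number of threads t, each thread can hold its own packed block of A at the same time, and any further acquire fails until a block is released, which mirrors the sizing requirement discussed above.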
Changes in 0.0.3
February 22, 2013
The biggest change in this version is that the BLAS-to-BLIS compatibility layer is now available. Virtually every BLAS interface is included, even those corresponding to functionality that BLIS does not implement (such as banded and packed level-2 operations). If the application code attempts to call one of these unimplemented routines, the code aborts with a generic not-yet-implemented error message.
The compatibility layer is enabled via a configuration option in bl2_config.h. For now, it is enabled by default (provided you have an up-to-date copy of bl2_config.h).
Changes in 0.0.2
February 11, 2013
Most notably, this version contains the new test suite I've been working on for the last month.
What is the test suite? It is a highly configurable test driver that allows one to test an arbitrary set of BLIS operations, with an arbitrary set of parameter combinations, and matrix/vector storage formats, as well as whichever datatypes you are interested in. (For now, only homogeneous datatyping is supported, which is what most people want.) You can also specify an arbitrary problem size range with arbitrary increments, and arbitrary ratios between dimensions (or anchor a dimension to a single value), and you can output directly to files which store the output in matlab syntax, which makes it easy to generate performance graphs.
BLIS developers: note that 0.0.2 makes small changes to the configuration files. This new version also contains many bug fixes. (Most of these fixes address bugs which were found using the test suite.)
Changes in 0.0.1
December 10, 2012
- Added auto-detection of string version (via git).
- Wrote basic INSTALL, CHANGELOG, AUTHORS, and CREDITS files.
- Updates to standalone test directory Makefile.
- Added initial build system.
- Various code reorganizations.