Details:
- Implemented the "beta == 0" case for general stride output for the
dunnington sgemm micro-kernel. This case had been, up until now,
identical to the "beta != 0" case, which does not work when the
output matrix contains NaNs or Infs. It had manifested as NaN residuals
in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
Matthews for reporting this bug.
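Illustration (a sketch, not the dunnington kernel itself): the scalar
loop below shows why "beta == 0" needs its own branch for general stride
output. Scaling the old contents of C by zero still propagates NaN and
Inf, so the beta == 0 path must overwrite C without reading it. The
function name update_c_gs and its argument layout are hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not BLIS code): store an m x n micro-tile result
   "ab" into C, where C has general row and column strides.
   Computes c := beta * c + ab. */
static void update_c_gs(size_t m, size_t n, float beta,
                        const float *ab,        /* m x n, column-major */
                        float *c, size_t rs_c, size_t cs_c)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float *cij = c + i * rs_c + j * cs_c;

            if (beta == 0.0f) {
                /* Overwrite: the old value of C is never read, so any
                   NaN or Inf already stored there cannot leak through. */
                *cij = ab[i + j * m];
            } else {
                /* General case: the old value of C takes part in the
                   arithmetic. With beta == 0 this would compute
                   0 * NaN, which is NaN. */
                *cij = beta * (*cij) + ab[i + j * m];
            }
        }
    }
}
```

Calling this with beta set to 0 on a C that holds NaN or Inf entries
leaves C equal to ab, which is the behavior the fix restores.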
Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
and FMA instructions. (Complex support is currently provided by the
default induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite from -1.0 to
0.9.
Details:
- Added new micro-kernels for the AMD piledriver architecture (one
for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
acknowledging the influence of the corresponding kernel code in
OpenBLAS.
Details:
- We actually need to check alignment of lda*sizeof(double) and NOT
a+lda because in the latter case, alignment could cancel out and
still allow the optimized code to run when it shouldn't. Thanks
to Devin for pointing this out.
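Illustration (a sketch under assumed values, not the actual kernel
code): the two predicates below contrast checking the address a+lda with
checking the byte size lda*sizeof(double). The 16-byte alignment and the
function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 16u   /* assumed vector alignment; illustrative only */

static bool is_aligned(uintptr_t value)
{
    return (value % ALIGN_BYTES) == 0;
}

/* Checks only the address of the second column. If a is off by 8 bytes
   and the leading dimension is also off by 8 bytes, the two offsets
   cancel and this test passes even though a itself is unaligned. */
static bool second_column_aligned(const double *a, size_t lda)
{
    return is_aligned((uintptr_t)(a + lda));
}

/* Checks the byte size of the leading dimension on its own, so the
   result does not depend on where a happens to start. */
static bool lda_bytes_aligned(size_t lda)
{
    return is_aligned((uintptr_t)(lda * sizeof(double)));
}
```

For example, with a starting 8 bytes past a 16-byte boundary and
lda = 1, second_column_aligned() reports true while lda_bytes_aligned()
reports false, which is the cancellation described above.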
Details:
- The bugfix in a68b316c was inadvertently checking alignment of the
leading dimension itself, rather than the byte size of the leading
dimension. Now, we simply check alignment of a+lda.
Details:
- Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels
were attempting to compute problems with unaligned leading dimensions
with optimized code, rather than (correctly) using the reference
implementations. Thanks to Devin Matthews for reporting this bug.
Details:
- Minor updates to bli_config and bli_kernel.h for sandybridge
configuration.
- Renamed existing AVX intrinsic-based micro-kernel file to
bli_gemm_int_d8x4.c.
- Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based
gemm micro-kernels for single- and double-precision real.
Details:
- Reverted some changes that were unintentionally included in the
previous commit (9526ce98). Thanks to Tony Kelman for pointing
this out. (Note: a few select changes were not reverted.)
Details:
- Reverted two symlinks, in kernels/power7/3/test, back to being symlinks
after recursive-sed.sh mistakenly replaced them with copies of the
actual files to which they referred. Meant to include this in previous
commit.
Details:
- Updated copyright headers to include "at Austin" in the name of the
University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
2012).
Details:
- Fixed a bug in the dunnington/core2 gemm micro-kernels that resulted in
a segmentation fault if a column-stored matrix's starting address was
aligned, but its leading dimension was such that its second column was
unaligned. Basically, the micro-kernel was assuming that aligned load
instructions were safe when they actually were not. An extra condition
that checks the alignment of cs_c (i.e., the leading dimension in the
column storage case) has now been added. Thanks to Michael Lehn for
reporting this bug.
Details:
- Fixed improper usage of restrict keyword in axpyv and dotv bgq kernels.
(However, there may be other instances of similar misuse elsewhere in
BLIS.) Thanks to Jeff Hammond for reporting this issue.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
macro constants. Examples:
BLIS_SAXPYV_KERNEL_REF
BLIS_DDOTXF_KERNEL_REF
BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
with a common base name; [sdcz] datatype flavors of each kernel or
micro-kernel (level-1v, -1f, or 3) may now be named independently.
This means you can now, if you wish, encode the datatype-specific
register blocksizes in the name of the micro-kernel functions.
- Any datatype instance of any kernel (1v, 1f, or 3) that is left
undefined in bli_kernel.h will default to the corresponding reference
implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
datatype chars to match the number of types the kernel WOULD take in
a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
your level-1v/-1f kernels. The framework now provides a _kernel()
function which serves as the obj_t wrapper for whatever kernels are
specified (or defaulted to) via bli_kernel.h.
- Developers no longer need to prototype their kernels, and thus no
longer need to include any prototyping headers from within
bli_kernel.h. The framework now generates kernel prototypes, with the
proper type signature, based on the kernel names defined (or defaulted
to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
kernel is left undefined by bli_kernel.h, but its same-precision real
domain equivalent IS defined, BLIS will use a 4m-based implementation
for the datatype x implementations of all level-3 operations, using
only the real gemm micro-kernel.
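Illustration (a sketch of the defaulting idea, not the framework's
actual headers): the preprocessor pattern below shows how a kernel macro
left undefined by a configuration could fall back to its _REF
counterpart. The macro names BLIS_SGEMM_UKERNEL, BLIS_DGEMM_UKERNEL and
their _REF forms follow the naming described above; the function names
they expand to here are hypothetical stand-ins.

```c
/* Hypothetical stand-ins for real kernels; each returns a label so the
   selection can be observed. */
static const char *my_dgemm_opt_8x4(void) { return "optimized dgemm"; }
static const char *ref_sgemm(void)        { return "reference sgemm"; }
static const char *ref_dgemm(void)        { return "reference dgemm"; }

/* Reference kernel names as macro constants (the values are assumed). */
#define BLIS_SGEMM_UKERNEL_REF ref_sgemm
#define BLIS_DGEMM_UKERNEL_REF ref_dgemm

/* What a configuration's bli_kernel.h might contain: only the dgemm
   micro-kernel is named, and the name encodes its register blocksizes. */
#define BLIS_DGEMM_UKERNEL my_dgemm_opt_8x4

/* Framework side: anything left undefined defaults to the reference. */
#ifndef BLIS_SGEMM_UKERNEL
#define BLIS_SGEMM_UKERNEL BLIS_SGEMM_UKERNEL_REF
#endif
#ifndef BLIS_DGEMM_UKERNEL
#define BLIS_DGEMM_UKERNEL BLIS_DGEMM_UKERNEL_REF
#endif
```

With these definitions, BLIS_DGEMM_UKERNEL resolves to the optimized
function while BLIS_SGEMM_UKERNEL resolves to the reference one, without
the configuration mentioning sgemm at all.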
Details:
- Added the ability to induce complex domain level-3 operations via new
virtual complex micro-kernels which are implemented via only real
domain micro-kernels. Two new implementations are provided: 4m and 3m.
4m implements complex matrix multiplication in terms of four real
matrix multiplications, whereas 3m uses only three and thus is
capable of even higher (than peak) performance. However, the 3m method
has somewhat weaker numerical properties, making it less desirable
in general.
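Illustration (scalar analogue only): the functions below form a complex
product with four real multiplications and with three. In the matrix
setting each real product would be a real matrix multiplication; whether
the 3m implementation arranges its terms exactly this way is an
assumption. The type cplx and both function names are hypothetical.

```c
/* Plain struct rather than <complex.h>, to keep each real product
   explicit. */
typedef struct { double re, im; } cplx;

/* Four real multiplications, two additions/subtractions. */
static cplx mul_4m(cplx a, cplx b)
{
    cplx c;
    c.re = a.re * b.re - a.im * b.im;
    c.im = a.re * b.im + a.im * b.re;
    return c;
}

/* Three real multiplications, five additions/subtractions. */
static cplx mul_3m(cplx a, cplx b)
{
    double p1 = a.re * b.re;
    double p2 = a.im * b.im;
    double p3 = (a.re + a.im) * (b.re + b.im);

    cplx c;
    c.re = p1 - p2;
    /* The imaginary part is recovered by subtraction, which is one
       place where cancellation error can enter. */
    c.im = p3 - p1 - p2;
    return c;
}
```

Both functions agree in exact arithmetic; in floating point the
three-product form can lose accuracy in the imaginary part when p3 is
close to p1 + p2.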
- Further refined packing routines, which were recently revamped, and
added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.