Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertantly not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
Details:
- Added the pack_t schema argument to the knl packm kernel functions.
This change was intended for inclusion in 31c8657. (Thank you SDE +
Travis CI.)
Updated copyright information for kernels/zen/bli_trsm_small.c file
Removed separate kernels for zen2 architecture
Instead added threshold conditions in zen kernels both for ROME and NAPLES
Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5
Details:
- Fixed an obscure but in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
that only affected the beta == 0, column-storage output case. Thanks
to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
k = 0, when the correct action would be to scale by beta (and then
return). Thanks to the BLAS test drivers to catching this bug.
- Changed the sup threshold behavior such that the sup implementation
only kicks in if a matrix dimension is strictly less than (rather than
less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in
ref_kernels/bli_cntx_ref.c. This, combined with the above change to
threshold testing means that calls to BLIS or BLAS with one or more
matrix dimensions of zero will no longer trigger the sup
implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future
use, perhaps).
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
Details:
- Renamed
kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c
to
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c.
This follows the naming convention used by other kernel sets, most
notably haswell.
Details:
- SYRK for small matrix was implemented by reusing small GEMM routine. This was
resulting in output written to the full C matrix, and C being symmetric the
lower and upper triangles of C matrix contained same results. BLAS SYRK API
spec demands either lower or upper triangle of C matrix to be written with
results. So, this was resulting in BLAS test failures, even though testsuite
of BLIS was passing small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
implemented for small SYRK for both single and double precision. The newly
added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
Now the intermediate results of matrix C are written to a scratch buffer.
Final results are written from scratch buffer to matrix C using SIMD
copy to either lower or upper traingle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.
Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:
void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);
where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
specifically written to overcome the floating-point latency for FMA
instructions on Intel Haswell-like architectures, which can issue two
FMA instructions per cycle. These ukernels happen to work fine on AMD
Zen-based architectures. However, Zen only issues one FMA per cycle,
which, while halving its floating-point throughput, gives it extra
flexibility in the design of its microkernels--namely, mr and nr can
be smaller and still overcome the floating-point latency for those
single-issue cores. A smaller value of mr and nr allows for a larger
value of kc, which may be useful in some situations. In the future,
we may write such Zen-specific microkernels to take advantage of this
additional flexibility.
Details:
- Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux
branch, along with appropriate blocksizes in bli_cntx_init_skx.c and
a prototype in bli_kernels_skx.h. (Devin has not yet written the
sgemm analague, so for now we will continue using the older sgemm
ukernel.)
- Updated frame/include/bli_x86_asm_macros.h with a minor change that
was present within the skx-redux branch.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
Details:
- Updated older _ft kernel type suffixes used within penryn level-1v
and -1f kernels to use the newer _ker_ft suffix that was introduced
in 0175483. (Thank you Travis CI.)
Details:
- Previously, most object API functions (_oapi.c) used a function
chooser macro that would expand out to an if-elseif-elseif-else
conditional that used a num_t datatype to call the appropriate
type-specific API (_tapi.c). This always felt a little hackish, and
would get in the way somewhat of addig support for new num_t datatypes
in the future. So, I've replaced that functionality with code that
queries a function pointer that is then typecast appropriately. This
model of function calling was already pervasive for kernels queried
from the cntx_t structure. It was also already in use in various other
functions, such as macrokernels, and this commit simply extends that
pattern.
- The above change required many new files, mostly header files, that
define the function types (mostly _ft.h) for the queriable functions
as well as some source files to define the function pointer arrays and
their corresponding query functions (_fpa.c). Various other function
types, mostly for kernel function types, were renamed to reduce the
potential for confusion with the function types for expert and basic
(non-expert) typed API functions.
- Removed definitions for all of the "bli_call_ft_*()" function chooser
macros from bli_misc_macro_defs.h.
Details:
- Fixed stale calls to dscalv() from the dotxf and dotxaxpyf penryn
kernels that were not updated during the basic/expert API separation
in e88aeda.
* Add appveyor file
* Build script
* Remove fPIC for now
* copy as
* set CC and CXX
* Change the order of immintrin.h
* Fix testsuite header
* Move testsuite defs to .c
* Fix appveyor file
* Remove fPIC again and fix strerror_r missing bug
* Remove appveyor script
* cd to blis directory
* Fix sleep implementation
* Add f2c_types_win.h
* Fix f2c compilation
* Remove rdp and rename appveyor.yml
* Remove setenv declaration in test header
* set CPICFLAGS to empty
* Fix another immintrin.h issue
* Escape CFLAGS and LDFLAGS
* Fix more ?mmintrin.h issues
* Build x86_64 in appveyor
* override LIBM LIBPTHREAD AR AS
* override pthreads in configure
* Move windows definitions to bli_winsys.h
* Fix LIBPTHREAD default value
* Build intel64 in appveyor for now
Details:
- Changed the void* arguments of the following static functions:
bli_is_aligned_to()
bli_is_unaligned_to()
bli_offset_past_alignment()
to siz_t, and the return type of bli_offset_past_alignment() from
guint_t to siz_t. This allows for more versatile usage of these
functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
but also in kernels/bgq, to include explicit typecasts to siz_t when
pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
#211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
various kernel license headers, likely from a malformed recursive sed
performed long ago.
Details:
- Inserted missing safeguards into most microkernels to ensure that the
integers read by the microkernel's assembly instructions are of the
appropriate size. In many cases, this bug was going undetected likely
because the compiler was inserting zero padding before the integers
in the calling function, allowing the assembly code to read 64-bits
in a way that did not corrupt the "lower" 32 integer bits with garbage
in the higher bits. Thanks to Francisco Igual and Devangi Parikh for
finding this issue.