- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.
Details:
- Updated Makefiles in test, test/3, and test/sup so that running any of
the usual targets without having first built BLIS results in a helpful
error message. For example, if BLIS is not yet configured, make will
output:
Makefile:327: *** Cannot proceed: config.mk not detected! Run
configure first. Stop.
Similarly, if BLIS is configured but not yet built, make will output:
Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
make first. Stop.
In previous commits, these actions would result in a rather cryptic
make error such as:
make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
needed by 'blis-nat-st'. Stop.
Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
building matrices with small or large leading dimensions, and updated
runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
environment variables), and for capturing output to files that encode
both the leading dimension (small or large) and packing status into
the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
into test/sup/octave and updated the octave code in that consolidated
directory to read the new output filename format (encoding ldim and
packing). Also added comments and streamlined code, particularly in
plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
different problem size ranges for the drivers where one, two, or three
matrix dimensions is large. This will facilitate the generation of
more meaningful graphs, particularly when two dimensions are tiny.
Details:
- Fixed an off-by-one bug in the output of matlab row indices in
test/sup/test_gemm.c that only manifested when the problem size
increment was equal to 1.
- Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage
combinations for blissup drivers in test/sup. This helps make the
building of drivers complete sooner.
- Trivial changes to test/sup/runme.sh.
Details:
- Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is
needed because libxsmm currently only optimizes gemm operations where
alpha is unit (and beta is unit or zero).
- Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its
fallback library. This is the library that will be called the
problem dimensions are deemed too large, or any other criteria for
optimization are not met. (This was done not because it is realistic,
but rather so that it would be very clear when libxsmm ceased handling
gemm calls internally when the data are graphed.)
Details:
- Switch the driver source in test/sup so that libxsmm_?gemm() is called
instead of ?gemm_() when compiling for / linking against libxsmm.
libxsmm's documentation isn't clear on whether it is even *trying* to
provide BLAS API compatibility, and I got tired of trying to figure it
out.
- Added missing -ldl in LDFLAGS when linking against libxsmm.
Details:
- Modified test/sup/Makefile to build drivers that test the performance
of skinny/small problems via libxsmm.
- Modified test/sup/runme.sh to run aforementioned drivers.
- Modified test/sup/test_gemm.c so that problem sizes are tested in
reverse order (from largest to smallest). This can help avoid certain
situations where the CPU frequency does not immediately throttle up
to its maximum. Thanks to Robert van de Geijn for recommending this
fix.
Details:
- Changed starting problem and increment from 16 to 4.
- Added 'lll' (square problems) to list of problem size shapes to
compile and run with.
- Define BLASFEO location and added BLASFEO-related definitions.
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.