Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
the rntm to be copied locally when it is passed in via one of the
expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:
void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);
where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
datatypes (along with the computation datatype) are equal. Now, 1m may
be used as long as all operands are stored in the complex domain. This
change largely consisted of adding the ability to pack to 1e and 1r
formats from one precision to another. It also required adding logic
for handling complex values of alpha to bli_packm_blk_var1_md()
(similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
ukernel output preference field being read. Previously, the preference
for the native complex ukernel was being read instead of the pref for
the native real domain ukernel. This bug would not manifest if the
preference for the native complex ukernel happened to be equal to that
of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
schemas are always read from the context, rather than trying to
sometimes embed them directly to the A and B objects. (They are still
embedded, but now uniformly only after reading the schemas from the
context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
scal2ris, and xpbyris (and their conjugating counterparts) to
explicitly support three operand types and updated invocations to
xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.
Details:
- Adjusted the definition for libblis_test_get_string_for_result() in
testsuite/src/test_libblis.c so that the "FAIL" string is returned if
the computed residual contains either NaN or Inf. Previously, a
residual containing NaN would result in the selection of the "PASS"
string. Thanks to Devin Matthews for reporting this issue (#279).
- Expounded on comment for the macro definitions of bli_isnan() and
bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
must remain macros.
Details:
- Updated docs/Testsuite.md to include mention of the new mixed-domain
and mixed-precision settings, including descriptions.
- Updated docs/MixedDatatypes.md to include a brief section on running
the testsuite to exercise mixed-datatype functionality, which mostly
amounts to a link to the Testsuite.md document.
- Minor verbiage change to testsuite output to correct a misleading
label associated with the value returned by the query function
bli_info_get_simd_num_registers(). (The function does not return the
number of SIMD registers present in the hardware, but rather a maximum
assumed value for the purposes of allocating temporary microtile
workspace on the function stack.)
Details:
- Print an error message from configure if the user attempts to
explicitly configure BLIS for simultaneous use of 64-bit integers in
the BLAS API with 32-bit integers in the BLIS API.
- Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
integers be 64 bits if the BLAS integers are 64 bits. This and the
above item take care of issue #274. Thanks to Devin Matthews and
Jeff Hammond for suggesting these safeguards.
- Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
sections and BLIS integer size line of the testsuite configuration
output.
- Very minor edits to docs/MixedDatatypes.md.
Details:
- Modified .travis.yml to automatically employ the simulation of
application-level threading within the testsuite, with supporting
changes to common.mk, the top-level Makefile, and
travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
'.salt' suffix (similar to those with the '.fast' suffix) for
testing application-level threading.
- Updated docs/BuildSystem.md to document the new make targets
'testblis-salt' and 'checkblis-salt'.
Details:
- Modified .travis.yml to automatically test the mixed-datatype support
of the gemm operation, with supporting changes to common.mk, the
top-level Makefile, and travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
'.mixed' suffix (similar to those with the '.fast' suffix) for testing
mixed-datatype gemm.
- Updated docs/BuildSystem.md to document the new make targets
'testblis-md' and 'checkblis-md'.
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
against a Windows DLL, will be able to access pthreads functionality
without those pthreads functions being explicitly exported by the DLL.
Instead, we export the bli_pthread_*() layer, which uses types and
functions that are identical to pthreads, but adds a 'bli_' prefix.
Only a few basic functions are present in the bli_pthreads_*() API
for now. Thanks to Devin Matthews and Isuru Fernando for their help
on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment updated to build/regen-symbols.sh.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
typed APIs. This operation adds a scalar value to every element along
an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
terms of the addv kernel. (The scalar is passed in as the x vector
with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
matrix object) with use of shiftd, which is much more concise, in
various test driver files in the testsuite. Similar changes were made
to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
BLISObjectAPI.md.
Details:
- Applied the following C preprocessor macro renames
BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
in src/test_libblis.c. This is apparently the result of a failure by
git to properly merge the 'master' and 'amd' branches in the previous
commit. (The 'master' branch contained a commit, 53a9ab1, in which
these same cpp macros were renamed throughout the source distribution.
Details:
- Implemented support for gemm where A, B, and C may have different
storage datatypes, as well as a computational precision (and implied
computation domain) that may be different from the storage precision
of either A or B. This results in 128 different combinations, all
which are implemented within this commit. (For now, the mixed-datatype
functionality is only supported via the object API.) If desired, the
mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
that requires a single m-by-n matrix be allocated (temporarily) per
call to gemm. This optimization aims to avoid the overhead involved in
repeatedly updating C with general stride, or updating C after a
typecast from the computation precision. This memory optimization may
be disabled at configure-time (provided that the mixed-datatype
support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
The user may test gemm with mixed domains, precisions, both, or
neither.
- Added a standalone test driver directory for building and running
mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
except that imaginary values are not touched when casting a real
operand to a complex operand. (By contrast, in these situations castm
sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
accessor functions. Also commented out all usage of accessor
functions within macrokernels. (Typecasting in the microkernel is
still feasible, though probably unrealistic for now given the
additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
(courtsey of Devin Matthews).
Details:
- Updated testsuite to output various parameters related to parallelism
in BLIS. These parameters include:
- threading status: disabled, openmp, or pthreads;
- thread partitioning for jr/ir loops: slab or rr (round-robin);
- ways of parallelism from environment variables, and also actual
values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
square problems (assuming all dimensions are set to 1000);
- automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
libmemkind and the sandbox.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
unconditionally, but differently for Windows and non-Windows, but
then disabled the definition of bli_setenv() entirely since BLIS
no longer needs to set environment variables. Updated bli_winsys.h
accordingly, and call bli_sleep() from within testsuite instead of
sleep() directly.
- Use
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
instead of
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
when guarding against local definition of pthread barrier in
testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
should always be set to 200809L when barriers are supported, though I
won't be surprised if we encounter a case in the future where it is
set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
top-level Makefile. These are no longer needed because we no longer
output libraries with the version and configuration name as
substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
Details:
- Tweaked the cflags functions in common.mk so that a new preprocessor
macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS
itself is being built. This macro will not be defined when, for
example, the testsuite or example code compiles code local to those
applications. This was done in part by defining a new cflags function
get-user-cflags-for(), which is now the designated function for
application Makefiles if they wish to inherit a basic set of CFLAGS
from BLIS. (The compiler flags returned are identical to that of
get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is
omitted.)
- Updated all test driver-like makefiles to call get-user-cflags-for()
instead of get-frame-cflags-for().
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
Details:
- Added lines associated with the testsuite's new threading option to
input.general.fast. This change was intended for the previous commit
(10d0735).
Details:
- Replaced critical sections that were conditional upon multithreading
being enabled (via pthreads or OpenMP) with unconditional use of
pthreads mutexes. (Why pthreads? Because BLIS already requires it
for its initialization mechanism: pthread_once().) This was done in
bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
mtx_t object and bli_mutex_*() API with pthread mutexes in
bli_thread.c. The previous status quo could result in a race condition
if the application called BLIS from more than one thread. The new
pthread-based code should be completely agnostic to the application's
threading configuration. Thanks to AMD for bringing to our attention
the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
multithreading. Specifically, each thread maintains a counter that is
incremented after each experiment. The thread only executes the
experiment if: counter % n_threads == thread_id. In other words, the
threads simply take turns executing each problem experiment. Also,
POSIX guarantees that fprintf() will not intermingle output, so
output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t intead of mtx_t and
replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
a race condition; specifically, two threads calling the function with
the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
to (and updating existing lines of) source files.
Details:
- Updated the build system so that "lesser" Makefiles, such as those in
belonging to example code or the testsuite, may be run even if the
directory is orphaned from the original build tree. This allows a
user to configure, compile, and install BLIS, delete the build tree
(that is, the source distribution, or the build directory for out-
of-tree builds) and then compile example or testsuite code and link
against the installed copy of BLIS (provided the example or testsuite
directory was preserved or obtained from another source). The only
requirement is that make be invoked while setting the
BLIS_INSTALL_PATH variable to the same installation prefix used when
BLIS was configured. The easiest syntax is:
make BLIS_INSTALL_PATH=/install/prefix
though it's also permissible to set BLIS_INSTALL_PATH as an
environment variable prior to running 'make'.
- Updated all lesser Makefiles to implement the new aforementioned build
behavior.
- Relocated check-blastest.sh and check-blistest.sh from build to
blastest and testsuite, respectively, so that if those directories are
copied elsewhere the user can still run 'make check' locally.
- Updated docs/Testsuite.md with language that mentions this new option
of building/linking against an installed copy of BLIS.
Details:
- Previously, most object API functions (_oapi.c) used a function
chooser macro that would expand out to an if-elseif-elseif-else
conditional that used a num_t datatype to call the appropriate
type-specific API (_tapi.c). This always felt a little hackish, and
would get in the way somewhat of addig support for new num_t datatypes
in the future. So, I've replaced that functionality with code that
queries a function pointer that is then typecast appropriately. This
model of function calling was already pervasive for kernels queried
from the cntx_t structure. It was also already in use in various other
functions, such as macrokernels, and this commit simply extends that
pattern.
- The above change required many new files, mostly header files, that
define the function types (mostly _ft.h) for the queriable functions
as well as some source files to define the function pointer arrays and
their corresponding query functions (_fpa.c). Various other function
types, mostly for kernel function types, were renamed to reduce the
potential for confusion with the function types for expert and basic
(non-expert) typed API functions.
- Removed definitions for all of the "bli_call_ft_*()" function chooser
macros from bli_misc_macro_defs.h.
Details:
- Changed the format specifier for the gflops column in the testsuite
output from %7.3f to %7.2f. This was done mainly to keep the output
aligned properly when the expected perfomance exceeded 1000 gflops.
Also, two decimal places still conveys plenty of precision for all
practical applications, including just eyeballing performance deltas
between two executions (let alone two implementations).
- Changed the format specifier for gflops in the test/3m4m drivers
from %6.3f to %7.2f (for the same reasons listed above).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
field of the cntx_t (context). The thrloop array holds the number of
ways of parallelism (thread "splits") to extract per level-3
algorithmic loop until those values can be used to create a
corresponding node in the thread control tree (thrinfo_t structure),
which (for any given level-3 invocation) usually happens by the time
the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
when invoking level-3 operations from two or more application threads.
The race condition existed because the cntx_t, a pointer to which is
usually queried from the global kernel structure (gks), is supposed to
be a read-only. However, the previous code would write to the cntx_t's
thrloop field *after* it had been queried, thus violating its read-only
status. In practice, this would not cause a problem when a sequential
application made a multithreaded call to BLIS, nor when two or more
application threads used the same parallelization scheme when calling
BLIS, because in either case all application theads would be using
the same ways of parallelism for each loop. The true effects of the
race condition were limited to situations where two or more application
theads used *different* parallelization schemes for any given level-3
call.
- In remedying the above race condition, the application or calling
library can now specify the parallelization scheme on a per-call basis.
All that is required is that the thread encode its request for
parallelism into the rntm_t struct prior to passing the address of the
rntm_t to one of the expert interfaces of either the typed or object
APIs. This allows, for example, one application thread to extract 4-way
parallelism from a call to gemm while another application thread
requests 2-way parallelism. Or, two threads could each request 4-way
parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
of the level-3 implementation stack (with the most notable exception
being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
APIs. (A few internal functions gained the rntm_t* parameter even
though they currently have no use for it, such as bli_l3_packm().)
This required some internal calls to some of those functions to
be updated since BLIS was already using those operations internally
via the expert interfaces. For situations where a rntm_t object is
not available, such as within packm/unpackm implementations, NULL is
passed in to the relevant expert interfaces. This is acceptable for
now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
read once, at library initialization. (Thanks to Nathaniel Smith for
suggesting this to avoid repeated calls getenv(), which can be slow.)
Those values are recorded to a global rntm_t object. Public APIs, in
bli_thread.c, are still available to get/set these values from the
global rntm_t, though now the "set" functions have additional logic
to ensure that the values are set in a synchronous manner via a mutex.
If/when NULL is passed into an expert API (meaning the user opted to
not provide a custom rntm_t), the values from the global rntm_t are
copied to a local rntm_t, which is then passed down the function stack.
Calling a basic API is equivalent to calling the expert APIs with NULL
for the cntx and rntm parameters, which means the semantic behavior of
these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
and reimplemented, with the function now being able to treat the
incoming rntm_t in a manner agnostic to its origin--whether it came
from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
as the corresponding "get" functions. The new model simplifies these
interfaces so that one must either set the total number of threads, OR
set all of the ways of parallelism for each loop simultaneously (in a
single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
(and two specific ways within each method) of requesting parallelism
in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
* Add appveyor file
* Build script
* Remove fPIC for now
* copy as
* set CC and CXX
* Change the order of immintrin.h
* Fix testsuite header
* Move testsuite defs to .c
* Fix appveyor file
* Remove fPIC again and fix strerror_r missing bug
* Remove appveyor script
* cd to blis directory
* Fix sleep implementation
* Add f2c_types_win.h
* Fix f2c compilation
* Remove rdp and rename appveyor.yml
* Remove setenv declaration in test header
* set CPICFLAGS to empty
* Fix another immintrin.h issue
* Escape CFLAGS and LDFLAGS
* Fix more ?mmintrin.h issues
* Build x86_64 in appveyor
* override LIBM LIBPTHREAD AR AS
* override pthreads in configure
* Move windows definitions to bli_winsys.h
* Fix LIBPTHREAD default value
* Build intel64 in appveyor for now
Details:
- Removed three input lines from input.operations.fast (labeled
"test sequential micro-kernel") that I intended to remove in bd02c4e.
These lines prevented 'make check' (and 'make checkblis-fast') from
completing correctly. Note: This bug was fixed in 3df39b3, but that
commit has not yet been merged into master, hence this redundant
commit. Thanks to Robert van de Geijn for reporting this issue.
Details:
- Removed the line in each operation entry in input.operations titled
"test sequential front-end" and the corresponding support for the lines
in the testsuite input parsing code. This line was included in the some
of the earliest versions of the testsuite, back when I intended to
eventually have separate multithreaded APIs. Specifically, I envisioned
that multithreaded and sequential testing could be enabled or disabled
on an operation level. However, BLIS evolved in a different direction
and still does not have multithreaded-specific APIs (even if it will
eventually someday). But even if it did have such APIs, I doubt I would
allow the user to enable/disable them on an operation level. Thus, this
was a zombie future parameter that was never used and never made sense
to begin with. The one instance of the front_seq variable, used in the
various libblis_test_<operation>() functions to guard the call to the
operation test driver, that remains was commented out instead of
deleted so that someday it could be easily changed via sed, if desired.
- Various minor cleanups to the testsuite code, including consolidating
use of DISABLE and DISABLE_ALL and reexpressing certain conditional
expressions in the libblis_test_<operation>() functions in terms of
boolean functions.
Details:
- Renamed the following variables in config.mk (via build/config.mk.in):
BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE
BLIS_ENABLE_STATIC_BUILD -> MK_ENABLE_STATIC
BLIS_ENABLE_SHARED_BUILD -> MK_ENABLE_SHARED
BLIS_ENABLE_BLAS2BLIS -> MK_ENABLE_BLAS
BLIS_ENABLE_CBLAS -> MK_ENABLE_CBLAS
BLIS_ENABLE_MEMKIND -> MK_ENABLE_MEMKIND
and also renamed all uses of these variables in makefiles and makefile
fragments. Notice that we use the "MK_" prefix so that those variables
can be easily differentiated (such as via grep) from their "BLIS_" C
preprocessor macro counterparts.
- Other whitespace changes to build/config.mk.in.
- Renamed the following C preprocessor macros in bli_config.h (via
build/bli_config.h.in):
BLIS_ENABLE_BLAS2BLIS -> BLIS_ENABLE_BLAS
BLIS_DISABLE_BLAS2BLIS -> BLIS_DISABLE_BLAS
BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE
and also renamed all relevant uses of these macros in BLIS source
files.
- Renamed "blas2blis" variable occurrences in configure to "blas", as
was done in build/config.mk.in and build/bli_config.h.in.
- Renamed the following functions in frame/base/bli_info.c:
bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas()
bli_info_get_blas2blis_int_type_size()
-> bli_info_get_blas_int_type_size()
- Remove bli_config.h during 'make cleanh' target of top-level Makefile.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
Details:
- Added logic to common.mk that will choose a BLIS library against which
to link (LIBBLIS_LINK). The default choice is the static (.a) library;
the shared (.so) library is chosen only if the shared library build was
enabled and the static one was disabled.
- Updated the various test driver Makefiles to reference this common,
pre-chosen library against which to link. (Previously, these drivers
unconditionally linked against the static library and would have
failed if the static library build was disabled at configure-time.)
- Renamed many of the variables in common.mk and the top-level Makefile
so that variables relating to the libblis.[a|so] files, including
paths to those files, begin with "LIBBLIS".
- Shuffled around some of the library definitions from the top-level
Makefile to common.mk.
- Renamed BLIS_ENABLE_DYNAMIC_BUILD to BLIS_ENABLE_SHARED_BUILD, and
the @enable_dynamic@ anchor to @enable_shared@ in build/config.mk.in
and in configure.
- A few other cleanups in the top-level Makefile.
Details:
- Added an 'input.operations.fast' file to testsuite directory to go
along with the 'input.general.fast' file used by the 'make check'
target in the top-level Makefile. This will allow the "fast" check
to prune operations and/or parameter combinations from the test
space in order to save time.
- Currently, input.operations.fast prunes trmm3 and all transposition
and conjugation parameters from the level-3 test space.
- Reduced problem size tested in input.general.fast to 100 and disabled
testing of 1m method.
Details:
- Created a new test suite that exercises only the BLAS compatibility
found in BLIS. The test suite is a straightforward port of code
obtained from netlib LAPACK, run through f2c and linked to a stripped-
down version of libf2c that is compiled along with the test drivers
(to prevent any obvious ABI issues). The new BLAS test suite can be
run from within its new local directory, 'blastest' (through its local
'make ; make run' targets) or from the top-level Makefile (via the
'make testblas' target). Output files are created in whatever directory
the test drivers are run, whether it be the 'blastest' directory, the
top-level source distribution directory, or the out-of-tree directory
in which 'configure' was run. Also, the results of the BLAS test suite
can be checked via 'make checkblas', which summarizes the presence or
absence of test failures in a single line printed to stdout.
- Updated the 'test' target to run both 'testblis' and 'testblas'.
- Added a new 'testblis-fast' target that runs the BLIS testsuite with
smaller problem sizes, allowing it to finish more quickly.
- Added a 'make check' target, which runs 'checkblis-fast' and
'checkblas'.
- Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of
'testblis' before (calling the check-blistest.sh script to check the
result manually).
- Renamed some targets in the top-level Makefile to be consistent between
BLAS and BLIS.
Details:
- Mistakenly was calling bli_cpuid_query_id() instead of
bli_arch_query_id() in the recent addition to the testsuite output
that prints the active sub-configuration. The former function is
only used for multi-architecture builds, whereas the latter is the
more general option that also works for single configuration
(including 'configure auto') builds.
Details:
- Added a line to the testsuite output that prints the name of the
current/active sub-configuration. This is useful when linking the
testsuite against multi-configuration builds because it confirms
the sub-configuration that is actually being employed at runtime.
Thanks to Devin Matthews for suggesting this feature.
Details:
- Change "vector storage schemes to test" parameter in testsuite's
input.general file to "cj". This means that both unit stride column
vectors and non-unit stride column vectors will be tested in
operations with vector operands (e.g. level-1v, level-1f, level-2).
- Very minor comment (typo) changes to input.operations.
Details:
- Updated the testsuite driver so that setting one or more individual
operation test switches to "2" in input.operations will enable ONLY
those operations and disable all others, regardless of the values of
the section overrides and other operation switches. This makes it
every easy to quickly test only one or two operations, and equally
easy to revert back to the previous combination of operation tests.
- Added more comments to input.operations describing the use of
individual "enable only" overrides.
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
Special thanks to AMD for their contributions to-date, especially with
regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
the extra cost of transposing the microtile in registers, this is
much faster than using the general storage case when the underlying
matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
- only one vzeroupper instruction is called prior to returning
- movapd/movupd are used in leiu of movaps/movups for double-real
microkernels. (Note that single-real microkernels still use
movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
comments.
- Updated LICENSE file to explicitly mention that parts are copyright
UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.
Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
sum of the squares via dotv(). If there is a floating-point exception
(FE_OVERFLOW), then the previous (numerically conservative) code is
used; otherwise, the result of dotv() is square-rooted and stored as
the result. This new implementation is only enabled when FE_OVERFLOW
is #defined. If the macro is not #defined, then the previous
implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.