Details:
- Fixed a linker error that occurred when attempting to compile and link
the testsuite and/or BLAS test drivers after having configured BLIS to
only generate a shared library (no static library). The chosen
solution involved
(1) adding the local library path, $(BASE_LIB_PATH), to the search
paths for the shared library via the link option
-Wl,-rpath,$(BASE_LIB_PATH).
(2) adding a local symlink to $(BASE_LIB_PATH) that uses the .so major
version number so that ld would find the shared library at
execution time.
Thanks to Sajid Ali for reporting this issue, to Devin Matthews for
pointing out the need for the -rpath option, and to Devangi Parikh for
helping Sajid isolate the problem.
- Added #include <ctype.h> to bli_system.h to avoid a compiler warning
resulting from using toupper() from bli_string.c without a prototype.
Thanks again to Sajid Ali, whose build log revealed this compiler
warning.
- Added '*.so.*' to .gitignore.
- CREDITS file update.
Details:
- Fixed bug in static function bli_cntx_set_[packm/unpackm]_ker_dt(), which
were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt to get the
corresponding func_t.
Details:
- Added links, and sandbox language to README.md.
- Adjusted some comments in high-level level-3 object functions to make
clear what bli_thread_init_rntm() does.
Details:
- Consolidated typed API function prototypes in bli_l1v_tapi.h by
leveraging identical function signatures between operations.
- Removed 'restrict' keyword since it is not actually present in the
function definitions.
Details:
- Added explicit typecasting to various functions (mostly static
functions), primarily those in bli_param_macro_defs.h,
bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
files.
- This change was prompted by feedback from Jacob Gorm Hansen, who
reported that #including "blis.h" from his application caused a
gcc to output error messages (relating to types being returned
mismatching the declared return types) when used via the C++ compiler
front-end. This is the first pass of fixes, and we may need to
iterate with additional follow-up commits (#233).
Details:
- Fixed an unused variable warning in frame/base/bli_rntm.c when
multithreading is disabled.
- Fixed a missing variable declaration in bli_thread_init_rntm_from_env()
when multithreading is disabled.
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
field of the cntx_t (context). The thrloop array holds the number of
ways of parallelism (thread "splits") to extract per level-3
algorithmic loop until those values can be used to create a
corresponding node in the thread control tree (thrinfo_t structure),
which (for any given level-3 invocation) usually happens by the time
the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
when invoking level-3 operations from two or more application threads.
The race condition existed because the cntx_t, a pointer to which is
usually queried from the global kernel structure (gks), is supposed to
be a read-only. However, the previous code would write to the cntx_t's
thrloop field *after* it had been queried, thus violating its read-only
status. In practice, this would not cause a problem when a sequential
application made a multithreaded call to BLIS, nor when two or more
application threads used the same parallelization scheme when calling
BLIS, because in either case all application theads would be using
the same ways of parallelism for each loop. The true effects of the
race condition were limited to situations where two or more application
theads used *different* parallelization schemes for any given level-3
call.
- In remedying the above race condition, the application or calling
library can now specify the parallelization scheme on a per-call basis.
All that is required is that the thread encode its request for
parallelism into the rntm_t struct prior to passing the address of the
rntm_t to one of the expert interfaces of either the typed or object
APIs. This allows, for example, one application thread to extract 4-way
parallelism from a call to gemm while another application thread
requests 2-way parallelism. Or, two threads could each request 4-way
parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
of the level-3 implementation stack (with the most notable exception
being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
APIs. (A few internal functions gained the rntm_t* parameter even
though they currently have no use for it, such as bli_l3_packm().)
This required some internal calls to some of those functions to
be updated since BLIS was already using those operations internally
via the expert interfaces. For situations where a rntm_t object is
not available, such as within packm/unpackm implementations, NULL is
passed in to the relevant expert interfaces. This is acceptable for
now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
read once, at library initialization. (Thanks to Nathaniel Smith for
suggesting this to avoid repeated calls getenv(), which can be slow.)
Those values are recorded to a global rntm_t object. Public APIs, in
bli_thread.c, are still available to get/set these values from the
global rntm_t, though now the "set" functions have additional logic
to ensure that the values are set in a synchronous manner via a mutex.
If/when NULL is passed into an expert API (meaning the user opted to
not provide a custom rntm_t), the values from the global rntm_t are
copied to a local rntm_t, which is then passed down the function stack.
Calling a basic API is equivalent to calling the expert APIs with NULL
for the cntx and rntm parameters, which means the semantic behavior of
these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
and reimplemented, with the function now being able to treat the
incoming rntm_t in a manner agnostic to its origin--whether it came
from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
as the corresponding "get" functions. The new model simplifies these
interfaces so that one must either set the total number of threads, OR
set all of the ways of parallelism for each loop simultaneously (in a
single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
(and two specific ways within each method) of requesting parallelism
in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
Details:
- Renamed various files that were previously named according to a
"with context" or "without context" convention. For example, the
following files in frame/3 were renamed:
frame/3/bli_l3_oapi_woc.c -> frame/3/bli_l3_oapi_ba.c
frame/3/bli_l3_oapi_wc.c -> frame/3/bli_l3_oapi_ex.c
frame/3/bli_l3_tapi_woc.c -> frame/3/bli_l3_tapi_ba.c
frame/3/bli_l3_tapi_wc.c -> frame/3/bli_l3_tapi_ex.c
Here, the "ba" is for "basic" and "ex" is for "expert". This new
naming scheme will make more sense especially if/when additional
expert parameters are added to the expert APIs (typed and object).
Details:
- Split existing typed APIs into two subsets of interfaces: one for use
with expert parameters, such as the cntx_t*, and one without. This
separation was already in place for the object APIs, and after this
commit the typed and object APIs will have similar expert and non-
expert APIs. The expert functions will be suffixed with "_ex" just as
is the case for expert interfaces in the object APIs.
- Updated internal invocations of typed APIs (functions such as
bli_?setm() and bli_?scalv()) throughout BLIS to reflect use of the
new explictly expert APIs.
- Updated example code in examples/tapi to reflect the existence (and
usage) of non-expert APIs.
- Bumped the major soname version number in 'so_version'. While code
compiled against a previous version/commit will likely still work
(since the old typed function symbol names still exist in the new API,
just with one less function argument) the semantics of the function
have changed if the cntx_t* parameter the application passes in is
non-NULL. For example, calling bli_daxpyv() with a non-NULL context
does not behave the same way now as it did before; before, the
context would be used in the computation, and now the context would
be ignored since the interace for that function no longer expects a
context argument.
* Upload artifacts built on appveyor (#228)
* Upload artifacts
* Fix install in appveyor
* Remove windows.h in bli_winsys.c (#229)
Looks like it is unneeded.
* Implemented ARG_MAX hack in configure, Makefile.
Details:
- Added support for --enable-arg-max-hack to configure, which will
change the behavior of make when building BLIS so that rather than
invoke the archiver/linker with all of the object files as command
line arguments, those object files are echoed to a temporary file
and then the archiver/linker is fed that temporary file via the @
notation. An example of this can be found in the GNU make docs at
https://www.gnu.org/software/make/manual/make.html#File-Function
- Thanks to Isuru Fernando for prompting this feature.
* Enable x86_64 and arg-max-hack on appveyor
* Use gas style assembly for clang on windows
* Add appveyor file
* Build script
* Remove fPIC for now
* copy as
* set CC and CXX
* Change the order of immintrin.h
* Fix testsuite header
* Move testsuite defs to .c
* Fix appveyor file
* Remove fPIC again and fix strerror_r missing bug
* Remove appveyor script
* cd to blis directory
* Fix sleep implementation
* Add f2c_types_win.h
* Fix f2c compilation
* Remove rdp and rename appveyor.yml
* Remove setenv declaration in test header
* set CPICFLAGS to empty
* Fix another immintrin.h issue
* Escape CFLAGS and LDFLAGS
* Fix more ?mmintrin.h issues
* Build x86_64 in appveyor
* override LIBM LIBPTHREAD AR AS
* override pthreads in configure
* Move windows definitions to bli_winsys.h
* Fix LIBPTHREAD default value
* Build intel64 in appveyor for now
Details:
- Implemented bli_obj_scalar_cast_to(), which will typecast the value in
the internal scalar of an obj_t to a specified datatype.
- Changed bli_obj_scalar_attach() so that the scalar value being attached
is first typecast to the storage datatype of the destination object
rather than the target datatype.
- Reformatted function type signatures in bli_obj_scalar.c as well as
prototypes in its corresponding header file.
Details:
- Fixed incorrect bit shifts in the following static functions:
bli_obj_set_target_domain()
bli_obj_set_target_prec()
bli_obj_set_exec_domain()
bli_obj_set_exec_prec()
- Fixed incorrect bitmask in bli_dt_proj_to_single_prec().
- Updated bli_obj_real_part() and bli_obj_imag_part() so that it updates
the target and exec datatypes (in addition to the storage datatypes).
Details:
- Added a new field to the auxinfo_t struct that can be used, in theory,
to request type conversion before the microkernel stores/accumulates
its microtile back to memory.
- Added the appropriate get/set static functions to bli_type_defs.h.
Details:
- Added definitions of static functions bli_dt_domain()/bli_dt_prec(),
which extract a dom_t domain or prec_t precision value, respectively,
from a num_t datatype.
- Changed the return types of bli_obj_domain() and bli_obj_prec() from
objbits_t to dom_t and prec_t. (Not sure why they were ever set to
return objbits_t.)
Details:
- Added static functions for projecting a datatype to single precision
or double precision, both for obj_t's storage datatypes and standalone
datatypes.
Details:
- Added functions to bli_obj_macro_defs.h to get and set the target
domain and target precision bits in the obj_t, and also added the
appropriate support in bli_type_defs.h.
Details:
- Implemented castm and castv operations, which behave like copym and
copyv except where the obj_t operands can be of different datatypes.
These new operations, however, unlike copym/copyv, do not build upon
existing level-1v kernels.
- Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
match the newly added frame/base/cast directory).
- Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
combinations. Previously, one had to invoke two additional macros--one
which mixed domains only and another that included all remaining
cases--in order to get full type combination coverage.
- Defined a new static function, bli_set_dims_incs_2m(), to aid in the
setting of various variables in the implementations of bli_??castm().
This static function joins others like it in bli_param_macro_defs.h.
- Comment update to bli_copysc.h.
Details:
- Added functions to bli_obj_macro_defs.h to get and set the execution
domain and execution precision bits in the obj_t.
- Added/rearranged a few functions in bli_obj_macro_defs.h.
- Renamed some macros in bli_type_defs.h: EXECUTION -> EXEC.
Details:
- Added a new static function to bli_blksz.h that scales both the default
(regular) blocksize as well as the maximum blocksize in the blksz_t
object. Reminder: maximum blocksizes have different meanings in
different contexts. For register blocksizes, they refer to the packing
register blocksizes (PACKMR or PACKNR) while for cache blocksizes, they
refer to the maximum blocksize to use during the final iteration of a
loop.
Details:
- Changed the way virtual microkernels are handled in the context.
Previously, there were query routines such as bli_cntx_get_l3_ukr_dt()
which returned the native ukernel for a datatype if the method was
equal to BLIS_NAT, or the virtual ukernel for that datatype if the
method was some other value. Going forward, the context native and
virtual ukernel slots will both be initialized to native ukernel
function pointers for native execution, and for non-native execution
the virtual ukernel pointer will be something else. This allows us
to always query the virtual ukernel slot (from within, say, the
macrokernel) without needing any logic in the query routine to decide
which function pointer (native or virtual) to return. (Essentially,
the logic has been shifted to init-time instead of compute-time.)
This scheme will also allow generalized virtual ukernels as a way
to insert extra logic in between the macrokernel and the native
microkernel.
- Initialize native contexts (in bli_cntx_ref.c) with native ukernel
function addresses stored to the virtual ukernel slots pursuant to
the above policy change.
- Renamed all static functions that were native/virtual-ambiguous, such
as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt()
pursuant to the above polilcy change. Those routines now use the
substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All
of these functions were static functions defined in bli_cntx.h, and
most uses were in level-3 front-ends and macrokernels.
- Deprecated anti_pref bool_t in context, along with related functions
such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's
panel-block execution is disabled.
Details:
- Implemented bli_acquire_mpart(), a general-purpose submatrix view
function that will alias an obj_t to be a submatrix "view" of an
existing obj_t.
- Renumbered examples in examples/oapi and inserted a new example file,
03obj_view.c, which shows how to use bli_acquire_mpart() to obtain
submatrix views of existing objects, which can then be used to
indirectly modify the parent object.
Details:
- Defined new wrappers to setm/setv operations in frame/base/bli_setri.c
that will target only the real or only the imaginary parts of a
matrix/vector object.
- Updated bli_obj_real_part() so that the complex-specific portions of
the function are not executed if the object is real.
- Defined bli_obj_imag_part().
- Caveat: If bli_obj_imag_part() is called on a real object, it does
nothing, leaving the destination object untouched. The caller must
take care to only call the function on complex objects.
- Reordered some of the static functions in bli_obj_macro_defs.h related
to aliasing.
Details:
- Fixed a bug identical to the one fixed in 0a4a27e, except this time in
the bli_obj_param_defs.h header file. It looks like the only consumers
of this static function were in bli_l0_oapi.c, and so this may not have
been manifesting (yet).
Details:
- Added logic to level-1v, -1d, -1f, -1m, -2, and -3 operations' _check()
functions to ensure that all operands are of the same datatype. There
are some exceptions that were left out, such as the _check() function
for the various norm operations since they have a different idea of
datatype consistency (ie: the norm object must be the real projection
of the primary input vector/matrix object).
Details:
- Added an implementation for bli_projv() to go along with the
implementation of bli_projm() added in 0a4a27e. The only difference
between the two is that bli_projv() may only be used on vectors,
whereas bli_projm() is general-purpose.
- Added a _check() function corresponding to bli_projv().
Details:
- Defined additional functions in bli_param_map.c:
bli_param_map_char_to_blis_dt()
bli_param_map_blis_to_char_dt()
which will map a char to its corresponding num_t, or vice versa.
Details:
- Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
behaves like bli_copym(), except that operands a and b are allowed to
contain data of differing domains (e.g. a is real while b is complex,
or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
with the intention that a 'v' vector version of the function may be
added to the same file (at some point in the future).
- Added supporting bli_check_*() functions in bli_check.c to confirm
consistent precisions between to datatypes/objects, as well as the
appropriate error message in bli_error.c and a new error code in
bli_type_defs.h.
- Wrote a bli_projm_check() function to go along with bli_projm().
- Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
which will initialize an obj_t alias to the real part of the source
object.
- Fixed a bug in the static function bli_dt_proj_to_complex(), found
in bli_param_macro_defs.h. Thankfully, there were no calls to the
function to produce buggy behavior.
- Query pack schemas in level-3 bli_*_front() functions and store those
values in the schema bitfields of the correponding obj_t's when the
cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default
native schemas are stored to the obj_t's.)
- In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in
bli_*_front(), clear the schema bitfields, and pass the queried values
into bli_gemm_cntl_create() and bli_trsm_cntl_create().
- Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to
take schemas for A and B, and use these values to initialize the
appropriate control tree nodes. (Also cpp-disabled the panel-block cntl
tree creation variant, bli_gemmpb_cntl_create(), as it has not been
employed by BLIS in quite some time.)
- Simplified querying of schema in bli_packm_init() thanks to above
changes.
- Updated openmp and pthreads definitions of bli_l3_thread_decorator()
so that thread-local aliases of matrix operands are guaranteed, even
if aliasing is disabled within the internal back-end functions (e.g.
bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c
explaining why the extra aliasing is not needed there.
- Change bli_gemm() and level-3 friends so that the operation's ind()
function is called only if all matrix operands have the same datatype,
and only if that datatype is complex. The former condition is needed
in preparation for work related to mixed domain operands, while the
latter helps with readability, especially for those who don't want to
venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be
consistent with BLIS calling conventions (modified argument(s) are
last), and updated all invocations in the level-3 _front() functions.
- Comment updates to bli_cntx_set_thrloop_from_env().
Details:
- Renamed several macros defined in bli_l3_thrinfo.h designed to compute
the values of a_next and b_next to insert into an auxinfo_t struct in
level-3 macrokernels. (Previously, the macros did not use a bli_
prefix.)
- Updated instances of above macro usage within various macrokernels.
Detail:
- configure:
- add support for --enable-sandbox=NAME to configure script, where NAME
is a subdirectory of a new 'sandbox' directory that contains an
alternative implementation of gemm. (For now, only implementations of
gemm may be provided via a sandbox.);
- add support for C++ compiler. C++ compilers are handled in a manner
similar to that of C compilers, in that a default search order is
used, and that CXX is searched for first, if the variable is set. In
practice, the C++ compiler that is selected should correspond to the
selected C compiler. (Example: If gcc is selected for C, g++ should
be selected for C++.) The result of the search is output to config.mk
via build/config.mk.in. NOTE: The use of C++ in BLIS is still
hypothetical, but may eventually move to being experimental. This
support was intended only for use of C++ within a gemm sandbox.
- build/config.mk.in:
- define SANDBOX variable containing sandbox subdirectory name.
- build/bli_config.in:
- define either of the BLIS_ENABLE_SANDBOX or BLIS_DISABLE_SANDBOX
macros in bli_config.h.
- common.mk:
- include makefile fragments that were propagated into the specified
sandbox subdirectory;
- generate different CFLAGS for sandboxes, as well as a separate
CXXFLAGS variable for sandboxes when C++ source files are compiled;
- isolate into a single location lists of file suffixes for various
purposes.
- reorganized/clean up code related to identifying header files and
paths.
- Makefile:
- generate object filepaths for and compile source code files found in
sandbox sub-directory;
- remove makefile fragments placed in sandbox sub-directory (cleanmk);
- various other cleanups.
- Added .cc, .cpp, and .cxx to list of suffixes of files to recognize in
makefile fragments (via build/gen-make-frags/suffix_list).
- Updated blis.h to conditionally #include bli_sandbox.h (via a new file,
bli_sbox.h), which each sandbox is assumed to use for any type
definitions and function prototypes it wishes to export out to blis.h.
- Conditionally disable bli_gemmnat() implementation in frame/3 when
BLIS_ENABLE_SANDBOX is defined.
Details:
- Changed the void* arguments of the following static functions:
bli_is_aligned_to()
bli_is_unaligned_to()
bli_offset_past_alignment()
to siz_t, and the return type of bli_offset_past_alignment() from
guint_t to siz_t. This allows for more versatile usage of these
functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
but also in kernels/bgq, to include explicit typecasts to siz_t when
pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
#211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
various kernel license headers, likely from a malformed recursive sed
performed long ago.
Details:
- Added HP Enterprise to the LICENSE file. Previously, only the source
files touched by HPE contained the corresponding copyright notices.
(This oversight was unintentional.)
- Updated file-level copyright notices to include a comma, to match
the formatting used for UT and AMD copyrights.
Details:
- Removed critical sections protecting the initialization/finalization of
bli_memsys.c. These synchronization mechanisms are no longer needed now
that BLIS initializes all APIs via pthread_once().
Details:
- Added logic to bli_arch.c that will call what was previously the body
of bli_arch_query_id() only once and then cache the value in a static
variable local to the file. (Previously, the arch_t associated with
the hardware/configuration was queried every time bli_arch_query_id()
was called, which was at least once per level-3 function call. Thanks
to Devin Matthews for suggesting this feature via issue #175.
- Added -lpthread to the compile/link command line of the compiler
invocation that compiles build/detect/config/config_detect.c, which
prints the string identifying the detected configuration, since it
is now needed due to new pthread_once() logic in bli_arch.c.
- Implementation note: I chose to implement this arch_t caching feature
via pthread_once(), using a separate pthread_once_t variable local to
the file, rather than calling bli_init_once(). The reason is that I
did not want to require bli_init() as a prerequisite to this function.
bli_init() already calls several sub-components, some of which make use
of bli_arch_query_id(), and therefore it would be easy to fall into a
circular self-init situation (which usually causes pthreads to hang
indefinitely).
Details:
- Fixed a bug that would cause configurations to inadvertantly define
their integers to be 32 bits when those environments actually call for
64-bit integers. While either BLIS_ARCH_64 or BLIS_ARCH_32 is defined
in bli_system.h (based on whether preprocessor macros such as __x86_64
or __aarch64__ are defined by the environment), bli_system.h was being
#included *after* bli_config_macro_defs.h, in which the BLIS_ARCH_64
macro was used to choose an integer type size in the event that
BLIS_INT_TYPE_SIZE was not already defined by configure via
bli_config.h. And due to the structure of the cpp code in that file,
the 32-bit integer case was being chosen. Thanks to Francisco Igual
and Devangi Parikh for their help in isolating this bug.
- Moved the #include of hbwmalloc.h and related preprocessor code to
bli_kernel_macro_defs.h to facilitate the reshuffling of the #include
for bli_system.h in blis.h.