Details:
- Previously, bli_finalize_once()--which, like bli_init_once(), was
implemented in terms of pthread_once()--was using the same
pthread_once_t control object being used by bli_init(), thus
guaranteeing that it would never be called as long as BLIS had already
been initialized. This could manifest as a rather large memory leak to
any application that attempted to finalize BLIS midway through its
execution (since BLIS reserves several megabytes of storage for
packing buffers per thread used). The fix entailed giving each
function its own pthread_once_t object. Thanks to Devangi Parikh for
helping track down this very quiet bug.
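For illustration, the shape of the fix can be sketched as follows. This is a minimal sketch, not the BLIS source: lib_init(), lib_finalize(), and pack_buffer are hypothetical stand-ins, and only pthread_once() and PTHREAD_ONCE_INIT are real POSIX names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for storage the library reserves at init time. */
static void *pack_buffer = NULL;

static void lib_init(void)
{
    assert(pack_buffer == NULL);   /* must not already be initialized */
    pack_buffer = malloc(1 << 20);
}

static void lib_finalize(void)
{
    free(pack_buffer);
    pack_buffer = NULL;
}

/* One control object per function. If both wrappers shared a single
   pthread_once_t, the finalize routine would never run once the init
   routine had consumed the control object. */
static pthread_once_t init_once_ctl     = PTHREAD_ONCE_INIT;
static pthread_once_t finalize_once_ctl = PTHREAD_ONCE_INIT;

void lib_init_once(void)     { pthread_once(&init_once_ctl, lib_init); }
void lib_finalize_once(void) { pthread_once(&finalize_once_ctl, lib_finalize); }

/* Reports whether the hypothetical buffer is currently allocated. */
int lib_is_initialized(void) { return pack_buffer != NULL; }
```

With separate control objects, calling lib_finalize_once() after lib_init_once() actually releases the buffer instead of silently doing nothing.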
Details:
- Changed the cleanmk target to delete makefile fragments from their new
home in obj/$(CONFIG_NAME). The old definition worked only because of
a typo (REFERKN_PATH instead of REFKERN_PATH), and only in the
non-verbose (V != 1) case.
Details:
- Disable sandbox-related obj directory creation, directory mirroring,
and makefile fragment generation when a sandbox is not enabled.
- Prevent various duplicate actions by configure (such as those
mentioned above for sandboxes).
Details:
- The docs/ConfigurationHowTo.md document was written with examples that
did not yet contain the skx sub-configuration, but the previous commit
included bli_arch.c code copied and pasted from a recent commit that
does support skx. To keep things consistent, I've removed skx from the
recently-added ConfigurationHowTo.md code snippet.
Details:
- Added missing language directing the reader to modify the config_name
string array in bli_arch.c when adding a new sub-configuration. Thanks
to Devangi Parikh for reporting this missing section.
Details:
- Fixed some stale code that was preventing the -p option to configure
from working as expected (though the --prefix option was unaffected).
This bug was most likely introduced in 7e5648c (May 7 2018).
Thanks to Dave Love for reporting this issue.
Details:
- Redefined the 'test' make target in the top-level Makefile so that the
final result ("everything passed" or "at least one failure") is echoed
to stdout. Note that 'check' is unchanged, and thus is now effectively
a fast version of 'test'.
- Updated docs/BuildSystem.md to reflect the above change.
Details:
- Fixed a linker error that occurred when attempting to compile and link
the testsuite and/or BLAS test drivers after having configured BLIS to
only generate a shared library (no static library). The chosen
solution involved:
(1) adding the local library path, $(BASE_LIB_PATH), to the search
paths for the shared library via the link option
-Wl,-rpath,$(BASE_LIB_PATH), and
(2) adding a local symlink to $(BASE_LIB_PATH) that uses the .so major
version number so that ld would find the shared library at
execution time.
Thanks to Sajid Ali for reporting this issue, to Devin Matthews for
pointing out the need for the -rpath option, and to Devangi Parikh for
helping Sajid isolate the problem.
- Added #include <ctype.h> to bli_system.h to avoid a compiler warning
resulting from using toupper() from bli_string.c without a prototype.
Thanks again to Sajid Ali, whose build log revealed this compiler
warning.
- Added '*.so.*' to .gitignore.
- CREDITS file update.
Details:
- Removed a stray/accidental redefinition of axpyv and scal2v function
types in frame/1d/bli_l1d_ft.h (probably a copy/paste leftover during
development).
Details:
- Updated older _ft kernel type suffixes used within penryn level-1v
and -1f kernels to use the newer _ker_ft suffix that was introduced
in 0175483. (Thank you Travis CI.)
Details:
- Previously, most object API functions (_oapi.c) used a function
chooser macro that would expand out to an if-elseif-elseif-else
conditional that used a num_t datatype to call the appropriate
type-specific API (_tapi.c). This always felt a little hackish, and
would get in the way somewhat of adding support for new num_t datatypes
in the future. So, I've replaced that functionality with code that
queries a function pointer that is then typecast appropriately. This
model of function calling was already pervasive for kernels queried
from the cntx_t structure. It was also already in use in various other
functions, such as macrokernels, and this commit simply extends that
pattern.
- The above change required many new files, mostly header files, that
define the function types (mostly _ft.h) for the queriable functions
as well as some source files to define the function pointer arrays and
their corresponding query functions (_fpa.c). Various other function
types, mostly for kernel function types, were renamed to reduce the
potential for confusion with the function types for expert and basic
(non-expert) typed API functions.
- Removed definitions for all of the "bli_call_ft_*()" function chooser
macros from bli_misc_macro_defs.h.
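The query-and-cast pattern described above can be sketched roughly as follows. All names here (dt_t, scalv_ft, scalv_fpa, scalv_qfp) are hypothetical and only loosely modeled on the _ft/_fpa naming mentioned in this commit; they are not the actual BLIS definitions.

```c
#include <assert.h>

/* Hypothetical datatype ids, loosely modeled on num_t. */
typedef enum { DT_FLOAT = 0, DT_DOUBLE = 1, DT_COUNT } dt_t;

/* Type-specific implementations (stand-ins for a typed API). */
static void scal_float(int n, const void *alpha, void *x)
{
    float a = *(const float *)alpha;
    float *xp = (float *)x;
    for (int i = 0; i < n; ++i) xp[i] *= a;
}

static void scal_double(int n, const void *alpha, void *x)
{
    double a = *(const double *)alpha;
    double *xp = (double *)x;
    for (int i = 0; i < n; ++i) xp[i] *= a;
}

/* Generic function pointer type stored in the array, and the specific
   function type it is cast back to before the call. */
typedef void (*void_fp)(void);
typedef void (*scalv_ft)(int n, const void *alpha, void *x);

/* Function pointer array indexed by datatype. */
static void_fp scalv_fpa[DT_COUNT] =
{
    (void_fp)scal_float,
    (void_fp)scal_double
};

/* Query function: returns the untyped pointer for a datatype. */
void_fp scalv_qfp(dt_t dt)
{
    assert((int)dt >= 0 && dt < DT_COUNT);
    return scalv_fpa[dt];
}

/* Object-style front end: query, typecast, call. No if/else chain. */
void scalv(dt_t dt, int n, const void *alpha, void *x)
{
    scalv_ft f = (scalv_ft)scalv_qfp(dt);
    f(n, alpha, x);
}
```

In a scheme like this, supporting another datatype means adding one array entry rather than extending a conditional in every front-end function.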
Details:
- Removed the dtime (delta time, or wallclock time) column from the
matlab output of all test drivers in test, test/3m4m, and test/studies.
This value was rarely (if ever) really needed and usually only served
to take up screen space.
- Updated format specifier in test/studies/skx to use %7.2f instead of
%6.3f.
- For the test drivers in the 'test' directory, added an initial line of
output that sets the last entry of the matlab matrix to zero in order
to induce a pre-allocation of the entire array of performance results.
Details:
- Changed the format specifier for the gflops column in the testsuite
output from %7.3f to %7.2f. This was done mainly to keep the output
aligned properly when the expected performance exceeded 1000 gflops.
Also, two decimal places still conveys plenty of precision for all
practical applications, including just eyeballing performance deltas
between two executions (let alone two implementations).
- Changed the format specifier for gflops in the test/3m4m drivers
from %6.3f to %7.2f (for the same reasons listed above).
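The alignment effect can be checked with a small sketch. The helper names below are hypothetical; only the two format specifiers come from this commit.

```c
#include <assert.h>
#include <stdio.h>

/* Number of characters produced by the old testsuite specifier. */
int width_old(double gflops)
{
    char buf[32];
    int n = snprintf(buf, sizeof buf, "%7.3f", gflops);
    assert(n >= 0);
    return n;
}

/* Number of characters produced by the new specifier. */
int width_new(double gflops)
{
    char buf[32];
    int n = snprintf(buf, sizeof buf, "%7.2f", gflops);
    assert(n >= 0);
    return n;
}
```

With %7.3f, a value such as 1234.567 needs eight characters and overflows the seven-character field, while %7.2f keeps it within seven.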
Details:
- Fixed a bug in the static functions bli_cntx_set_[packm/unpackm]_ker_dt(),
which were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt() to
get the corresponding func_t.
Details:
- Added links and sandbox language to README.md.
- Adjusted some comments in high-level level-3 object functions to make
clear what bli_thread_init_rntm() does.
Details:
- Updated the typed and object APIs to include language on the rntm_t
parameters in the expert interfaces.
- Updated README to include link to object API.
Details:
- Modified a few sections to take advantage of a feature of markdown
that allows a bullet or enumeration to have multiple paragraphs. This
is a trial run to make sure the indentation looks good when rendered
in a web browser.
Details:
- Consolidated typed API function prototypes in bli_l1v_tapi.h by
leveraging identical function signatures between operations.
- Removed 'restrict' keyword since it is not actually present in the
function definitions.
Details:
- Filled in remaining section on object creation function reference
of BLISObjectAPI.md. All object management functions demonstrated as
part of the example code in examples/oapi are now documented, as well
as some other functions that are not shown in the example code.
- Updated various links (mostly in function index) to correctly point to
the object API reference instead of the typed API reference.
- Added documentation for getijm and setijm.
Details:
- Added explicit typecasting to various functions (mostly static
functions), primarily those in bli_param_macro_defs.h,
bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
files.
- This change was prompted by feedback from Jacob Gorm Hansen, who
reported that #including "blis.h" from his application caused gcc
to output error messages (relating to types being returned
mismatching the declared return types) when used via the C++ compiler
front-end. This is the first pass of fixes, and we may need to
iterate with additional follow-up commits (#233).
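As a rough illustration of the kind of mismatch involved (an assumption about the general pattern, not a quote from the BLIS headers): a function that returns an enum but computes its result with integer arithmetic is valid C, yet C++ rejects the implicit integer-to-enum conversion. The enum and bit layout below are hypothetical.

```c
#include <assert.h>

/* Hypothetical enum and bit layout, for illustration only. */
typedef enum { MY_NO_TRANSPOSE = 0x0, MY_TRANSPOSE = 0x8 } my_trans_t;

#define MY_TRANS_BIT 0x8u

static_assert(MY_TRANSPOSE == MY_TRANS_BIT, "enum value must match the bit");

/* Without the cast, the return statement would read
       return info & MY_TRANS_BIT;
   which a C compiler accepts but a C++ compiler rejects, since the
   expression has an integer type and the function returns an enum. */
static inline my_trans_t my_extract_trans(unsigned info)
{
    return (my_trans_t)(info & MY_TRANS_BIT);
}
```

The explicit cast keeps the header usable from both languages without changing its behavior in C.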
Details:
- Fixed an unused variable warning in frame/base/bli_rntm.c when
multithreading is disabled.
- Fixed a missing variable declaration in bli_thread_init_rntm_from_env()
when multithreading is disabled.
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
field of the cntx_t (context). The thrloop array holds the number of
ways of parallelism (thread "splits") to extract per level-3
algorithmic loop until those values can be used to create a
corresponding node in the thread control tree (thrinfo_t structure),
which (for any given level-3 invocation) usually happens by the time
the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
when invoking level-3 operations from two or more application threads.
The race condition existed because the cntx_t, a pointer to which is
usually queried from the global kernel structure (gks), is supposed to
be read-only. However, the previous code would write to the cntx_t's
thrloop field *after* it had been queried, thus violating its read-only
status. In practice, this would not cause a problem when a sequential
application made a multithreaded call to BLIS, nor when two or more
application threads used the same parallelization scheme when calling
BLIS, because in either case all application threads would be using
the same ways of parallelism for each loop. The true effects of the
race condition were limited to situations where two or more application
threads used *different* parallelization schemes for any given level-3
call.
- In remedying the above race condition, the application or calling
library can now specify the parallelization scheme on a per-call basis.
All that is required is that the thread encode its request for
parallelism into the rntm_t struct prior to passing the address of the
rntm_t to one of the expert interfaces of either the typed or object
APIs. This allows, for example, one application thread to extract 4-way
parallelism from a call to gemm while another application thread
requests 2-way parallelism. Or, two threads could each request 4-way
parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
of the level-3 implementation stack (with the most notable exception
being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
APIs. (A few internal functions gained the rntm_t* parameter even
though they currently have no use for it, such as bli_l3_packm().)
This required some internal calls to some of those functions to
be updated since BLIS was already using those operations internally
via the expert interfaces. For situations where a rntm_t object is
not available, such as within packm/unpackm implementations, NULL is
passed in to the relevant expert interfaces. This is acceptable for
now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
read once, at library initialization. (Thanks to Nathaniel Smith for
suggesting this to avoid repeated calls to getenv(), which can be slow.)
Those values are recorded to a global rntm_t object. Public APIs, in
bli_thread.c, are still available to get/set these values from the
global rntm_t, though now the "set" functions have additional logic
to ensure that the values are set in a synchronous manner via a mutex.
If/when NULL is passed into an expert API (meaning the user opted to
not provide a custom rntm_t), the values from the global rntm_t are
copied to a local rntm_t, which is then passed down the function stack.
Calling a basic API is equivalent to calling the expert APIs with NULL
for the cntx and rntm parameters, which means the semantic behavior of
these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
and reimplemented, with the function now being able to treat the
incoming rntm_t in a manner agnostic to its origin--whether it came
from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
as the corresponding "get" functions. The new model simplifies these
interfaces so that one must either set the total number of threads, OR
set all of the ways of parallelism for each loop simultaneously (in a
single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
(and two specific ways within each method) of requesting parallelism
in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
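The global/local rntm_t model described above can be sketched in simplified form. Everything named my_* or MY_* below is hypothetical (including the environment variable names and the two-loop struct); the sketch also takes the mutex when copying the global object, which is an assumption rather than something stated in this commit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified runtime object: ways of parallelism for
   just two loops. */
typedef struct { int jc_ways; int ic_ways; } my_rntm_t;

static my_rntm_t       global_rntm       = { 1, 1 };
static pthread_mutex_t global_rntm_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t  env_once          = PTHREAD_ONCE_INIT;

/* Parse one (hypothetical) environment variable, defaulting to 1. */
static int env_ways(const char *name)
{
    const char *s = getenv(name);
    int v = s ? atoi(s) : 1;
    return v > 0 ? v : 1;
}

/* Runs exactly once, so getenv() is not called on every operation. */
static void read_env(void)
{
    global_rntm.jc_ways = env_ways("MY_JC_NT");
    global_rntm.ic_ways = env_ways("MY_IC_NT");
}

/* Global "set" function: all ways are set together, under a mutex. */
void my_thread_set_ways(int jc, int ic)
{
    assert(jc > 0 && ic > 0);
    pthread_once(&env_once, read_env);
    pthread_mutex_lock(&global_rntm_mutex);
    global_rntm.jc_ways = jc;
    global_rntm.ic_ways = ic;
    pthread_mutex_unlock(&global_rntm_mutex);
}

/* Expert interface: NULL means "use a private copy of the global
   settings". Shared state is never written here. Returns the total
   number of threads the call would use. */
int my_gemm_ex(const my_rntm_t *rntm)
{
    my_rntm_t local;
    pthread_once(&env_once, read_env);
    if (rntm == NULL) {
        pthread_mutex_lock(&global_rntm_mutex);
        local = global_rntm;
        pthread_mutex_unlock(&global_rntm_mutex);
    } else {
        local = *rntm;
    }
    return local.jc_ways * local.ic_ways;
}

/* Basic interface: equivalent to the expert one with NULL. */
int my_gemm(void) { return my_gemm_ex(NULL); }
```

Because each call works from its own local copy, two application threads can pass different my_rntm_t values (say, { 4, 1 } and { 2, 3 }) without affecting each other or the global settings.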