Commit Graph

259 Commits

Author SHA1 Message Date
Devin Matthews
29e6245816 Merge branch 'master' into win-pthreads 2018-10-16 10:12:25 -05:00
Devin Matthews
0b73209f6b Add missing argument to WaitForSingleObject and use $is_win in configure
to turn off pthreads.
2018-10-16 10:02:06 -05:00
Field G. Van Zee
dc5fd898af Merge branch 'amd' 2018-10-15 17:41:35 -05:00
Field G. Van Zee
53a9ab1c85 Renamed thread auto-factorization macro constants.
Details:
- Renamed the following C preprocessor macros whose fallback/default
  values are specified within frame/include/bli_kernel_macro_defs.h:

    BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
    BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
    BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
    BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

- Renamed the above cpp macro overrides within the knl, skx, and zen
  sub-configurations, as well as invocations of those macros in
  bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
  used by any code within BLIS.
2018-10-10 15:11:09 -05:00
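The fallback/override arrangement described in 53a9ab1c85 follows the usual C preprocessor pattern. A sub-configuration defines a macro first, and the generic header supplies a default only if the macro is still undefined. The sketch below is minimal. Its default values are illustrative placeholders, not the real defaults in frame/include/bli_kernel_macro_defs.h, and the helper functions are hypothetical:

```c
#include <assert.h>

/* A sub-configuration header (hypothetical) overrides one value before the
   generic defaults are processed. */
#define BLIS_THREAD_MAX_JR 8

/* Generic fallbacks: each applies only when no override exists.
   NOTE: the numeric values are illustrative placeholders. */
#ifndef BLIS_THREAD_MAX_IR
#define BLIS_THREAD_MAX_IR 1
#endif

#ifndef BLIS_THREAD_MAX_JR
#define BLIS_THREAD_MAX_JR 4
#endif

#ifndef BLIS_THREAD_RATIO_M
#define BLIS_THREAD_RATIO_M 2
#endif

#ifndef BLIS_THREAD_RATIO_N
#define BLIS_THREAD_RATIO_N 1
#endif

/* Hypothetical accessors used to observe which value won. */
static int thread_max_jr(void) { return BLIS_THREAD_MAX_JR; }
static int thread_max_ir(void) { return BLIS_THREAD_MAX_IR; }
```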
Field G. Van Zee
e2a59400bd Allow trsm_l parallelism in the jc loop.
Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
  loop. This was unnecessary and to some degree detrimental on some
  types of hardware. Now, any parallelism bound for the jc loop will be
  applied to the jc loop, while all other loops' parallelism is funneled
  to the jr loop. Thanks to Devangi Parikh for helping investigate this
  issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However, right-side
  trsm is currently implemented in terms of the left-side case, and
  thus the change effectively applies to both left and right cases.
2018-10-09 15:29:48 -05:00
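The loop-assignment policy from e2a59400bd can be sketched as a small transformation on the ways of parallelism requested for the five loops around the microkernel. The ways_t struct and trsm_l_adjust_ways() are hypothetical names used only for illustration, not BLIS code:

```c
#include <assert.h>

/* Hypothetical: ways of parallelism requested for each loop. */
typedef struct
{
    int jc, pc, ic, jr, ir;
} ways_t;

/* Keep the jc ways as requested; funnel every other loop's ways into jr. */
static ways_t trsm_l_adjust_ways(ways_t w)
{
    ways_t out;

    assert(w.jc >= 1 && w.pc >= 1 && w.ic >= 1 && w.jr >= 1 && w.ir >= 1);

    out.jc = w.jc;
    out.pc = 1;
    out.ic = 1;
    out.ir = 1;
    out.jr = w.pc * w.ic * w.jr * w.ir;
    return out;
}
```

The total thread count is unchanged (jc times the new jr equals the product of all requested ways). What changes relative to the previous behavior is that jc's ways are no longer folded into jr as well.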
Field G. Van Zee
f1dba506c9 Output threading status/params from testsuite.
Details:
- Updated testsuite to output various parameters related to parallelism
  in BLIS. These parameters include:
  - threading status: disabled, openmp, or pthreads;
  - thread partitioning for jr/ir loops: slab or rr (round-robin);
  - ways of parallelism from environment variables, and also actual
    values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
    square problems (assuming all dimensions are set to 1000);
  - automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
  libmemkind and the sandbox.
2018-10-08 17:59:41 -05:00
Field G. Van Zee
98e01ea04b Merge branch 'master' into amd 2018-10-04 20:44:12 -05:00
Field G. Van Zee
541b8a3b3e Removed 1h short-circuit from bli_clock_min_diff().
Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
  time delta was greater than 60 minutes. This was originally intended
  to disregard extremely large values under the assumption that the
  user probably didn't intend to run a test that long. However, since
  it is in bli_clock_min_diff(), it doesn't actually help short-circuit
  an implementation that is hanging or looping infinitely, since such
  an implementation would first have to finish before
  bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
  reporting this issue.
2018-10-04 20:39:06 -05:00
Devin Matthews
d0c0c20b7b There seems to be a problem with _POSIX_BARRIERS on Travis. 2018-10-02 15:16:00 -05:00
Devin Matthews
0904d9e4df *Always* use Windows primitives instead of pthreads. 2018-10-02 15:04:36 -05:00
Devin Matthews
627d0c5bfd Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows. 2018-10-02 14:40:55 -05:00
Devin Matthews
81d2c064a2 Add wrapper for basic pthreads functionality (mutex, once) with MSVC. 2018-10-02 11:46:36 -05:00
Field G. Van Zee
ac18949a4b Multithreading optimizations for l3 macrokernels.
Details:
- Adjusted the method by which micropanels are assigned to threads in
  the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
  employ contiguous "slab" partitioning rather than interleaved (round
  robin) partitioning. The new partitioning schemes and related details
  for specific families of operations are listed below:
  - gemm: slab partitioning.
  - herk: slab partitioning for region corresponding to non-triangular
          region of C; round robin partitioning for triangular region.
  - trmm: slab partitioning for region corresponding to non-triangular
          region of B; round robin partitioning for triangular region.
          (NOTE: This affects both left- and right-side macrokernels:
          trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
  - trsm: slab partitioning.
          (NOTE: This affects only the left-side macrokernels trsm_ll
          and trsm_lu; right-side macrokernels were not touched.)
  Also note that the previous macrokernels were preserved inside of
  the 'other' directory of each operation family directory (e.g.
  frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
  and fixed a stale function pointer type in blx_gemm_int.c
  (gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
  and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
  and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
2018-09-30 18:54:56 -05:00
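The slab vs. round-robin distinction in ac18949a4b can be made concrete with a minimal sketch of both assignment rules for n micropanels and nt threads. The helper names are hypothetical. The sketch does not attempt to reproduce the exact range arithmetic of the bli_thread_range*() APIs:

```c
#include <assert.h>

/* Slab: thread t of nt gets one contiguous range [*start, *end) of the
   n micropanels; the first (n % nt) threads receive one extra panel. */
static void range_slab(int n, int nt, int t, int *start, int *end)
{
    assert(nt > 0 && t >= 0 && t < nt);

    int base  = n / nt;
    int extra = n % nt;

    *start = t * base + (t < extra ? t : extra);
    *end   = *start + base + (t < extra ? 1 : 0);
}

/* Round robin: thread t handles panels t, t + nt, t + 2*nt, ... */
static int owns_rr(int i, int nt, int t)
{
    assert(nt > 0);
    return i % nt == t;
}
```

With slab partitioning each thread touches one contiguous run of micropanels. Round robin interleaves them, which tends to balance work better when per-panel cost varies, as it does across a triangular region.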
Field G. Van Zee
c03728f1f4 Various minor cleanups.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
  unconditionally (with different definitions for Windows and
  non-Windows), and then disabled the definition of bli_setenv()
  entirely since BLIS no longer needs to set environment variables.
  Updated bli_winsys.h accordingly, and call bli_sleep() from within
  the testsuite instead of calling sleep() directly.
- Use
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
  instead of
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
  when guarding against local definition of pthread barrier in
  testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
  should always be set to 200809L when barriers are supported, though I
  won't be surprised if we encounter a case in the future where it is
  set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
  top-level Makefile. These are no longer needed because we no longer
  output libraries with the version and configuration name as
  substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
  configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
2018-09-10 17:54:27 -05:00
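When the _POSIX_BARRIERS guard above evaluates to true, the testsuite needs a locally defined barrier. Below is a minimal sketch of such a fallback. It assumes a hypothetical tb_barrier_t type built from a pthread mutex and condition variable, and it is not the testsuite's actual code (link with -pthread where the toolchain requires it):

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical fallback barrier for systems lacking pthread_barrier_t. */
typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    unsigned        count;      /* threads required to open the barrier */
    unsigned        waiting;    /* threads currently waiting */
    unsigned        generation; /* incremented each time the barrier opens */
} tb_barrier_t;

static int tb_barrier_init(tb_barrier_t *b, unsigned count)
{
    if (count == 0) return -1;
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->count      = count;
    b->waiting    = 0;
    b->generation = 0;
    return 0;
}

static void tb_barrier_destroy(tb_barrier_t *b)
{
    pthread_cond_destroy(&b->cond);
    pthread_mutex_destroy(&b->mutex);
}

/* Returns 1 in exactly one thread per cycle (analogous to
   PTHREAD_BARRIER_SERIAL_THREAD) and 0 in the others. */
static int tb_barrier_wait(tb_barrier_t *b)
{
    pthread_mutex_lock(&b->mutex);
    unsigned gen = b->generation;

    if (++b->waiting == b->count)
    {
        /* Last arrival: open the barrier and reset for the next cycle. */
        b->generation++;
        b->waiting = 0;
        pthread_cond_broadcast(&b->cond);
        pthread_mutex_unlock(&b->mutex);
        return 1;
    }

    /* Loop to guard against spurious wakeups. */
    while (gen == b->generation)
        pthread_cond_wait(&b->cond, &b->mutex);

    pthread_mutex_unlock(&b->mutex);
    return 0;
}
```

The generation counter lets a waiting thread tell a genuine release apart from a spurious wakeup. It also keeps the barrier reusable across cycles.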
Isuru Fernando
e93b01ff60 Windows DLL support (#246)
* Enable shared

* Enable rdp

* Add support for dll

* Use libblis-symbols.def

* Fix building dlls

* Fix libblis-symbols.def

* Fix soname

* Fix Makefile error

* Fix install target

* Fix missing symbols

* Add BLIS_MINUS_TWO

* Add path to dll

* Fix OSX soname

* Add declspec for dll

* Add -DBLIS_BUILD_DLL

* Replace @enable_shared@ in config

* switch to auto for now

* blis_ -> bli_

* Remove BLIS_BUILD_DLL in make check

* change auto->haswell

* enable_shared_01

* Add wno-macro-redefined

* print out.cblat3

* BLIS_BUILD_DLL -> BLIS_IS_BUILDING_LIBRARY

* Use V=1

* Remove fpic for windows

* Remember LIBPTHREAD

* Remove libm for windows

* Remember AR

* Fix remembering libpthread

* Add Wno-maybe-uninitialized in only gcc

* Don't do blastest for shared for now

* Fix install target

And remove unnecessary change

* test auto and x86_64

* Fix install target again

* Use IS_WIN variable

* Remove leading dot from LIBBLIS_SO_MAJ_EXT

* Make is_win yes/no

* Add comments for windows builds

* Change if else blocks location
2018-09-09 15:57:43 -05:00
Field G. Van Zee
fb81c7fc66 Defined cortexa53 sub-configuration.
Details:
- Added a new sub-configuration 'cortexa53', which is a mirror image
  of cortexa57 except that it will use slightly different compiler
  flags. Thanks to Mathieu Poumeyrol for making this suggestion after
  discovering that the compiler flags being used by cortexa57 were
  not working properly in certain OS X environments (the fix to which
  is currently pending in pull request #245).
2018-09-06 16:29:39 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occur in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
10d07357af Better thread safety; added threading to testsuite.
Details:
- Replaced critical sections that were conditional upon multithreading
  being enabled (via pthreads or OpenMP) with unconditional use of
  pthreads mutexes. (Why pthreads? Because BLIS already requires it
  for its initialization mechanism: pthread_once().) This was done in
  bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
  mtx_t object and bli_mutex_*() API with pthread mutexes in
  bli_thread.c. The previous status quo could result in a race condition
  if the application called BLIS from more than one thread. The new
  pthread-based code should be completely agnostic to the application's
  threading configuration. Thanks to AMD for bringing to our attention
  the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
  multithreading. Specifically, each thread maintains a counter that is
  incremented after each experiment. The thread only executes the
  experiment if: counter % n_threads == thread_id. In other words, the
  threads simply take turns executing each problem experiment. Also,
  POSIX guarantees that fprintf() will not intermingle output, so
  output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and
  replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
  wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
  a race condition; specifically, two threads calling the function with
  the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
  bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
  to (and updating existing lines of) source files.
2018-08-26 20:34:30 -05:00
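The turn-taking rule for simulated application threads in 10d07357af can be sketched as follows. runs_experiment() and experiments_for_thread() are hypothetical helpers, not testsuite functions:

```c
#include <assert.h>

/* A thread executes an experiment only when
   counter % n_threads == thread_id. */
static int runs_experiment(unsigned counter, unsigned n_threads,
                           unsigned thread_id)
{
    assert(n_threads > 0 && thread_id < n_threads);
    return counter % n_threads == thread_id;
}

/* Count how many of n_experiments a given thread would execute if it
   walked the whole list, incrementing its counter after each one. */
static unsigned experiments_for_thread(unsigned n_experiments,
                                       unsigned n_threads,
                                       unsigned thread_id)
{
    unsigned executed = 0;

    for (unsigned counter = 0; counter < n_experiments; ++counter)
        if (runs_experiment(counter, n_threads, thread_id))
            ++executed;

    return executed;
}
```

With 10 experiments and 3 threads, thread 0 runs experiments 0, 3, 6, and 9, while threads 1 and 2 run three each.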
Field G. Van Zee
658f0a129b Fixed obscure integer size bug in va_arg() usage.
Details:
- Fixed a bug in the way that the variadic bli_cntx_set_l3_nat_ukrs()
  function was defined. This function is meant to take a microkernel id,
  microkernel datatype, microkernel address, and microkernel preference
  as arguments, and is typically called within the bli_cntx_init_*()
  function defined within a sub-configuration for initializing an
  appropriate context. The problem is with the final argument: the
  microkernel preference. These preferences are actually boolean values,
  0 or 1 (encoded as FALSE or TRUE). Since the variadic function does
  not give the compiler any type information for any variadic arguments,
  they are "promoted" in the course of internal (macroized) processing
  according to default argument promotion rules. Thus, integer literals
  such as 0 and 1 become int and floating-point literals (such as 0.0 or
  1.0) become double. Previous to this commit, we indicated to va_arg()
  that the ukernel preference was a 'bool_t', which is a typedef of
  int64_t on 64-bit systems. On systems where int is defined as 64 bits,
  no problems manifest since int is the same size as the type we passed
  in to va_arg(), but on systems where int is 32 bits, the ukernel
  preference could be misinterpreted as a garbage value. (This was
  observed on a modern armv8 system.) The fix was to interpret the
  bool_t value as int and then immediately typecast it to and store it
  as a bool_t. Special thanks to Devangi Parikh for helping track down
  this issue, including deciphering the use of va_arg() and its
  byzantine treatment of types.
- Added explicit typecasts for all invocations of va_arg() in
  bli_cntx.c.
2018-08-24 17:49:37 -05:00
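The va_arg() pitfall fixed in 658f0a129b can be reproduced in a few lines. The sketch assumes a hypothetical variadic function and uses int64_t as a stand-in for a 64-bit bool_t. The point is that va_arg() must name the promoted type the caller actually passed (int, for the literals 0 and 1), with any conversion done afterward:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Stand-in for a 64-bit boolean typedef such as the bool_t described above. */
typedef int64_t bool64_t;

/* Hypothetical: takes `count` preference flags passed as the literals 0/1. */
static int64_t count_true_prefs(int count, ...)
{
    va_list args;
    int64_t n_true = 0;

    va_start(args, count);
    for (int i = 0; i < count; ++i)
    {
        /* Wrong: va_arg(args, bool64_t) names a type that does not match
           the int the caller passed; with a 32-bit int this reads garbage
           (and is undefined behavior in any case).
           Right: read the promoted type (int), then convert. */
        bool64_t pref = (bool64_t)va_arg(args, int);
        n_true += pref ? 1 : 0;
    }
    va_end(args);

    return n_true;
}
```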
Field G. Van Zee
e71dc38912 Fixed a very minor memory leak in gks.
Details:
- Fixed a memory leak in the global kernel structure that leaked 56
  bytes per configured architecture (of which only 18 are presently
  supported by BLIS). The leak would only manifest if BLIS was
  initialized and then finalized before the application terminated.
  Thanks to Devangi Parikh for helping track down this leak.
2018-08-24 15:56:04 -05:00
Field G. Van Zee
a7e3a5f975 Fixed uncallable bli_finalize().
Details:
- Previously, bli_finalize_once()--which, like bli_init_once(), was
  implemented in terms of pthread_once()--was using the same
  pthread_once_t control object being used by bli_init(), thus
  guaranteeing that it would never be called as long as BLIS had already
  been initialized. This could manifest as a rather large memory leak to
  any application that attempted to finalize BLIS midway through its
  execution (since BLIS reserves several megabytes of storage for
  packing buffers per thread used). The fix entailed giving each
  function its own pthread_once_t object. Thanks to Devangi Parikh for
  helping track down this very quiet bug.
2018-08-24 14:51:11 -05:00
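The once-control mix-up fixed in a7e3a5f975 can be illustrated with two routines guarded by pthread_once(). The function and variable names below are hypothetical, not the BLIS implementation:

```c
#include <assert.h>
#include <pthread.h>

static int initialized    = 0;
static int finalize_calls = 0;

static void init_impl(void)     { initialized = 1; }
static void finalize_impl(void) { initialized = 0; ++finalize_calls; }

/* Each routine gets its own control object. If finalize_once() reused
   init_control, pthread_once() would treat it as already completed after
   the first init_once(), and finalize_impl() would never run. */
static pthread_once_t init_control     = PTHREAD_ONCE_INIT;
static pthread_once_t finalize_control = PTHREAD_ONCE_INIT;

static void init_once(void)     { pthread_once(&init_control, init_impl); }
static void finalize_once(void) { pthread_once(&finalize_control, finalize_impl); }
```

As sketched, each routine runs at most once per process.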
Devangi N. Parikh
6074082cd3 Fixed bug in bli_cntx_set_packm_ker_dt() implementation.
Details:
- Fixed a bug in the static functions bli_cntx_set_[packm/unpackm]_ker_dt(),
  which were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt() to
  get the corresponding func_t.
2018-08-01 13:30:51 -05:00
Field G. Van Zee
b7db293323 Explicitly typecast return vals in static funcs.
Details:
- Added explicit typecasting to various functions (mostly static
  functions), primarily those in bli_param_macro_defs.h,
  bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
  files.
- This change was prompted by feedback from Jacob Gorm Hansen, who
  reported that #including "blis.h" from his application caused gcc
  to output error messages (relating to types being returned
  mismatching the declared return types) when used via the C++ compiler
  front-end. This is the first pass of fixes, and we may need to
  iterate with additional follow-up commits (#233).
2018-07-19 11:14:30 -05:00
Field G. Van Zee
fa08e5ead9 Fixed minor issues in ecbebe7 with mt disabled.
Details:
- Fixed an unused variable warning in frame/base/bli_rntm.c when
  multithreading is disabled.
- Fixed a missing variable declaration in bli_thread_init_rntm_from_env()
  when multithreading is disabled.
2018-07-17 19:02:15 -05:00
Field G. Van Zee
ecbebe7c2e Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
  field of the cntx_t (context). The thrloop array holds the number of
  ways of parallelism (thread "splits") to extract per level-3
  algorithmic loop until those values can be used to create a
  corresponding node in the thread control tree (thrinfo_t structure),
  which (for any given level-3 invocation) usually happens by the time
  the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
  when invoking level-3 operations from two or more application threads.
  The race condition existed because the cntx_t, a pointer to which is
  usually queried from the global kernel structure (gks), is supposed to
  be read-only. However, the previous code would write to the cntx_t's
  thrloop field *after* it had been queried, thus violating its read-only
  status. In practice, this would not cause a problem when a sequential
  application made a multithreaded call to BLIS, nor when two or more
  application threads used the same parallelization scheme when calling
  BLIS, because in either case all application threads would be using
  the same ways of parallelism for each loop. The true effects of the
  race condition were limited to situations where two or more application
  threads used *different* parallelization schemes for any given level-3
  call.
- In remedying the above race condition, the application or calling
  library can now specify the parallelization scheme on a per-call basis.
  All that is required is that the thread encode its request for
  parallelism into the rntm_t struct prior to passing the address of the
  rntm_t to one of the expert interfaces of either the typed or object
  APIs. This allows, for example, one application thread to extract 4-way
  parallelism from a call to gemm while another application thread
  requests 2-way parallelism. Or, two threads could each request 4-way
  parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
  of the level-3 implementation stack (with the most notable exception
  being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
  APIs. (A few internal functions gained the rntm_t* parameter even
  though they currently have no use for it, such as bli_l3_packm().)
  This required some internal calls to some of those functions to
  be updated since BLIS was already using those operations internally
  via the expert interfaces. For situations where a rntm_t object is
  not available, such as within packm/unpackm implementations, NULL is
  passed in to the relevant expert interfaces. This is acceptable for
  now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
  environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
  read once, at library initialization. (Thanks to Nathaniel Smith for
  suggesting this to avoid repeated calls to getenv(), which can be slow.)
  Those values are recorded to a global rntm_t object. Public APIs, in
  bli_thread.c, are still available to get/set these values from the
  global rntm_t, though now the "set" functions have additional logic
  to ensure that the values are set in a synchronous manner via a mutex.
  If/when NULL is passed into an expert API (meaning the user opted to
  not provide a custom rntm_t), the values from the global rntm_t are
  copied to a local rntm_t, which is then passed down the function stack.
  Calling a basic API is equivalent to calling the expert APIs with NULL
  for the cntx and rntm parameters, which means the semantic behavior of
  these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
  and reimplemented, with the function now being able to treat the
  incoming rntm_t in a manner agnostic to its origin--whether it came
  from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
  parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
  as the corresponding "get" functions. The new model simplifies these
  interfaces so that one must either set the total number of threads, OR
  set all of the ways of parallelism for each loop simultaneously (in a
  single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
  (and two specific ways within each method) of requesting parallelism
  in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
2018-07-17 18:37:32 -05:00
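The "NULL means use the global settings" convention described in ecbebe7c2e can be sketched as below. rntm_sketch_t, expert_op(), and set_global_num_threads() are hypothetical stand-ins, not the BLIS rntm_t API. Taking the mutex while copying the global object is an assumption of this sketch:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical runtime object: total threads plus ways of parallelism
   for the jc, pc, ic, jr, and ir loops. Not the actual rntm_t. */
typedef struct
{
    int num_threads;
    int ways[5];
} rntm_sketch_t;

/* Global defaults, e.g. read once from the environment at init time. */
static rntm_sketch_t   global_rntm       = { 1, { 1, 1, 1, 1, 1 } };
static pthread_mutex_t global_rntm_mutex = PTHREAD_MUTEX_INITIALIZER;

/* "Set" functions update the global object under a mutex. */
static void set_global_num_threads(int nt)
{
    pthread_mutex_lock(&global_rntm_mutex);
    global_rntm.num_threads = nt;
    pthread_mutex_unlock(&global_rntm_mutex);
}

/* Expert-style entry point: NULL means "use the global settings", in
   which case a local copy is made so the callee never writes to shared
   state. Returns the thread count it would use. */
static int expert_op(const rntm_sketch_t *rntm)
{
    rntm_sketch_t local;

    if (rntm == NULL)
    {
        pthread_mutex_lock(&global_rntm_mutex);
        local = global_rntm;
        pthread_mutex_unlock(&global_rntm_mutex);
    }
    else
    {
        local = *rntm;
    }

    /* ... the level-3 implementation would receive &local here ... */
    return local.num_threads;
}
```

Because the callee only sees a private copy, two application threads can pass different settings (or NULL) concurrently without writing to shared state.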
Field G. Van Zee
c422a5cd19 Merge branch 'dev' 2018-07-05 12:33:35 -05:00
Isuru Fernando
b6470262ea Remove windows.h in bli_winsys.c (#229)
Looks like it is unneeded.
2018-07-04 20:14:29 -05:00
Field G. Van Zee
89e178ce38 Merge branch 'master' into dev 2018-07-04 17:51:16 -05:00
Isuru Fernando
14648e1376 Native windows support using clang (#227)
* Add appveyor file

* Build script

* Remove fPIC for now

* copy as

* set CC and CXX

* Change the order of immintrin.h

* Fix testsuite header

* Move testsuite defs to .c

* Fix appveyor file

* Remove fPIC again and fix strerror_r missing bug

* Remove appveyor script

* cd to blis directory

* Fix sleep implementation

* Add f2c_types_win.h

* Fix f2c compilation

* Remove rdp and rename appveyor.yml

* Remove setenv declaration in test header

* set CPICFLAGS to empty

* Fix another immintrin.h issue

* Escape CFLAGS and LDFLAGS

* Fix more ?mmintrin.h issues

* Build x86_64 in appveyor

* override LIBM LIBPTHREAD AR AS

* override pthreads in configure

* Move windows definitions to bli_winsys.h

* Fix LIBPTHREAD default value

* Build intel64 in appveyor for now
2018-07-04 17:48:42 -05:00
Field G. Van Zee
d868eb3e20 Implemented bli_obj_scalar_cast_to().
Details:
- Implemented bli_obj_scalar_cast_to(), which will typecast the value in
  the internal scalar of an obj_t to a specified datatype.
- Changed bli_obj_scalar_attach() so that the scalar value being attached
  is first typecast to the storage datatype of the destination object
  rather than the target datatype.
- Reformatted function type signatures in bli_obj_scalar.c as well as
  prototypes in its corresponding header file.
2018-06-29 12:36:04 -05:00
Field G. Van Zee
bd8c55fe26 Added dt_on_output field to auxinfo_t.
Details:
- Added a new field to the auxinfo_t struct that can be used, in theory,
  to request type conversion before the microkernel stores/accumulates
  its microtile back to memory.
- Added the appropriate get/set static functions to bli_type_defs.h.
2018-06-27 15:52:37 -05:00
Field G. Van Zee
e88a5b8da8 Implemented castm, castv operations.
Details:
- Implemented castm and castv operations, which behave like copym and
  copyv except where the obj_t operands can be of different datatypes.
  These new operations, however, unlike copym/copyv, do not build upon
  existing level-1v kernels.
- Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
  match the newly added frame/base/cast directory).
- Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
  that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
  combinations. Previously, one had to invoke two additional macros--one
  which mixed domains only and another that included all remaining
  cases--in order to get full type combination coverage.
- Defined a new static function, bli_set_dims_incs_2m(), to aid in the
  setting of various variables in the implementations of bli_??castm().
  This static function joins others like it in bli_param_macro_defs.h.
- Comment update to bli_copysc.h.
2018-06-18 15:56:26 -05:00
Field G. Van Zee
f88c2e7a53 Defined static function bli_blksz_scale_def_max().
Details:
- Added a new static function to bli_blksz.h that scales both the default
  (regular) blocksize as well as the maximum blocksize in the blksz_t
  object. Reminder: maximum blocksizes have different meanings in
  different contexts. For register blocksizes, they refer to the packing
  register blocksizes (PACKMR or PACKNR) while for cache blocksizes, they
  refer to the maximum blocksize to use during the final iteration of a
  loop.
2018-06-13 18:27:46 -05:00
Field G. Van Zee
87db5c048e Changed usage of virtual microkernel slots in cntx.
Details:
- Changed the way virtual microkernels are handled in the context.
  Previously, there were query routines such as bli_cntx_get_l3_ukr_dt()
  which returned the native ukernel for a datatype if the method was
  equal to BLIS_NAT, or the virtual ukernel for that datatype if the
  method was some other value. Going forward, the context native and
  virtual ukernel slots will both be initialized to native ukernel
  function pointers for native execution, and for non-native execution
  the virtual ukernel pointer will be something else. This allows us
  to always query the virtual ukernel slot (from within, say, the
  macrokernel) without needing any logic in the query routine to decide
  which function pointer (native or virtual) to return. (Essentially,
  the logic has been shifted to init-time instead of compute-time.)
  This scheme will also allow generalized virtual ukernels as a way
  to insert extra logic in between the macrokernel and the native
  microkernel.
- Initialize native contexts (in bli_cntx_ref.c) with native ukernel
  function addresses stored to the virtual ukernel slots pursuant to
  the above policy change.
- Renamed all static functions that were native/virtual-ambiguous, such
  as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt(),
  pursuant to the above policy change. Those routines now use the
  substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All
  of these functions were static functions defined in bli_cntx.h, and
  most uses were in level-3 front-ends and macrokernels.
- Deprecated anti_pref bool_t in context, along with related functions
  such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's
  panel-block execution is disabled.
2018-06-12 19:38:37 -05:00
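The init-time policy from 87db5c048e, in which the virtual slot always holds a callable kernel, can be sketched with a simplified context. The types, the int-to-int kernel signature, and induced_ukr() are hypothetical simplifications:

```c
#include <assert.h>

typedef int (*ukr_ft)(int); /* simplified stand-in for a ukernel type */

static int native_ukr(int x)  { return x + 1; }
/* Hypothetical "virtual" kernel that inserts extra logic before calling
   the native kernel (e.g. for a non-native/induced method). */
static int induced_ukr(int x) { return native_ukr(2 * x); }

typedef struct
{
    ukr_ft nat_ukr; /* native microkernel */
    ukr_ft vir_ukr; /* slot the macrokernel always queries */
} cntx_sketch_t;

/* The native-vs-virtual decision is made once, at init time. */
static void cntx_sketch_init(cntx_sketch_t *c, int native_execution)
{
    c->nat_ukr = native_ukr;
    c->vir_ukr = native_execution ? native_ukr : induced_ukr;
}

/* Compute-time query: no branching on the method. */
static ukr_ft cntx_sketch_get_vir_ukr(const cntx_sketch_t *c)
{
    return c->vir_ukr;
}
```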
Field G. Van Zee
dbaf440540 Merge branch 'master' into dev 2018-06-11 12:37:04 -05:00
Field G. Van Zee
043d0cd37e Implemented bli_acquire_mpart(), added example code.
Details:
- Implemented bli_acquire_mpart(), a general-purpose submatrix view
  function that will alias an obj_t to be a submatrix "view" of an
  existing obj_t.
- Renumbered examples in examples/oapi and inserted a new example file,
  03obj_view.c, which shows how to use bli_acquire_mpart() to obtain
  submatrix views of existing objects, which can then be used to
  indirectly modify the parent object.
2018-06-09 13:46:49 -05:00
Field G. Van Zee
65fae95074 Implemented bli_setrm, _setim, _setrv, _setiv.
Details:
- Defined new wrappers to setm/setv operations in frame/base/bli_setri.c
  that will target only the real or only the imaginary parts of a
  matrix/vector object.
- Updated bli_obj_real_part() so that the complex-specific portions of
  the function are not executed if the object is real.
- Defined bli_obj_imag_part().
  - Caveat: If bli_obj_imag_part() is called on a real object, it does
    nothing, leaving the destination object untouched. The caller must
    take care to only call the function on complex objects.
- Reordered some of the static functions in bli_obj_macro_defs.h related
  to aliasing.
2018-06-07 17:41:09 -05:00
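Real-only or imaginary-only wrappers such as those added in 65fae95074 can work by aliasing complex data as a strided real view. The sketch below shows that approach. It assumes interleaved complex storage (real, imaginary, real, ...) and uses a hypothetical dview_t type rather than obj_t. It is not BLIS's implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strided view of real values. */
typedef struct
{
    double   *buf;
    size_t    n;
    ptrdiff_t inc; /* stride in units of double */
} dview_t;

/* Interleaved complex storage: re0, im0, re1, im1, ...
   The real part is a stride-2 view at offset 0, the imaginary part a
   stride-2 view at offset 1. */
static dview_t real_part(double *cbuf, size_t n)
{
    dview_t v = { cbuf, n, 2 };
    return v;
}

static dview_t imag_part(double *cbuf, size_t n)
{
    dview_t v = { cbuf + 1, n, 2 };
    return v;
}

/* A setv-like operation applied through a view. */
static void setv_view(dview_t v, double alpha)
{
    assert(v.buf != NULL);
    for (size_t i = 0; i < v.n; ++i)
        v.buf[(ptrdiff_t)i * v.inc] = alpha;
}
```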
Field G. Van Zee
513138b1a1 Defined/implemented bli_projv().
Details:
- Added an implementation for bli_projv() to go along with the
  implementation of bli_projm() added in 0a4a27e. The only difference
  between the two is that bli_projv() may only be used on vectors,
  whereas bli_projm() is general-purpose.
- Added a _check() function corresponding to bli_projv().
2018-06-07 12:24:47 -05:00
Field G. Van Zee
b5a641e968 Added char-to-dt and dt-to-char mapping functions.
Details:
- Defined additional functions in bli_param_map.c:
    bli_param_map_char_to_blis_dt()
    bli_param_map_blis_to_char_dt()
  which will map a char to its corresponding num_t, or vice versa.
2018-06-06 19:05:37 -05:00
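Here is a sketch of the char/datatype mapping added in b5a641e968. It assumes the conventional BLAS type characters 's', 'd', 'c', and 'z' and uses a hypothetical enum in place of num_t:

```c
#include <assert.h>

/* Hypothetical datatype enum standing in for num_t. */
typedef enum
{
    DT_FLOAT,
    DT_DOUBLE,
    DT_SCOMPLEX,
    DT_DCOMPLEX,
    DT_INVALID
} dt_sketch_t;

static dt_sketch_t char_to_dt(char c)
{
    switch (c)
    {
        case 's': return DT_FLOAT;
        case 'd': return DT_DOUBLE;
        case 'c': return DT_SCOMPLEX;
        case 'z': return DT_DCOMPLEX;
        default:  return DT_INVALID;
    }
}

static char dt_to_char(dt_sketch_t dt)
{
    /* Indexed by enum value; order must match dt_sketch_t. */
    static const char map[] = { 's', 'd', 'c', 'z' };
    return dt < DT_INVALID ? map[dt] : '?';
}
```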
Field G. Van Zee
0a4a27e1a4 Defined/implemented bli_projm().
Details:
- Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
  behaves like bli_copym(), except that operands a and b are allowed to
  contain data of differing domains (e.g. a is real while b is complex,
  or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
  with the intention that a 'v' vector version of the function may be
  added to the same file (at some point in the future).
- Added supporting bli_check_*() functions in bli_check.c to confirm
  consistent precisions between two datatypes/objects, as well as the
  appropriate error message in bli_error.c and a new error code in
  bli_type_defs.h.
- Wrote a bli_projm_check() function to go along with bli_projm().
- Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
  which will initialize an obj_t alias to the real part of the source
  object.
- Fixed a bug in the static function bli_dt_proj_to_complex(), found
  in bli_param_macro_defs.h. Thankfully, there were no calls to the
  function to produce buggy behavior.
2018-06-06 19:02:29 -05:00
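For contiguous vectors of the same precision, the domain projection performed by bli_projm() (0a4a27e1a4) amounts to the following. This is a hypothetical sketch using C99 complex types, not the BLIS implementation:

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Real -> complex: copy each value; the imaginary part becomes zero
   (C converts a real value to complex with a zero imaginary part). */
static void proj_real_to_complex(const double *a, double complex *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        b[i] = (double complex)a[i];
}

/* Complex -> real: keep the real part, discard the imaginary part. */
static void proj_complex_to_real(const double complex *a, double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        b[i] = creal(a[i]);
}
```

Both directions keep the precision (double) fixed and change only the domain. This mirrors the consistent-precision check that the commit adds to bli_check.c.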
Field G. Van Zee
5df201260f Merge branch 'master' into dev 2018-06-05 16:14:19 -05:00
Tyler Michael Smith
96d2774b4c Make bli_auxinfo_next_b() return b_next, not a_next (#216) 2018-06-05 07:17:39 -05:00
Field G. Van Zee
ed7dedfd4a Merge branch 'master' into dev 2018-06-02 20:29:53 -05:00
Field G. Van Zee
f97a86f322 Updated setting/querying pack schema (cntx->cntl).
- Query pack schemas in level-3 bli_*_front() functions and store those
  values in the schema bitfields of the corresponding obj_t's when the
  cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default
  native schemas are stored to the obj_t's.)
- In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in
  bli_*_front(), clear the schema bitfields, and pass the queried values
  into bli_gemm_cntl_create() and bli_trsm_cntl_create().
- Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to
  take schemas for A and B, and use these values to initialize the
  appropriate control tree nodes. (Also cpp-disabled the panel-block cntl
  tree creation variant, bli_gemmpb_cntl_create(), as it has not been
  employed by BLIS in quite some time.)
- Simplified querying of schema in bli_packm_init() thanks to above
  changes.
- Updated openmp and pthreads definitions of bli_l3_thread_decorator()
  so that thread-local aliases of matrix operands are guaranteed, even
  if aliasing is disabled within the internal back-end functions (e.g.
  bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c
  explaining why the extra aliasing is not needed there.
- Changed bli_gemm() and level-3 friends so that the operation's ind()
  function is called only if all matrix operands have the same datatype,
  and only if that datatype is complex. The former condition is needed
  in preparation for work related to mixed domain operands, while the
  latter helps with readability, especially for those who don't want to
  venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be
  consistent with BLIS calling conventions (modified argument(s) are
  last), and updated all invocations in the level-3 _front() functions.
- Comment updates to bli_cntx_set_thrloop_from_env().
2018-06-02 20:28:20 -05:00
Field G. Van Zee
965db85d29 Updated macro invocations in bli_gemm_ker_var2.c.
Details:
- Updated "get next a/b micropanel" macro invocations in
  bli_gemm_ker_var2.c according to changes in 9588625.
- Comment update in bli_cntx.c.
2018-06-01 12:32:15 -05:00
Field G. Van Zee
469727d4f8 Very minor comment updates. 2018-05-25 16:17:13 -05:00
Field G. Van Zee
5140ee3424 Updated types of bli_is_[un]aligned_to() functions.
Details:
- Changed the void* arguments of the following static functions:
    bli_is_aligned_to()
    bli_is_unaligned_to()
    bli_offset_past_alignment()
  to siz_t, and the return type of bli_offset_past_alignment() from
  guint_t to siz_t. This allows for more versatile usage of these
  functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
  but also in kernels/bgq, to include explicit typecasts to siz_t when
  pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
  #211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
  various kernel license headers, likely from a malformed recursive sed
  performed long ago.
2018-05-23 16:56:14 -05:00
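Taking an integer type instead of void*, as in 5140ee3424, lets the same helpers check both pointers and byte-sized leading dimensions. A minimal sketch follows, using size_t in place of siz_t. All function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer-valued alignment helpers (size_t stands in for siz_t). */
static int is_aligned_to(size_t value, size_t align)
{
    assert(align > 0);
    return value % align == 0;
}

static int is_unaligned_to(size_t value, size_t align)
{
    assert(align > 0);
    return value % align != 0;
}

static size_t offset_past_alignment(size_t value, size_t align)
{
    assert(align > 0);
    return value % align;
}

/* Pointers are passed in via an explicit cast, mirroring the typecasts
   the commit added at call sites. */
static int ptr_is_aligned_to(const void *p, size_t align)
{
    return is_aligned_to((size_t)(uintptr_t)p, align);
}
```

Converting a pointer through uintptr_t is implementation-defined in C. On mainstream platforms, however, it yields the address, which is what the alignment test needs.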
Field G. Van Zee
962a706a6f Updated LICENSE file to mention HP Enterprise.
Details:
- Added HP Enterprise to the LICENSE file. Previously, only the source
  files touched by HPE contained the corresponding copyright notices.
  (This oversight was unintentional.)
- Updated file-level copyright notices to include a comma, to match
  the formatting used for UT and AMD copyrights.
2018-05-18 18:19:40 -05:00
Field G. Van Zee
af244194e7 Removed explicit critical sec. from bli_memsys.c.
Details:
- Removed critical sections protecting the initialization/finalization of
  bli_memsys.c. These synchronization mechanisms are no longer needed now
  that BLIS initializes all APIs via pthread_once().
2018-05-17 15:38:02 -05:00
Field G. Van Zee
10c9e8f952 Cache hardware's arch_t id after querying once.
Details:
- Added logic to bli_arch.c that will call what was previously the body
  of bli_arch_query_id() only once and then cache the value in a static
  variable local to the file. (Previously, the arch_t associated with
  the hardware/configuration was queried every time bli_arch_query_id()
  was called, which was at least once per level-3 function call.) Thanks
  to Devin Matthews for suggesting this feature via issue #175.
- Added -lpthread to the compile/link command line of the compiler
  invocation that compiles build/detect/config/config_detect.c, which
  prints the string identifying the detected configuration, since it
  is now needed due to new pthread_once() logic in bli_arch.c.
- Implementation note: I chose to implement this arch_t caching feature
  via pthread_once(), using a separate pthread_once_t variable local to
  the file, rather than calling bli_init_once(). The reason is that I
  did not want to require bli_init() as a prerequisite to this function.
  bli_init() already calls several sub-components, some of which make use
  of bli_arch_query_id(), and therefore it would be easy to fall into a
  circular self-init situation (which usually causes pthreads to hang
  indefinitely).
2018-05-17 15:22:51 -05:00
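The arch_t caching described in 10c9e8f952 follows a common pthread_once() idiom: a file-local once-control guards a one-time query whose result is stored in a static variable. Here is a minimal sketch with a hypothetical query function and arch id:

```c
#include <assert.h>
#include <pthread.h>

static int query_calls = 0;

/* Stand-in for an expensive hardware query (e.g. CPUID-based detection). */
static int query_arch_id_uncached(void)
{
    ++query_calls;
    return 42; /* hypothetical arch id */
}

/* File-local once-control, independent of any library-wide init routine,
   so the query cannot recurse into a circular self-init. */
static pthread_once_t arch_once = PTHREAD_ONCE_INIT;
static int            arch_id_cached;

static void arch_set_id(void)
{
    arch_id_cached = query_arch_id_uncached();
}

static int arch_query_id(void)
{
    pthread_once(&arch_once, arch_set_id);
    return arch_id_cached;
}
```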