169 Commits

Author SHA1 Message Date
Field G. Van Zee
2f3174330f Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
  the small blocks of memory (around 80-100 bytes each) used when
  creating nodes for control and thread trees (cntl_t and thrinfo_t) as
  well as thread communicators (thrcomm_t). The purpose of the small
  block allocator, or sba, is to allow the library to transition into a
  runtime state in which it does not perform any calls to malloc() or
  free() during normal execution of level-3 operations, regardless of
  the threading environment (potentially multiple application threads
  as well as multiple BLIS threads). The functionality relies on a new
  data structure, apool_t, which is (roughly speaking) a pool of
  arrays, where each array element is a pool of small blocks. The outer
  pool, which is protected by a mutex, provides separate arrays for each
  application thread while the arrays each handle multiple BLIS threads
  for any given application thread. The design minimizes the potential
  for lock contention, as only concurrent application threads would
  need to fight for the apool_t lock, and only if they happen to begin
  their level-3 operations at precisely the same time. Thanks to Kiran
  Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
  by default; renamed the --[dis|en]able-packbuf-pools option to
  --[dis|en]able-pba-pools; and rewrote the --help text associated with
  this new option and consolidated it with the --help text for the
  option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
  a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
  do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
  used for small blocks with calls to bli_sba_acquire(), which takes a
  rntm (in addition to the bytes requested), and bli_sba_release().
  These latter two functions reduce to the former two when the sba pools
  are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
  required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
  in the block_size) from bli_membrk_acquire_m() to the implementation
  of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
  relative to that of gemmtrsm_ukr, in part to avoid the need to create
  a packm control tree node, which now requires a rntm_t that has been
  initialized with an sba and membrk.
- Re-enabled the explicit call to bli_finalize() in the testsuite so that
  users who run the testsuite with memory tracing enabled can check for
  memory leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
  the rntm to be copied locally when it is passed in via one of the
  expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
  thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
  formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
2018-12-25 19:35:01 -06:00
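The core idea behind the small block allocator in 2f3174330f, caching released blocks so that steady-state execution no longer reaches malloc() or free(), can be pictured with a single free-list pool. This is only a minimal sketch under stated assumptions: every name below (small_pool_t, small_pool_acquire(), and so on) is hypothetical and is not the apool_t/pool_t API from the commit, and the mutex and the per-application-thread arrays described above are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not the BLIS apool_t/pool_t API. */
typedef struct block_s
{
    struct block_s* next; /* links cached (free) blocks together */
} block_t;

typedef struct
{
    block_t* free_list;   /* blocks available for checkout */
    size_t   block_size;  /* size in bytes of every block in the pool */
    size_t   n_mallocs;   /* number of times we fell back to malloc() */
} small_pool_t;

void small_pool_init( small_pool_t* pool, size_t block_size )
{
    assert( pool != NULL );

    /* Each block must be large enough to hold the free-list link. */
    if ( block_size < sizeof( block_t ) ) block_size = sizeof( block_t );

    pool->free_list  = NULL;
    pool->block_size = block_size;
    pool->n_mallocs  = 0;
}

/* Check out a block: reuse a cached one if possible, else call malloc(). */
void* small_pool_acquire( small_pool_t* pool )
{
    if ( pool->free_list != NULL )
    {
        block_t* b = pool->free_list;
        pool->free_list = b->next;
        return b;
    }

    pool->n_mallocs++;
    return malloc( pool->block_size );
}

/* Check a block back in: cache it instead of calling free(). */
void small_pool_release( small_pool_t* pool, void* p )
{
    block_t* b = p;
    b->next = pool->free_list;
    pool->free_list = b;
}

/* Return all cached blocks to the system. */
void small_pool_finalize( small_pool_t* pool )
{
    while ( pool->free_list != NULL )
    {
        block_t* b = pool->free_list;
        pool->free_list = b->next;
        free( b );
    }
}
```

In this sketch, once a block has been acquired and released, the next acquire reuses it and n_mallocs stops growing, which is the "no malloc() during normal execution" state the commit message describes.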
Field G. Van Zee
f808d829c5 Handle edge cases, zero-filling in packm kernels.
Details:
- Updated the API and semantics of packm kernels such that they must now
  handle edge cases, meaning that a c-by-k packm kernel must be able to
  pack edge cases that are fewer than c rows/columns and be able to
  zero-fill the remaining elements. They must also be able to zero-fill
  the equivalent region when copying fewer than k columns/rows (which is
  needed by trsm). The new packm kernel API is generally:

    void packm_kernel
         (
           conj_t           conja,
           dim_t            cdim,
           dim_t            n,
           dim_t            n_max,
           ctype*  restrict kappa,
           ctype*  restrict a, inc_t inca, inc_t lda,
           ctype*  restrict p,             inc_t ldp,
           cntx_t* restrict cntx
         );

  where cdim and n are the dimensions (short and long, respectively) of
  the submatrix being copied from the source matrix A, and n_max is the
  "full" long dimension (corresponding to the k dimension in gemm) of
  the micropanel. The "full" short dimension (corresponding to the
  register blocksize MR or NR) is not part of the API because it is
  known intrinsically by the packm kernel implementation. Thanks to
  Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
  above changes, as well as all optimized packm kernels (which only
  consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
  I was considering leaving it unchanged, but I couldn't escape the
  reality that the packm kernel API is much closer to an expert API
  than it is some obscure helper function interface within the framework
  that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
  that would have been using those kernels is knc, which is likely no
  longer being used by very many people (if any). (This also mostly
  offset the larger object code footprint incurred by moving the edge-
  case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
  which those implementations were modifying the contexts stored in the
  gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
  method implementations of trsm from executing.
2018-12-12 15:22:59 -06:00
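The edge-case semantics that f808d829c5 requires of packm kernels can be illustrated with a real-domain sketch. It assumes a hypothetical fixed short dimension MR of 4 and a hypothetical function name, and it drops the conja and cntx parameters of the real API; what it keeps is the contract that rows beyond cdim and columns beyond n (up to n_max) are zero-filled.

```c
#include <assert.h>
#include <stddef.h>

#define MR 4 /* hypothetical "full" short dimension known to the kernel */

/* Pack a cdim-by-n submatrix of A into an MR-by-n_max micropanel p,
   scaling by kappa and zero-filling everything the submatrix does not
   cover. Simplified relative to the API above: no conja, no cntx. */
void packm_sketch
     (
       size_t        cdim,
       size_t        n,
       size_t        n_max,
       double        kappa,
       const double* a, size_t inca, size_t lda,
       double*       p,              size_t ldp
     )
{
    assert( cdim <= MR && n <= n_max && MR <= ldp );

    for ( size_t j = 0; j < n_max; ++j )
    {
        for ( size_t i = 0; i < MR; ++i )
        {
            if ( i < cdim && j < n )
                p[ i + j*ldp ] = kappa * a[ i*inca + j*lda ];
            else
                p[ i + j*ldp ] = 0.0; /* edge case: zero-fill */
        }
    }
}
```

Packing a 3-by-2 submatrix with n_max equal to 3 leaves the fourth row and the third column of the micropanel zeroed, so a microkernel can safely treat the panel as full-sized.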
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
375eb30b0a Added mixed-precision support to 1m method.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
  datatypes (along with the computation datatype) are equal. Now, 1m may
  be used as long as all operands are stored in the complex domain. This
  change largely consisted of adding the ability to pack to 1e and 1r
  formats from one precision to another. It also required adding logic
  for handling complex values of alpha to bli_packm_blk_var1_md()
  (similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
  bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
  ukernel output preference field being read. Previously, the preference
  for the native complex ukernel was being read instead of the pref for
  the native real domain ukernel. This bug would not manifest if the
  preference for the native complex ukernel happened to be equal to that
  of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
  module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
  schemas are always read from the context, rather than trying to
  sometimes embed them directly to the A and B objects. (They are still
  embedded, but now uniformly only after reading the schemas from the
  context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
  and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
  consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
  bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
  gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
  scal2ris, and xpbyris (and their conjugating counterparts) to
  explicitly support three operand types and updated invocations to
  xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
  storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
  frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.
2018-12-03 17:49:52 -06:00
Field G. Van Zee
e769bf46b0 Tweak testsuite to issue FAIL for NaN, Inf (#279).
Details:
- Adjusted the definition for libblis_test_get_string_for_result() in
  testsuite/src/test_libblis.c so that the "FAIL" string is returned if
  the computed residual contains either NaN or Inf. Previously, a
  residual containing NaN would result in the selection of the "PASS"
  string. Thanks to Devin Matthews for reporting this issue (#279).
- Expounded on comment for the macro definitions of bli_isnan() and
  bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
  must remain macros.
2018-11-20 16:16:53 -06:00
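The pitfall fixed in e769bf46b0 comes from the fact that NaN compares false against every threshold, so a residual check built only from comparisons can fall through to "PASS". A minimal sketch follows, assuming a hypothetical two-state result and a hypothetical function name (the real libblis_test_get_string_for_result() is not reproduced here):

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for the testsuite's result-string selection. */
const char* result_string_for( double resid, double thresh_pass )
{
    assert( thresh_pass >= 0.0 );

    /* NaN compares false against everything, so "resid > thresh_pass"
       alone would never flag it; test for NaN and Inf explicitly. */
    if ( isnan( resid ) || isinf( resid ) ) return "FAIL";

    if ( resid > thresh_pass ) return "FAIL";

    return "PASS";
}
```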
Field G. Van Zee
4bbb454bf3 Testsuite docs update for mixed-datatype gemm.
Details:
- Updated docs/Testsuite.md to include mention of the new mixed-domain
  and mixed-precision settings, including descriptions.
- Updated docs/MixedDatatypes.md to include a brief section on running
  the testsuite to exercise mixed-datatype functionality, which mostly
  amounts to a link to the Testsuite.md document.
- Minor verbiage change to testsuite output to correct a misleading
  label associated with the value returned by the query function
  bli_info_get_simd_num_registers(). (The function does not return the
  number of SIMD registers present in the hardware, but rather a maximum
  assumed value for the purposes of allocating temporary microtile
  workspace on the function stack.)
2018-11-03 19:11:01 -05:00
Field G. Van Zee
f19c33af4c Disallow 64b BLAS integers + 32b BLIS integers.
Details:
- Print an error message from configure if the user attempts to
  explicitly configure BLIS for simultaneous use of 64-bit integers in
  the BLAS API with 32-bit integers in the BLIS API.
- Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
  integers be 64 bits if the BLAS integers are 64 bits. This and the
  above item take care of issue #274. Thanks to Devin Matthews and
  Jeff Hammond for suggesting these safeguards.
- Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
  sections and BLIS integer size line of the testsuite configuration
  output.
- Very minor edits to docs/MixedDatatypes.md.
2018-10-26 17:07:15 -05:00
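A compile-time guard of the kind f19c33af4c adds to bli_type_defs.h might look roughly like the following. The macro and type names are hypothetical stand-ins for whatever configure actually emits, and the comment about narrowing is one plausible motivation rather than something the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration macros standing in for configure's output. */
#define SKETCH_BLIS_INT_SIZE 64
#define SKETCH_BLAS_INT_SIZE 64

/* Reject 64-bit BLAS integers paired with 32-bit BLIS integers. */
#if ( SKETCH_BLAS_INT_SIZE == 64 ) && ( SKETCH_BLIS_INT_SIZE != 64 )
  #error "64-bit BLAS integers require 64-bit BLIS integers."
#endif

#if SKETCH_BLIS_INT_SIZE == 64
typedef int64_t sketch_blis_int_t;
#else
typedef int32_t sketch_blis_int_t;
#endif

#if SKETCH_BLAS_INT_SIZE == 64
typedef int64_t sketch_blas_int_t;
#else
typedef int32_t sketch_blas_int_t;
#endif

/* With the guard in place, converting a BLAS-layer integer to a
   BLIS-layer integer can never narrow the value. */
static sketch_blis_int_t to_blis_int( sketch_blas_int_t i )
{
    assert( sizeof( sketch_blis_int_t ) >= sizeof( sketch_blas_int_t ) );
    return ( sketch_blis_int_t )i;
}
```

Changing SKETCH_BLIS_INT_SIZE to 32 while leaving SKETCH_BLAS_INT_SIZE at 64 makes the translation unit fail to compile, which mirrors the intent of the safeguard.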
Field G. Van Zee
6fbc456fb3 Added SALT testing to Travis CI.
Details:
- Modified .travis.yml to automatically employ the simulation of
  application-level threading within the testsuite, with supporting
  changes to common.mk, the top-level Makefile, and
  travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
  '.salt' suffix (similar to those with the '.fast' suffix) for
  testing application-level threading.
- Updated docs/BuildSystem.md to document the new make targets
  'testblis-salt' and 'checkblis-salt'.
2018-10-25 13:20:25 -05:00
Field G. Van Zee
4ee986f0a7 Added mixed-datatype testing to Travis CI (#271).
Details:
- Modified .travis.yml to automatically test the mixed-datatype support
  of the gemm operation, with supporting changes to common.mk, the
  top-level Makefile, and travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
  '.mixed' suffix (similar to those with the '.fast' suffix) for testing
  mixed-datatype gemm.
- Updated docs/BuildSystem.md to document the new make targets
  'testblis-md' and 'checkblis-md'.
2018-10-22 14:09:44 -05:00
Field G. Van Zee
090e4f08fc Merge branch 'master' into dev 2018-10-19 18:41:10 -05:00
Field G. Van Zee
3678a1cd51 Merge branch 'master' into win-pthreads 2018-10-19 16:11:31 -05:00
Field G. Van Zee
473ce54f5f Added bli_pthread_*() API.
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
  against a Windows DLL, will be able to access pthreads functionality
  without those pthreads functions being explicitly exported by the DLL.
  Instead, we export the bli_pthread_*() layer, which uses types and
  functions that are identical to pthreads, but adds a 'bli_' prefix.
  Only a few basic functions are present in the bli_pthread_*() API
  for now. Thanks to Devin Matthews and Isuru Fernando for their help
  on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
  pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment update to build/regen-symbols.sh.
2018-10-18 19:03:56 -05:00
Field G. Van Zee
bb6df2814f Defined a new level-1d operation: shiftd.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
  typed APIs. This operation adds a scalar value to every element along
  an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
  terms of the addv kernel. (The scalar is passed in as the x vector
  with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
  matrix object) with use of shiftd, which is much more concise, in
  various test driver files in the testsuite. Similar changes were made
  to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
  for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
  BLISObjectAPI.md.
2018-10-18 17:11:39 -05:00
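The trick mentioned in bb6df2814f, implementing shiftd through an addv kernel by passing the scalar as a vector with an increment of zero, can be sketched for real double-precision data as follows. The function names and the simplified signatures are hypothetical; in this sketch a positive offset selects a superdiagonal and a negative one a subdiagonal.

```c
#include <assert.h>
#include <stddef.h>

/* y := y + x for strided vectors (a stand-in for an addv kernel). */
static void addv_sketch( size_t n, const double* x, size_t incx,
                         double* y, size_t incy )
{
    for ( size_t k = 0; k < n; ++k ) y[ k*incy ] += x[ k*incx ];
}

/* Add alpha to every element on the diagonal with offset doff of an
   m-by-n matrix x with row stride rs and column stride cs. */
void shiftd_sketch( long doff, size_t m, size_t n, double alpha,
                    double* x, size_t rs, size_t cs )
{
    assert( x != NULL );

    /* Row and column index of the first element on the diagonal. */
    size_t i0 = ( doff < 0 ) ? ( size_t )( -doff ) : 0;
    size_t j0 = ( doff > 0 ) ? ( size_t )(  doff ) : 0;

    /* Nothing to do if the diagonal lies outside the matrix. */
    if ( i0 >= m || j0 >= n ) return;

    size_t len = ( m - i0 < n - j0 ) ? ( m - i0 ) : ( n - j0 );

    /* The scalar acts as a vector with increment zero; stepping along a
       diagonal means advancing by one row and one column at a time. */
    addv_sketch( len, &alpha, 0, x + i0*rs + j0*cs, rs + cs );
}
```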
Field G. Van Zee
49d3f9fcbb Merge branch 'master' into dev 2018-10-17 18:00:40 -05:00
Devin Matthews
29e6245816 Merge branch 'master' into win-pthreads 2018-10-16 10:12:25 -05:00
Field G. Van Zee
ed65771482 Fixed merge fail on testsuite threading macros.
Details:
- Applied the following C preprocessor macro renames

    BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
    BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
    BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
    BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

  in src/test_libblis.c. This is apparently the result of a failure by
  git to properly merge the 'master' and 'amd' branches in the previous
  commit. (The 'master' branch contained a commit, 53a9ab1, in which
  these same cpp macros were renamed throughout the source distribution.)
2018-10-15 17:54:45 -05:00
Field G. Van Zee
779d64dc30 Added entry for xpbym to input.operations.fast.
Details:
- Forgot to add an entry for the new xpbym operation to
  input.operations.fast in previous commit.
2018-10-15 17:13:18 -05:00
Field G. Van Zee
5fec95b99f Implemented mixed-datatype support for gemm.
Details:
- Implemented support for gemm where A, B, and C may have different
  storage datatypes, as well as a computational precision (and implied
  computation domain) that may be different from the storage precision
  of either A or B. This results in 128 different combinations, all of
  which are implemented within this commit. (For now, the mixed-datatype
  functionality is only supported via the object API.) If desired, the
  mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
  that requires a single m-by-n matrix be allocated (temporarily) per
  call to gemm. This optimization aims to avoid the overhead involved in
  repeatedly updating C with general stride, or updating C after a
  typecast from the computation precision. This memory optimization may
  be disabled at configure-time (provided that the mixed-datatype
  support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
  The user may test gemm with mixed domains, precisions, both, or
  neither.
- Added a standalone test driver directory for building and running
  mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
  except that imaginary values are not touched when casting a real
  operand to a complex operand. (By contrast, in these situations castm
  sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
  usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
  also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
  when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
  accessor functions. Also commented out all usage of accessor
  functions within macrokernels. (Typecasting in the microkernel is
  still feasible, though probably unrealistic for now given the
  additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
  pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
  docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
  (courtesy of Devin Matthews).
2018-10-15 16:37:39 -05:00
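The difference between castm and castnzm described in 5fec95b99f shows up only when a real operand is cast to a complex one. The sketch below works on plain contiguous vectors rather than matrix objects, and the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type with separate real and imaginary parts. */
typedef struct { double real; double imag; } zsketch_t;

/* castm-like behavior: copy the real parts and zero the imaginary parts. */
void castm_sketch( size_t n, const double* x, zsketch_t* y )
{
    assert( x != NULL && y != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        y[ i ].real = x[ i ];
        y[ i ].imag = 0.0;
    }
}

/* castnzm-like behavior: copy the real parts and leave the imaginary
   parts of the destination untouched. */
void castnzm_sketch( size_t n, const double* x, zsketch_t* y )
{
    assert( x != NULL && y != NULL );

    for ( size_t i = 0; i < n; ++i )
        y[ i ].real = x[ i ];
}
```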
Field G. Van Zee
f1dba506c9 Output threading status/params from testsuite.
Details:
- Updated testsuite to output various parameters related to parallelism
  in BLIS. These parameters include:
  - threading status: disabled, openmp, or pthreads;
  - thread partitioning for jr/ir loops: slab or rr (round-robin);
  - ways of parallelism from environment variables, and also actual
    values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
    square problems (assuming all dimensions are set to 1000);
  - automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
  libmemkind and the sandbox.
2018-10-08 17:59:41 -05:00
Devin Matthews
b8dfd82e0d Get pthreads via blis.h in the test driver. 2018-10-02 15:37:12 -05:00
Devin Matthews
627d0c5bfd Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows. 2018-10-02 14:40:55 -05:00
Field G. Van Zee
c03728f1f4 Various minor cleanups.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
  unconditionally, but differently for Windows and non-Windows, but
  then disabled the definition of bli_setenv() entirely since BLIS
  no longer needs to set environment variables. Updated bli_winsys.h
  accordingly, and call bli_sleep() from within testsuite instead of
  sleep() directly.
- Use
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
  instead of
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
  when guarding against local definition of pthread barrier in
  testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
  should always be set to 200809L when barriers are supported, though I
  won't be surprised if we encounter a case in the future where it is
  set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
  top-level Makefile. These are no longer needed because we no longer
  output libraries with the version and configuration name as
  substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
  configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
2018-09-10 17:54:27 -05:00
Field G. Van Zee
4b5437ec7a Define a cpp macro specific to BLIS compilation.
Details:
- Tweaked the cflags functions in common.mk so that a new preprocessor
  macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS
  itself is being built. This macro will not be defined when, for
  example, the testsuite or example code compiles code local to those
  applications. This was done in part by defining a new cflags function
  get-user-cflags-for(), which is now the designated function for
  application Makefiles if they wish to inherit a basic set of CFLAGS
  from BLIS. (The compiler flags returned are identical to that of
  get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is
  omitted.)
- Updated all test driver-like makefiles to call get-user-cflags-for()
  instead of get-frame-cflags-for().
2018-09-07 17:24:32 -05:00
Mathieu Poumeyrol
4e7d06700f second __APPLE__ 2018-09-06 23:48:31 +02:00
Mathieu Poumeyrol
24ecc0d94a use _POSIX_BARRIERS instead of __APPLE__ 2018-09-06 22:10:16 +02:00
Mathieu Poumeyrol
d688a2b7e5 add an adhoc impl for pthread_barrier 2018-09-06 15:31:14 +02:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occur in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
b051ffb815 Merge branch 'dev' 2018-08-29 17:06:48 -05:00
Field G. Van Zee
8199e339ae Added testsuite threading to input.general.fast.
Details:
- Added lines associated with the testsuite's new threading option to
  input.general.fast. This change was intended for the previous commit
  (10d0735).
2018-08-27 07:00:12 -05:00
Field G. Van Zee
10d07357af Better thread safety; added threading to testsuite.
Details:
- Replaced critical sections that were conditional upon multithreading
  being enabled (via pthreads or OpenMP) with unconditional use of
  pthreads mutexes. (Why pthreads? Because BLIS already requires it
  for its initialization mechanism: pthread_once().) This was done in
  bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
  mtx_t object and bli_mutex_*() API with pthread mutexes in
  bli_thread.c. The previous status quo could result in a race condition
  if the application called BLIS from more than one thread. The new
  pthread-based code should be completely agnostic to the application's
  threading configuration. Thanks to AMD for bringing to our attention
  the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
  multithreading. Specifically, each thread maintains a counter that is
  incremented after each experiment. The thread only executes the
  experiment if: counter % n_threads == thread_id. In other words, the
  threads simply take turns executing each problem experiment. Also,
  POSIX guarantees that fprintf() will not intermingle output, so
  output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and
  replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
  wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
  a race condition; specifically, two threads calling the function with
  the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
  bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
  to (and updating existing lines of) source files.
2018-08-26 20:34:30 -05:00
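The turn-taking rule that 10d07357af uses to simulate application-level threading (run the experiment only if counter % n_threads == thread_id) can be written down directly. The struct and function names here are hypothetical, and thread creation is omitted; the sketch only models the per-thread bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread state for the testsuite's simulated
   application threads. */
typedef struct
{
    size_t thread_id; /* this thread's id, in [0, n_threads) */
    size_t n_threads; /* total number of simulated application threads */
    size_t counter;   /* incremented after each experiment */
} turn_state_t;

/* Report whether this thread should execute the current experiment,
   then advance the thread's private counter. */
bool take_turn( turn_state_t* s )
{
    assert( s->n_threads > 0 && s->thread_id < s->n_threads );

    bool mine = ( s->counter % s->n_threads == s->thread_id );
    s->counter++;
    return mine;
}
```

Because every thread advances its own counter in lockstep with the sequence of experiments, exactly one thread claims each experiment without any shared state.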
Field G. Van Zee
0f491e994a Allow lesser Makefiles to reference installed BLIS.
Details:
- Updated the build system so that "lesser" Makefiles, such as those
  belonging to example code or the testsuite, may be run even if the
  directory is orphaned from the original build tree. This allows a
  user to configure, compile, and install BLIS, delete the build tree
  (that is, the source distribution, or the build directory for out-
  of-tree builds) and then compile example or testsuite code and link
  against the installed copy of BLIS (provided the example or testsuite
  directory was preserved or obtained from another source). The only
  requirement is that make be invoked while setting the
  BLIS_INSTALL_PATH variable to the same installation prefix used when
  BLIS was configured. The easiest syntax is:

    make BLIS_INSTALL_PATH=/install/prefix

  though it's also permissible to set BLIS_INSTALL_PATH as an
  environment variable prior to running 'make'.
- Updated all lesser Makefiles to implement the new aforementioned build
  behavior.
- Relocated check-blastest.sh and check-blistest.sh from build to
  blastest and testsuite, respectively, so that if those directories are
  copied elsewhere the user can still run 'make check' locally.
- Updated docs/Testsuite.md with language that mentions this new option
  of building/linking against an installed copy of BLIS.
2018-08-25 20:12:36 -05:00
Field G. Van Zee
017548314f Replaced function chooser macros w/ func ptr arrays.
Details:
- Previously, most object API functions (_oapi.c) used a function
  chooser macro that would expand out to an if-elseif-elseif-else
  conditional that used a num_t datatype to call the appropriate
  type-specific API (_tapi.c). This always felt a little hackish, and
  would get in the way somewhat of adding support for new num_t datatypes
  in the future. So, I've replaced that functionality with code that
  queries a function pointer that is then typecast appropriately. This
  model of function calling was already pervasive for kernels queried
  from the cntx_t structure. It was also already in use in various other
  functions, such as macrokernels, and this commit simply extends that
  pattern.
- The above change required many new files, mostly header files, that
  define the function types (mostly _ft.h) for the queriable functions
  as well as some source files to define the function pointer arrays and
  their corresponding query functions (_fpa.c). Various other function
  types, mostly for kernel function types, were renamed to reduce the
  potential for confusion with the function types for expert and basic
  (non-expert) typed API functions.
- Removed definitions for all of the "bli_call_ft_*()" function chooser
  macros from bli_misc_macro_defs.h.
2018-08-07 14:13:25 -05:00
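The pattern adopted in 017548314f, replacing an if-elseif chain over the datatype with a lookup into an array of function pointers, might look like this in miniature. All names (sk_num_t, scalv_fpa, scalv_query(), and so on) are hypothetical, and the sketch sidesteps the typecasting step by giving every typed function the same void*-based signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical datatype ids, usable directly as array indices. */
typedef enum { SK_FLOAT = 0, SK_DOUBLE = 1, SK_NUM_TYPES } sk_num_t;

/* One function type shared by all type-specific implementations. */
typedef void (*scalv_ft)( size_t n, const void* alpha, void* x );

static void scalv_s( size_t n, const void* alpha, void* x )
{
    const float a  = *( const float* )alpha;
    float*      xs = x;
    for ( size_t i = 0; i < n; ++i ) xs[ i ] *= a;
}

static void scalv_d( size_t n, const void* alpha, void* x )
{
    const double a  = *( const double* )alpha;
    double*      xd = x;
    for ( size_t i = 0; i < n; ++i ) xd[ i ] *= a;
}

/* The function pointer array, indexed by datatype. */
static scalv_ft scalv_fpa[ SK_NUM_TYPES ] =
{
    [ SK_FLOAT  ] = scalv_s,
    [ SK_DOUBLE ] = scalv_d
};

/* Query function: maps a datatype to its typed implementation. */
scalv_ft scalv_query( sk_num_t dt )
{
    assert( ( int )dt >= 0 && ( int )dt < ( int )SK_NUM_TYPES );
    return scalv_fpa[ dt ];
}

/* Object-API-like front end: no if-elseif chain over the datatype. */
void scalv_oapi( sk_num_t dt, size_t n, const void* alpha, void* x )
{
    scalv_ft f = scalv_query( dt );
    f( n, alpha, x );
}
```

Supporting a new datatype in this sketch means adding one enum value and one array entry, rather than extending a chooser macro at every call site.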
Field G. Van Zee
94d5ef42c8 Adjusted gflops format spec in testsuite, test/3m4m.
Details:
- Changed the format specifier for the gflops column in the testsuite
  output from %7.3f to %7.2f. This was done mainly to keep the output
  aligned properly when the expected performance exceeded 1000 gflops.
  Also, two decimal places still conveys plenty of precision for all
  practical applications, including just eyeballing performance deltas
  between two executions (let alone two implementations).
- Changed the format specifier for gflops in the test/3m4m drivers
  from %6.3f to %7.2f (for the same reasons listed above).
2018-08-04 15:57:17 -05:00
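The alignment problem addressed in 94d5ef42c8 follows from how printf field widths work: the width is a minimum, so a value of 1000 or more printed with three decimals needs eight characters and overflows a seven-character column. A small sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>

/* Format gflops with the old specifier; returns the character count. */
int format_gflops_old( char* buf, size_t len, double gflops )
{
    assert( buf != NULL && len > 0 );
    return snprintf( buf, len, "%7.3f", gflops );
}

/* Format gflops with the new specifier; returns the character count. */
int format_gflops_new( char* buf, size_t len, double gflops )
{
    assert( buf != NULL && len > 0 );
    return snprintf( buf, len, "%7.2f", gflops );
}
```

For 1234.5 gflops, the old specifier yields "1234.500" (eight characters) while the new one yields "1234.50" (seven characters), so the column stays aligned up to 9999.99.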
Field G. Van Zee
ecbebe7c2e Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
  field of the cntx_t (context). The thrloop array holds the number of
  ways of parallelism (thread "splits") to extract per level-3
  algorithmic loop until those values can be used to create a
  corresponding node in the thread control tree (thrinfo_t structure),
  which (for any given level-3 invocation) usually happens by the time
  the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
  when invoking level-3 operations from two or more application threads.
  The race condition existed because the cntx_t, a pointer to which is
  usually queried from the global kernel structure (gks), is supposed to
  be read-only. However, the previous code would write to the cntx_t's
  thrloop field *after* it had been queried, thus violating its read-only
  status. In practice, this would not cause a problem when a sequential
  application made a multithreaded call to BLIS, nor when two or more
  application threads used the same parallelization scheme when calling
  BLIS, because in either case all application threads would be using
  the same ways of parallelism for each loop. The true effects of the
  race condition were limited to situations where two or more application
  threads used *different* parallelization schemes for any given level-3
  call.
- In remedying the above race condition, the application or calling
  library can now specify the parallelization scheme on a per-call basis.
  All that is required is that the thread encode its request for
  parallelism into the rntm_t struct prior to passing the address of the
  rntm_t to one of the expert interfaces of either the typed or object
  APIs. This allows, for example, one application thread to extract 4-way
  parallelism from a call to gemm while another application thread
  requests 2-way parallelism. Or, two threads could each request 4-way
  parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
  of the level-3 implementation stack (with the most notable exception
  being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
  APIs. (A few internal functions gained the rntm_t* parameter even
  though they currently have no use for it, such as bli_l3_packm().)
  This required some internal calls to some of those functions to
  be updated since BLIS was already using those operations internally
  via the expert interfaces. For situations where a rntm_t object is
  not available, such as within packm/unpackm implementations, NULL is
  passed in to the relevant expert interfaces. This is acceptable for
  now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
  environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
  read once, at library initialization. (Thanks to Nathaniel Smith for
  suggesting this to avoid repeated calls to getenv(), which can be slow.)
  Those values are recorded to a global rntm_t object. Public APIs, in
  bli_thread.c, are still available to get/set these values from the
  global rntm_t, though now the "set" functions have additional logic
  to ensure that the values are set in a synchronous manner via a mutex.
  If/when NULL is passed into an expert API (meaning the user opted to
  not provide a custom rntm_t), the values from the global rntm_t are
  copied to a local rntm_t, which is then passed down the function stack.
  Calling a basic API is equivalent to calling the expert APIs with NULL
  for the cntx and rntm parameters, which means the semantic behavior of
  these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
  and reimplemented, with the function now being able to treat the
  incoming rntm_t in a manner agnostic to its origin--whether it came
  from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
  parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
  as the corresponding "get" functions. The new model simplifies these
  interfaces so that one must either set the total number of threads, OR
  set all of the ways of parallelism for each loop simultaneously (in a
  single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
  (and two specific ways within each method) of requesting parallelism
  in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
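
The global-parallelism scheme described above (read the environment once, guard "set" calls with a mutex, and copy the global values into a local object when NULL is passed) could be sketched roughly as follows. This is only an illustration, not the BLIS implementation: my_rntm_t and the my_* functions are hypothetical names, the real rntm_t carries more fields, and taking the lock while copying is an assumption made here.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for rntm_t; the real struct has more fields. */
typedef struct { long num_threads; } my_rntm_t;

static my_rntm_t       global_rntm;
static pthread_mutex_t global_rntm_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t  global_rntm_once  = PTHREAD_ONCE_INIT;

/* Read the environment exactly once and cache the result globally. */
static void my_rntm_init_from_env(void)
{
    const char *s = getenv("BLIS_NUM_THREADS");
    long n = s ? strtol(s, NULL, 10) : 1;
    global_rntm.num_threads = (n > 0) ? n : 1;
}

/* "set" function: writers to the global object are serialized by a mutex. */
void my_thread_set_num_threads(long n)
{
    pthread_once(&global_rntm_once, my_rntm_init_from_env);
    pthread_mutex_lock(&global_rntm_mutex);
    global_rntm.num_threads = n;
    pthread_mutex_unlock(&global_rntm_mutex);
}

/* Expert-API entry: if the caller passed NULL, copy the global values into
   a local object; otherwise copy the caller's object. The local copy is
   what would be passed down the function stack. */
void my_rntm_resolve(const my_rntm_t *user, my_rntm_t *local)
{
    pthread_once(&global_rntm_once, my_rntm_init_from_env);
    if (user != NULL) { *local = *user; return; }
    pthread_mutex_lock(&global_rntm_mutex);   /* assumption: lock on read */
    *local = global_rntm;
    pthread_mutex_unlock(&global_rntm_mutex);
}
```

Under this sketch, a basic API would simply call my_rntm_resolve(NULL, &local), which is why its threading behavior would match the global settings.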
2018-07-17 18:37:32 -05:00
Isuru Fernando
14648e1376 Native windows support using clang (#227)
* Add appveyor file

* Build script

* Remove fPIC for now

* copy as

* set CC and CXX

* Change the order of immintrin.h

* Fix testsuite header

* Move testsuite defs to .c

* Fix appveyor file

* Remove fPIC again and fix strerror_r missing bug

* Remove appveyor script

* cd to blis directory

* Fix sleep implementation

* Add f2c_types_win.h

* Fix f2c compilation

* Remove rdp and rename appveyor.yml

* Remove setenv declaration in test header

* set CPICFLAGS to empty

* Fix another immintrin.h issue

* Escape CFLAGS and LDFLAGS

* Fix more ?mmintrin.h issues

* Build x86_64 in appveyor

* override LIBM LIBPTHREAD AR AS

* override pthreads in configure

* Move windows definitions to bli_winsys.h

* Fix LIBPTHREAD default value

* Build intel64 in appveyor for now
2018-07-04 17:48:42 -05:00
Field G. Van Zee
f1908d3976 Fixed broken input.operations.fast.
Details:
- Removed three input lines from input.operations.fast (labeled
  "test sequential micro-kernel") that I intended to remove in bd02c4e.
  These lines prevented 'make check' (and 'make checkblis-fast') from
  completing correctly. Note: This bug was fixed in 3df39b3, but that
  commit has not yet been merged into master, hence this redundant
  commit. Thanks to Robert van de Geijn for reporting this issue.
2018-06-08 14:22:22 -05:00
Field G. Van Zee
bd02c4e9f7 Cleanups to testsuite, input.operations format.
Details:
- Removed the line in each operation entry in input.operations titled
  "test sequential front-end" and the corresponding support for the lines
  in the testsuite input parsing code. This line was included in some
  of the earliest versions of the testsuite, back when I intended to
  eventually have separate multithreaded APIs. Specifically, I envisioned
  that multithreaded and sequential testing could be enabled or disabled
  on an operation level. However, BLIS evolved in a different direction
  and still does not have multithreaded-specific APIs (even if it will
  someday). But even if it did have such APIs, I doubt I would
  allow the user to enable/disable them on an operation level. Thus, this
  was a zombie future parameter that was never used and never made sense
  to begin with. The one instance of the front_seq variable, used in the
  various libblis_test_<operation>() functions to guard the call to the
  operation test driver, that remains was commented out instead of
  deleted so that someday it could be easily changed via sed, if desired.
- Various minor cleanups to the testsuite code, including consolidating
  use of DISABLE and DISABLE_ALL and reexpressing certain conditional
  expressions in the libblis_test_<operation>() functions in terms of
  boolean functions.
2018-06-04 13:42:17 -05:00
Field G. Van Zee
4fb353bd90 Merge branch 'master' into dev 2018-05-13 17:50:51 -05:00
Field G. Van Zee
bf03503059 Renamed (shortened) a few build system variables.
Details:
- Renamed the following variables in config.mk (via build/config.mk.in):
    BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE
    BLIS_ENABLE_STATIC_BUILD        -> MK_ENABLE_STATIC
    BLIS_ENABLE_SHARED_BUILD        -> MK_ENABLE_SHARED
    BLIS_ENABLE_BLAS2BLIS           -> MK_ENABLE_BLAS
    BLIS_ENABLE_CBLAS               -> MK_ENABLE_CBLAS
    BLIS_ENABLE_MEMKIND             -> MK_ENABLE_MEMKIND
  and also renamed all uses of these variables in makefiles and makefile
  fragments. Notice that we use the "MK_" prefix so that those variables
  can be easily differentiated (such as via grep) from their "BLIS_" C
  preprocessor macro counterparts.
- Other whitespace changes to build/config.mk.in.
- Renamed the following C preprocessor macros in bli_config.h (via
  build/bli_config.h.in):
    BLIS_ENABLE_BLAS2BLIS        -> BLIS_ENABLE_BLAS
    BLIS_DISABLE_BLAS2BLIS       -> BLIS_DISABLE_BLAS
    BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE
  and also renamed all relevant uses of these macros in BLIS source
  files.
- Renamed "blas2blis" variable occurrences in configure to "blas", as
  was done in build/config.mk.in and build/bli_config.h.in.
- Renamed the following functions in frame/base/bli_info.c:
    bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas()
    bli_info_get_blas2blis_int_type_size()
                                    -> bli_info_get_blas_int_type_size()
- Removed bli_config.h during 'make cleanh' target of top-level Makefile.
2018-05-08 16:49:22 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
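
One common motivation for converting function-like macros to static functions is that a macro may evaluate its arguments more than once, while a function evaluates each argument exactly once and type-checks it. The sketch below illustrates that difference; MY_MAX_MACRO, my_max, and next_value are hypothetical names and are not taken from bli_param_macro_defs.h or bli_obj_macro_defs.h.

```c
/* Hypothetical function-like macro: the larger argument is evaluated twice. */
#define MY_MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Equivalent static function: each argument is evaluated exactly once. */
static long my_max(long a, long b)
{
    return (a > b) ? a : b;
}

/* Helper with a side effect, used to make double evaluation visible. */
static long counter_calls = 0;

static long next_value(void)
{
    return ++counter_calls;
}
```

With these definitions, MY_MAX_MACRO(next_value(), 0) calls next_value() twice, whereas my_max(next_value(), 0) calls it once.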
2018-05-08 14:26:30 -05:00
Field G. Van Zee
75d0d1057d Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and
  bli_param_macro_defs.h:
  - bli_obj_datatype()                 -> bli_obj_dt()
  - bli_obj_target_datatype()          -> bli_obj_target_dt()
  - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
  - bli_obj_set_datatype()             -> bli_obj_set_dt()
  - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size()                -> bli_dt_size()
  - bli_datatype_string()              -> bli_dt_string()
  - bli_datatype_union()               -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no
  longer in use.
2018-04-30 14:57:33 -05:00
Field G. Van Zee
2b7108a8ef Minor updates to test driver makefiles.
Details:
- Cleaned up and homogenized the various test driver Makefiles in
  testsuite and test directories.
- Very minor updates to test driver code.
2018-04-16 12:35:53 -05:00
Field G. Van Zee
7dc40eafdd Updates to top-level and test driver Makefiles.
Details:
- Added logic to common.mk that will choose a BLIS library against which
  to link (LIBBLIS_LINK). The default choice is the static (.a) library;
  the shared (.so) library is chosen only if the shared library build was
  enabled and the static one was disabled.
- Updated the various test driver Makefiles to reference this common,
  pre-chosen library against which to link. (Previously, these drivers
  unconditionally linked against the static library and would have
  failed if the static library build was disabled at configure-time.)
- Renamed many of the variables in common.mk and the top-level Makefile
  so that variables relating to the libblis.[a|so] files, including
  paths to those files, begin with "LIBBLIS".
- Shuffled around some of the library definitions from the top-level
  Makefile to common.mk.
- Renamed BLIS_ENABLE_DYNAMIC_BUILD to BLIS_ENABLE_SHARED_BUILD, and
  the @enable_dynamic@ anchor to @enable_shared@ in build/config.mk.in
  and in configure.
- A few other cleanups in the top-level Makefile.
2018-03-21 18:39:16 -05:00
Field G. Van Zee
97e1eeade3 Added input.operations.fast file for 'make check'.
Details:
- Added an 'input.operations.fast' file to testsuite directory to go
  along with the 'input.general.fast' file used by the 'make check'
  target in the top-level Makefile. This will allow the "fast" check
  to prune operations and/or parameter combinations from the test
  space in order to save time.
- Currently, input.operations.fast prunes trmm3 and all transposition
  and conjugation parameters from the level-3 test space.
- Reduced problem size tested in input.general.fast to 100 and disabled
  testing of 1m method.
2018-03-21 15:47:11 -05:00
Field G. Van Zee
664ec4813d Integrated f2c'ed netlib BLAS test suite.
Details:
- Created a new test suite that exercises only the BLAS compatibility
  found in BLIS. The test suite is a straightforward port of code
  obtained from netlib LAPACK, run through f2c and linked to a stripped-
  down version of libf2c that is compiled along with the test drivers
  (to prevent any obvious ABI issues). The new BLAS test suite can be
  run from within its new local directory, 'blastest' (through its local
  'make ; make run' targets) or from the top-level Makefile (via the
  'make testblas' target). Output files are created in whatever directory
  the test drivers are run, whether it be the 'blastest' directory, the
  top-level source distribution directory, or the out-of-tree directory
  in which 'configure' was run. Also, the results of the BLAS test suite
  can be checked via 'make checkblas', which summarizes the presence or
  absence of test failures in a single line printed to stdout.
- Updated the 'test' target to run both 'testblis' and 'testblas'.
- Added a new 'testblis-fast' target that runs the BLIS testsuite with
  smaller problem sizes, allowing it to finish more quickly.
- Added a 'make check' target, which runs 'checkblis-fast' and
  'checkblas'.
- Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of
  'testblis' (before calling the check-blistest.sh script to check the
  result manually).
- Renamed some targets in the top-level Makefile to be consistent between
  BLAS and BLIS.
2018-03-20 13:54:58 -05:00
Field G. Van Zee
c4f1d18b97 Minor typo fix to printing arch in testsuite.
Details:
- Mistakenly was calling bli_cpuid_query_id() instead of
  bli_arch_query_id() in the recent addition to the testsuite output
  that prints the active sub-configuration. The former function is
  only used for multi-architecture builds, whereas the latter is the
  more general option that also works for single configuration
  (including 'configure auto') builds.
2018-03-14 19:10:09 -05:00
Field G. Van Zee
fc6a184251 Print sub-configuration name in testsuite output.
Details:
- Added a line to the testsuite output that prints the name of the
  current/active sub-configuration. This is useful when linking the
  testsuite against multi-configuration builds because it confirms
  the sub-configuration that is actually being employed at runtime.
  Thanks to Devin Matthews for suggesting this feature.
2018-03-14 15:31:17 -05:00
Field G. Van Zee
1ef9360b1f Enable non-unit vector stride tests by default.
Details:
- Change "vector storage schemes to test" parameter in testsuite's
  input.general file to "cj". This means that both unit stride column
  vectors and non-unit stride column vectors will be tested in
  operations with vector operands (e.g. level-1v, level-1f, level-2).
- Very minor comment (typo) changes to input.operations.
2018-03-01 14:36:39 -06:00
Field G. Van Zee
8c4e55a1a1 Added individual operation overrides in testsuite.
Details:
- Updated the testsuite driver so that setting one or more individual
  operation test switches to "2" in input.operations will enable ONLY
  those operations and disable all others, regardless of the values of
  the section overrides and other operation switches. This makes it
  very easy to quickly test only one or two operations, and equally
  easy to revert back to the previous combination of operation tests.
- Added more comments to input.operations describing the use of
  individual "enable only" overrides.
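
The "enable only" rule described above can be sketched as a small predicate. This is an illustration rather than the testsuite's actual code: the names are hypothetical, and the meanings of switch values 0 (disabled) and 1 (enabled, subject to the section override) are assumptions; only the value 2 is described in this commit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed switch values; only the meaning of 2 comes from the text above. */
enum { OP_DISABLE = 0, OP_ENABLE = 1, OP_ENABLE_ONLY = 2 };

/* Returns true if any operation switch is set to "enable only". */
static bool any_enable_only(const int *switches, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (switches[i] == OP_ENABLE_ONLY) return true;
    return false;
}

/* Decide whether operation i should run. If any switch is 2, only the
   operations set to 2 run, regardless of the section override and of
   the other switches. Otherwise the usual rule applies. */
bool op_is_enabled(const int *switches, size_t n, size_t i,
                   bool section_enabled)
{
    if (any_enable_only(switches, n))
        return switches[i] == OP_ENABLE_ONLY;
    return section_enabled && switches[i] == OP_ENABLE;
}
```

Reverting to the previous combination of tests then amounts to changing the 2 back to its earlier value, since no other switch had to be edited.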
2018-02-28 17:01:47 -06:00
Field G. Van Zee
16813335bd Merge branch 'amd' into rt
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
  Special thanks to AMD for their contributions to-date, especially with
  regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
  bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
  the extra cost of transposing the microtile in registers, this is
  much faster than using the general storage case when the underlying
  matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
  column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
  kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
  to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
  family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
  - only one vzeroupper instruction is called prior to returning
  - movapd/movupd are used in lieu of movaps/movups for double-real
    microkernels. (Note that single-real microkernels still use
    movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
  now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
  in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
  behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
  sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
  comments.
- Updated LICENSE file to explicitly mention that parts are copyright
  UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.

Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
  s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
  and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
  implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
  sum of the squares via dotv(). If there is a floating-point exception
  (FE_OVERFLOW), then the previous (numerically conservative) code is
  used; otherwise, the result of dotv() is square-rooted and stored as
  the result. This new implementation is only enabled when FE_OVERFLOW
  is #defined. If the macro is not #defined, then the previous
  implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
2018-02-21 17:43:32 -06:00