Commit Graph

130 Commits

Author SHA1 Message Date
Field G. Van Zee
40b392de3f Annotated additional symbols for export.
Details:
- Added export annotations to additional function prototypes in order to
  accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
  test_amaxv.c.
2019-08-23 14:18:07 +05:30
kdevraje
13806ba3b0 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
Field G. Van Zee
bdd46f9ee8 Rewrote reference kernels to use #pragma omp simd.
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
  indexing annotated by the #pragma omp simd directive, which a compiler
  can use to vectorize certain constant-bounded loops. (The new kernels
  actually use _Pragma("omp simd") since the kernels are defined via
  templatizing macros.) Modest speedup was observed in most cases using
  gcc 5.4.0, which may improve with newer versions. Thanks to Devin
  Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
  be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
  respectively, with a default row preference for the gemm ukernel. Also
  updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
  respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
  option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
  which compiler is detected (via macros such as __GNUC__). These
  prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
  build/templates and removed AMD as a default copyright holder.
2019-01-24 17:23:18 -06:00
Field G. Van Zee
2f3174330f Implemented a pool-based small block allocator.
Details:
- Implemented a sophisticated data structure and set of APIs that track
  the small blocks of memory (around 80-100 bytes each) used when
  creating nodes for control and thread trees (cntl_t and thrinfo_t) as
  well as thread communicators (thrcomm_t). The purpose of the small
  block allocator, or sba, is to allow the library to transition into a
  runtime state in which it does not perform any calls to malloc() or
  free() during normal execution of level-3 operations, regardless of
  the threading environment (potentially multiple application threads
  as well as multiple BLIS threads). The functionality relies on a new
  data structure, apool_t, which is (roughly speaking) a pool of
  arrays, where each array element is a pool of small blocks. The outer
  pool, which is protected by a mutex, provides separate arrays for each
  application thread while the arrays each handle multiple BLIS threads
  for any given application thread. The design minimizes the potential
  for lock contention, as only concurrent application threads would
  need to fight for the apool_t lock, and only if they happen to begin
  their level-3 operations at precisely the same time. Thanks to Kiran
  Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
  by default; renamed the --[dis|en]able-packbuf-pools option to
  --[dis|en]able-pba-pools; and rewrote the --help text associated with
  this new option and consolidated it with the --help text for the
  option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
  a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
  do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
  used for small blocks with calls to bli_sba_acquire(), which takes a
  rntm (in addition to the bytes requested), and bli_sba_release().
  These latter two functions reduce to the former two when the sba pools
  are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
  required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
  in the block_size) from bli_membrk_acquire_m() to the implementation
  of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
  relative to that of gemmtrsm_ukr, in part to avoid the need to create
  a packm control tree node, which now requires a rntm_t that has been
  initialized with an sba and membrk.
- Re-enable explicit call bli_finalize() in testsuite so that users who
  run the testsuite with memory tracing enabled can check for memory
  leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
  the rntm to be copied locally when it is passed in via one of the
  expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
  thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
  formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
2018-12-25 19:35:01 -06:00
Field G. Van Zee
f808d829c5 Handle edge cases, zero-filling in packm kernels.
Details:
- Updated the API and semantics of packm kernels such that they must now
  handle edge cases, meaning that a c-by-k packm kernel must be able to
  pack edge cases that are fewer than c rows/columns and be able to
  zero-fill the remaining elements. They must also be able to zero-fill
  the equivalent region when copying fewer than k columns/rows (which is
  needed by trsm). The new packm kernel API is generally:

    void packm_kernel
         (
           conj_t           conja,
           dim_t            cdim,
           dim_t            n,
           dim_t            n_max,
           ctype*  restrict kappa,
           ctype*  restrict a, inc_t inca, inc_t lda,
           ctype*  restrict p,             inc_t ldp,
           cntx_t* restrict cntx
         );

  where cdim and n are the dimensions (short and long, respectively) of
  the submatrix being copied from the source matrix A, and n_max is the
  "full" long dimension (corresponding to the k dimension in gemm) of
  the micropanel. The "full" short dimension (corresponding to the
  register blocksize MR or NR) is not part of the API because it is
  known intrinsically by the packm kernel implementation. Thanks to
  Devin Matthews for prompting us to make this change (#282).
- Updated all reference packm kernels in ref_kernels/1m according to
  above changes, as well as all optimized packm kernels (which only
  consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
  I was considering leaving it unchanged, but I couldn't escape the
  reality that the packm kernel API is much closer to an expert API
  than it is some obscure helper function interface within the framework
  that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
  that would have been using those kernels is knc, which is likely no
  longer being used by very many people (if any). (This also mostly
  offset the larger object code footprint incurred by moving the edge-
  case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
  which those implementations were modifying the contexts stored in the
  gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
  method implementations of trsm from executing.
2018-12-12 15:22:59 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
375eb30b0a Added mixed-precision support to 1m method.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
  datatypes (along with the computation datatype) are equal. Now, 1m may
  be used as long as all operands are stored in the complex domain. This
  change largely consisted of adding the ability to pack to 1e and 1r
  formats from one precision to another. It also required adding logic
  for handling complex values of alpha to bli_packm_blk_var1_md()
  (similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
  bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
  ukernel output preference field being read. Previously, the preference
  for the native complex ukernel was being read instead of the pref for
  the native real domain ukernel. This bug would not manifest if the
  preference for the native complex ukernel happened to be equal to that
  of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
  module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
  schemas are always read from the context, rather than trying to
  sometimes embed them directly to the A and B objects. (They are still
  embedded, but now uniformly only after reading the schemas from the
  context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
  and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
  consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
  bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
  gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
  scal2ris, and xpbyris (and their conjugating counterparts) to
  explicitly support three operand types and updated invocations to
  xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
  storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
  frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.
2018-12-03 17:49:52 -06:00
Field G. Van Zee
e769bf46b0 Tweak testsuite to issue FAIL for Nan, Inf (#279).
Details:
- Adjusted the definition for libblis_test_get_string_for_result() in
  testsuite/src/test_libblis.c so that the "FAIL" string is returned if
  the computed residual contains either NaN or Inf. Previously, a
  residual containing NaN would result in the selection of the "PASS"
  string. Thanks to Devin Matthews for reporting this issue (#279).
- Expounded on comment for the macro definitions of bli_isnan() and
  bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
  must remain macros.
2018-11-20 16:16:53 -06:00
Field G. Van Zee
4bbb454bf3 Testsuite docs update for mixed-datatype gemm.
Details:
- Updated docs/Testsuite.md to include mention of the new mixed-domain
  and mixed-precision settings, including descriptions.
- Updated docs/MixedDatatypes.md to include a brief section on running
  the testsuite to exercise mixed-datatype functionality, which mostly
  amounts to a link to the Testsuite.md document.
- Minor verbiage change to testsuite output to correct a misleading
  label associated with the value returned by the query function
  bli_info_get_simd_num_registers(). (The function does not return the
  number of SIMD registers present in the hardware, but rather a maximum
  assumed value for the purposes of allocating temporary microtile
  workspace on the function stack.)
2018-11-03 19:11:01 -05:00
Field G. Van Zee
f19c33af4c Disallow 64b BLAS integers + 32b BLIS integers.
Details:
- Print an error message from configure if the user attempts to
  explicitly configure BLIS for simultaneous use of 64-bit integers in
  the BLAS API with 32-bit integers in the BLIS API.
- Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
  integers be 64 bits if the BLAS integers are 64 bits. This and the
  above item take care of issue #274. Thanks to Devin Matthews and
  Jeff Hammond for suggesting these safeguards.
- Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
  sections and BLIS integer size line of the testsuite configuration
  output.
- Very minor edits to docs/MixedDatatypes.md.
2018-10-26 17:07:15 -05:00
Field G. Van Zee
090e4f08fc Merge branch 'master' into dev 2018-10-19 18:41:10 -05:00
Field G. Van Zee
3678a1cd51 Merge branch 'master' into win-pthreads 2018-10-19 16:11:31 -05:00
Field G. Van Zee
473ce54f5f Added bli_pthread_*() API.
Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
  against a Windows DLL, will be able to access pthreads functionality
  without those pthreads functions being explicitly exported by the DLL.
  Instead, we export the bli_pthread_*() layer, which uses types and
  functions that are identical to pthreads, but adds a 'bli_' prefix.
  Only a few basic functions are present in the bli_pthreads_*() API
  for now. Thanks to Devin Matthews and Isuru Fernando for their help
  on a related PR (#261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
  pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment updated to build/regen-symbols.sh.
2018-10-18 19:03:56 -05:00
Field G. Van Zee
bb6df2814f Defined a new level-1d operation: shiftd.
Details:
- Defined a new level-1d operation called 'shiftd', including object and
  typed APIs. This operation adds a scalar value to every element along
  an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
  terms of the addv kernel. (The scalar is passed in as the x vector
  with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
  matrix object) with use of shiftd, which is much more concise, in
  various test driver files in the testsuite. Similar changes were made
  to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
  for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
  BLISObjectAPI.md.
2018-10-18 17:11:39 -05:00
Field G. Van Zee
49d3f9fcbb Merge branch 'master' into dev 2018-10-17 18:00:40 -05:00
Devin Matthews
29e6245816 Merge branch 'master' into win-pthreads 2018-10-16 10:12:25 -05:00
Field G. Van Zee
ed65771482 Fixed merge fail on testsuite threading macros.
Details:
- Applied the following C preprocessor macro renames

    BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
    BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
    BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
    BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

  in src/test_libblis.c. This is apparently the result of a failure by
  git to properly merge the 'master' and 'amd' branches in the previous
  commit. (The 'master' branch contained a commit, 53a9ab1, in which
  these same cpp macros were renamed throughout the source distribution.
2018-10-15 17:54:45 -05:00
Field G. Van Zee
5fec95b99f Implemented mixed-datatype support for gemm.
Details:
- Implemented support for gemm where A, B, and C may have different
  storage datatypes, as well as a computational precision (and implied
  computation domain) that may be different from the storage precision
  of either A or B. This results in 128 different combinations, all
  which are implemented within this commit. (For now, the mixed-datatype
  functionality is only supported via the object API.) If desired, the
  mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
  that requires a single m-by-n matrix be allocated (temporarily) per
  call to gemm. This optimization aims to avoid the overhead involved in
  repeatedly updating C with general stride, or updating C after a
  typecast from the computation precision. This memory optimization may
  be disabled at configure-time (provided that the mixed-datatype
  support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
  The user may test gemm with mixed domains, precisions, both, or
  neither.
- Added a standalone test driver directory for building and running
  mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
  except that imaginary values are not touched when casting a real
  operand to a complex operand. (By contrast, in these situations castm
  sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
  usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
  also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
  when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
  accessor functions. Also commented out all usage of accessor
  functions within macrokernels. (Typecasting in the microkernel is
  still feasible, though probably unrealistic for now given the
  additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
  pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
  docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
  (courtsey of Devin Matthews).
2018-10-15 16:37:39 -05:00
Field G. Van Zee
f1dba506c9 Output threading status/params from testsuite.
Details:
- Updated testsuite to output various parameters related to parallelism
  in BLIS. These parameters include:
  - threading status: disabled, openmp, or pthreads;
  - thread partitioning for jr/ir loops: slab or rr (round-robin);
  - ways of parallelism from environment variables, and also actual
    values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
    square problems (assuming all dimensions are set to 1000);
  - automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
  libmemkind and the sandbox.
2018-10-08 17:59:41 -05:00
Devin Matthews
b8dfd82e0d Get pthreads via blis.h in the test driver. 2018-10-02 15:37:12 -05:00
Devin Matthews
627d0c5bfd Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows. 2018-10-02 14:40:55 -05:00
Field G. Van Zee
c03728f1f4 Various minor cleanups.
Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
  unconditionally, but differently for Windows and non-Windows, but
  then disabled the definition of bli_setenv() entirely since BLIS
  no longer needs to set environment variables. Updated bli_winsys.h
  accordingly, and call bli_sleep() from within testsuite instead of
  sleep() directly.
- Use
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
  instead of
    #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
  when guarding against local definition of pthread barrier in
  testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
  should always be set to 200809L when barriers are supported, though I
  won't be surprised if we encounter a case in the future where it is
  set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
  top-level Makefile. These are no longer needed because we no longer
  output libraries with the version and configuration name as
  substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
  configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.
2018-09-10 17:54:27 -05:00
Mathieu Poumeyrol
4e7d06700f second __APPLE__ 2018-09-06 23:48:31 +02:00
Mathieu Poumeyrol
24ecc0d94a use _POSIX_BARRIERS instead of __APPLE__ 2018-09-06 22:10:16 +02:00
Mathieu Poumeyrol
d688a2b7e5 add an adhoc impl for pthread_barrier 2018-09-06 15:31:14 +02:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
10d07357af Better thread safety; added threading to testsuite.
Details:
- Replaced critical sections that were conditional upon multithreading
  being enabled (via pthreads or OpenMP) with unconditional use of
  pthreads mutexes. (Why pthreads? Because BLIS already requires it
  for its initialization mechanism: pthread_once().) This was done in
  bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
  mtx_t object and bli_mutex_*() API with pthread mutexes in
  bli_thread.c. The previous status quo could result in a race condition
  if the application called BLIS from more than one thread. The new
  pthread-based code should be completely agnostic to the application's
  threading configuration. Thanks to AMD for bringing to our attention
  the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
  multithreading. Specifically, each thread maintains a counter that is
  incremented after each experiment. The thread only executes the
  experiment if: counter % n_threads == thread_id. In other words, the
  threads simply take turns executing each problem experiment. Also,
  POSIX guarantees that fprintf() will not intermingle output, so
  output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t intead of mtx_t and
  replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
  wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
  a race condition; specifically, two threads calling the function with
  the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
  bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
  to (and updating existing lines of) source files.
2018-08-26 20:34:30 -05:00
Field G. Van Zee
017548314f Replaced function chooser macros w/ func ptr arrays.
Details:
- Previously, most object API functions (_oapi.c) used a function
  chooser macro that would expand out to an if-elseif-elseif-else
  conditional that used a num_t datatype to call the appropriate
  type-specific API (_tapi.c). This always felt a little hackish, and
  would get in the way somewhat of addig support for new num_t datatypes
  in the future. So, I've replaced that functionality with code that
  queries a function pointer that is then typecast appropriately. This
  model of function calling was already pervasive for kernels queried
  from the cntx_t structure. It was also already in use in various other
  functions, such as macrokernels, and this commit simply extends that
  pattern.
- The above change required many new files, mostly header files, that
  define the function types (mostly _ft.h) for the queriable functions
  as well as some source files to define the function pointer arrays and
  their corresponding query functions (_fpa.c). Various other function
  types, mostly for kernel function types, were renamed to reduce the
  potential for confusion with the function types for expert and basic
  (non-expert) typed API functions.
- Removed definitions for all of the "bli_call_ft_*()" function chooser
  macros from bli_misc_macro_defs.h.
2018-08-07 14:13:25 -05:00
Field G. Van Zee
94d5ef42c8 Adjusted gflops format spec in testsuite, test/3m4m.
Details:
- Changed the format specifier for the gflops column in the testsuite
  output from %7.3f to %7.2f. This was done mainly to keep the output
  aligned properly when the expected perfomance exceeded 1000 gflops.
  Also, two decimal places still conveys plenty of precision for all
  practical applications, including just eyeballing performance deltas
  between two executions (let alone two implementations).
- Changed the format specifier for gflops in the test/3m4m drivers
  from %6.3f to %7.2f (for the same reasons listed above).
2018-08-04 15:57:17 -05:00
Field G. Van Zee
ecbebe7c2e Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
  field of the cntx_t (context). The thrloop array holds the number of
  ways of parallelism (thread "splits") to extract per level-3
  algorithmic loop until those values can be used to create a
  corresponding node in the thread control tree (thrinfo_t structure),
  which (for any given level-3 invocation) usually happens by the time
  the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
  when invoking level-3 operations from two or more application threads.
  The race condition existed because the cntx_t, a pointer to which is
  usually queried from the global kernel structure (gks), is supposed to
  be a read-only. However, the previous code would write to the cntx_t's
  thrloop field *after* it had been queried, thus violating its read-only
  status. In practice, this would not cause a problem when a sequential
  application made a multithreaded call to BLIS, nor when two or more
  application threads used the same parallelization scheme when calling
  BLIS, because in either case all application theads would be using
  the same ways of parallelism for each loop. The true effects of the
  race condition were limited to situations where two or more application
  theads used *different* parallelization schemes for any given level-3
  call.
- In remedying the above race condition, the application or calling
  library can now specify the parallelization scheme on a per-call basis.
  All that is required is that the thread encode its request for
  parallelism into the rntm_t struct prior to passing the address of the
  rntm_t to one of the expert interfaces of either the typed or object
  APIs. This allows, for example, one application thread to extract 4-way
  parallelism from a call to gemm while another application thread
  requests 2-way parallelism. Or, two threads could each request 4-way
  parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
  of the level-3 implementation stack (with the most notable exception
  being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
  APIs. (A few internal functions gained the rntm_t* parameter even
  though they currently have no use for it, such as bli_l3_packm().)
  This required some internal calls to some of those functions to
  be updated since BLIS was already using those operations internally
  via the expert interfaces. For situations where a rntm_t object is
  not available, such as within packm/unpackm implementations, NULL is
  passed in to the relevant expert interfaces. This is acceptable for
  now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
  environment variables such as BLIS_NUM_THREADS and BLIS_*_NT  are only
  read once, at library initialization. (Thanks to Nathaniel Smith for
  suggesting this to avoid repeated calls getenv(), which can be slow.)
  Those values are recorded to a global rntm_t object. Public APIs, in
  bli_thread.c, are still available to get/set these values from the
  global rntm_t, though now the "set" functions have additional logic
  to ensure that the values are set in a synchronous manner via a mutex.
  If/when NULL is passed into an expert API (meaning the user opted to
  not provide a custom rntm_t), the values from the global rntm_t are
  copied to a local rntm_t, which is then passed down the function stack.
  Calling a basic API is equivalent to calling the expert APIs with NULL
  for the cntx and rntm parameters, which means the semantic behavior of
  these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
  and reimplemented, with the function now being able to treat the
  incoming rntm_t in a manner agnostic to its origin--whether it came
  from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
  parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
  as the corresponding "get" functions. The new model simplifies these
  interfaces so that one must either set the total number of threads, OR
  set all of the ways of parallelism for each loop simultaneously (in a
  single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
  (and two specific ways within each method) of requesting parallelism
  in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
2018-07-17 18:37:32 -05:00
Isuru Fernando
14648e1376 Native windows support using clang (#227)
* Add appveyor file

* Build script

* Remove fPIC for now

* copy as

* set CC and CXX

* Change the order of immintrin.h

* Fix testsuite header

* Move testsuite defs to .c

* Fix appveyor file

* Remove fPIC again and fix strerror_r missing bug

* Remove appveyor script

* cd to blis directory

* Fix sleep implementation

* Add f2c_types_win.h

* Fix f2c compilation

* Remove rdp and rename appveyor.yml

* Remove setenv declaration in test header

* set CPICFLAGS to empty

* Fix another immintrin.h issue

* Escape CFLAGS and LDFLAGS

* Fix more ?mmintrin.h issues

* Build x86_64 in appveyor

* override LIBM LIBPTHREAD AR AS

* override pthreads in configure

* Move windows definitions to bli_winsys.h

* Fix LIBPTHREAD default value

* Build intel64 in appveyor for now
2018-07-04 17:48:42 -05:00
Field G. Van Zee
bd02c4e9f7 Cleanups to testsuite, input.operations format.
Details:
- Removed the line in each operation entry in input.operations titled
  "test sequential front-end" and the corresponding support for the lines
  in the testsuite input parsing code. This line was included in the some
  of the earliest versions of the testsuite, back when I intended to
  eventually have separate multithreaded APIs. Specifically, I envisioned
  that multithreaded and sequential testing could be enabled or disabled
  on an operation level. However, BLIS evolved in a different direction
  and still does not have multithreaded-specific APIs (even if it will
  eventually someday). But even if it did have such APIs, I doubt I would
  allow the user to enable/disable them on an operation level. Thus, this
  was a zombie future parameter that was never used and never made sense
  to begin with. The one instance of the front_seq variable, used in the
  various libblis_test_<operation>() functions to guard the call to the
  operation test driver, that remains was commented out instead of
  deleted so that someday it could be easily changed via sed, if desired.
- Various minor cleanups to the testsuite code, including consolidating
  use of DISABLE and DISABLE_ALL and reexpressing certain conditional
  expressions in the libblis_test_<operation>() functions in terms of
  boolean functions.
2018-06-04 13:42:17 -05:00
Field G. Van Zee
4fb353bd90 Merge branch 'master' into dev 2018-05-13 17:50:51 -05:00
Field G. Van Zee
bf03503059 Renamed (shortened) a few build system variables.
Details:
- Renamed the following variables in config.mk (via build/config.mk.in):
    BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE
    BLIS_ENABLE_STATIC_BUILD        -> MK_ENABLE_STATIC
    BLIS_ENABLE_SHARED_BUILD        -> MK_ENABLE_SHARED
    BLIS_ENABLE_BLAS2BLIS           -> MK_ENABLE_BLAS
    BLIS_ENABLE_CBLAS               -> MK_ENABLE_CBLAS
    BLIS_ENABLE_MEMKIND             -> MK_ENABLE_MEMKIND
  and also renamed all uses of these variables in makefiles and makefile
  fragments. Notice that we use the "MK_" prefix so that those variables
  can be easily differentiated (such as via grep) from their "BLIS_" C
  preprocessor macro counterparts.
- Other whitespace changes to build/config.mk.in.
- Renamed the following C preprocessor macros in bli_config.h (via
  build/bli_config.h.in):
    BLIS_ENABLE_BLAS2BLIS        -> BLIS_ENABLE_BLAS
    BLIS_DISABLE_BLAS2BLIS       -> BLIS_DISABLE_BLAS
    BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE
  and also renamed all relevant uses of these macros in BLIS source
  files.
- Renamed "blas2blis" variable occurrences in configure to "blas", as
  was done in build/config.mk.in and build/bli_config.h.in.
- Renamed the following functions in frame/base/bli_info.c:
    bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas()
    bli_info_get_blas2blis_int_type_size()
                                    -> bli_info_get_blas_int_type_size()
- Remove bli_config.h during 'make cleanh' target of top-level Makefile.
2018-05-08 16:49:22 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
2018-05-08 14:26:30 -05:00
Field G. Van Zee
75d0d1057d Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and
  bli_param_macro_defs.h:
  - bli_obj_datatype()                 -> bli_obj_dt()
  - bli_obj_target_datatype()          -> bli_obj_target_dt()
  - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
  - bli_obj_set_datatype()             -> bli_obj_set_dt()
  - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size()                -> bli_dt_size()
  - bli_datatype_string()              -> bli_dt_string()
  - bli_datatype_union()               -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no
  longer in use.
2018-04-30 14:57:33 -05:00
Field G. Van Zee
c4f1d18b97 Minor typo fix to printing arch in testsuite.
Details:
- Mistakenly was calling bli_cpuid_query_id() instead of
  bli_arch_query_id() in the recent addition to the testsuite output
  that prints the active sub-configuration. The former function is
  only used for multi-architecture builds, whereas the latter is the
  more general option that also works for single configuration
  (including 'configure auto') builds.
2018-03-14 19:10:09 -05:00
Field G. Van Zee
fc6a184251 Print sub-configuration name in testsuite output.
Details:
- Added a line to the testsuite output that prints the name of the
  current/active sub-configuration. This is useful when linking the
  testsuite against multi-configuration builds because it confirms
  the sub-configuration that is actually being employed at runtime.
  Thanks to Devin Matthews for suggesting this feature.
2018-03-14 15:31:17 -05:00
Field G. Van Zee
8c4e55a1a1 Added individual operation overrides in testsuite.
Details:
- Updated the testsuite driver so that setting one or more individual
  operation test switches to "2" in input.operations will enable ONLY
  those operations and disable all others, regardless of the values of
  the section overrides and other operation switches. This makes it
  every easy to quickly test only one or two operations, and equally
  easy to revert back to the previous combination of operation tests.
- Added more comments to input.operations describing the use of
  individual "enable only" overrides.
2018-02-28 17:01:47 -06:00
Field G. Van Zee
16813335bd Merge branch 'amd' into rt
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
  Special thanks to AMD for their contributions to-date, especially with
  regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
  bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
  the extra cost of transposing the microtile in registers, this is
  much faster than using the general storage case when the underlying
  matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
  column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
  kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
  to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
  family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
  - only one vzeroupper instruction is called prior to returning
  - movapd/movupd are used in leiu of movaps/movups for double-real
    microkernels. (Note that single-real microkernels still use
    movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
  now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
  in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
  behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
  sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
  comments.
- Updated LICENSE file to explicitly mention that parts are copyright
  UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.

Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
  s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
  and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
  implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
  sum of the squares via dotv(). If there is a floating-point exception
  (FE_OVERFLOW), then the previous (numerically conservative) code is
  used; otherwise, the result of dotv() is square-rooted and stored as
  the result. This new implementation is only enabled when FE_OVERFLOW
  is #defined. If the macro is not #defined, then the previous
  implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
2018-02-21 17:43:32 -06:00
Field G. Van Zee
9804adfd40 Added option to disable pack buffer memory pools.
Details:
- Added a new configure option, --[en|dis]able-packbuf-pools, which will
  enable or disable the use of internal memory pools for managing buffers
  used for packing. When disabled, the function specified by the cpp
  macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed
  (and BLIS_FREE_POOL is called when the buffer is ready to be released,
  usually at the end of a loop). When enabled, which was the status quo
  prior to this commit, a memory pool data structure is created and
  managed to provide threads with packing buffers. The memory pool
  minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls
  BLIS_MALLOC_POOL), but does so through a somewhat more complex
  mechanism that may incur additional overhead in some (but not all)
  situations. The new option defaults to --enable-packbuf-pools.
- Removed the reinitialization of the memory pools from the level-3
  front-ends and replaced it with automatic reinitialization within the
  pool API's implementation. This required an extra argument to
  bli_pool_checkout_block() in the form of a requested size, but hides
  the complexity entirely from BLIS. And since bli_pool_checkout_block()
  is only ever called within a critical section, this change fixes a
  potential race condition in which threads using contexts with different
  cache blocksizes--most likely a heterogeneous environment--can check
  out pool blocks that are too small for the submatrices it wishes to
  pack. Thanks to Nisanth Padinharepatt for reporting this potential
  issue.
- Removed several functions in light of the relocation of pool reinit,
  including bli_membrk_reinit_pools(), bli_memsys_reinit(),
  bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool().
- Updated the testsuite to print whether the memory pools are enabled or
  disabled.
2017-12-21 19:22:57 -06:00
Field G. Van Zee
70640a3710 Implemented library self-initialization.
Details:
- Defined two new functions in bli_init.c: bli_init_once() and
  bli_finalize_once(). Each is implemented with pthread_once(), which
  guarantees that, among the threads that pass in the same pthread_once_t
  data structure, exactly one thread will execute a user-defined function.
  (Thus, there is now a runtime dependency against libpthread even when
  multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
  computational operations as well as many other functions in BLIS to
  all but guarantee that BLIS will self-initialize through the normal
  use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
  functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
  BLAS compatibility layer to take and return no arguments. (The
  previous API that tracked whether BLIS was initialized, and then
  only finalized if it was initialized in the same function, was too
  cute by half and borderline useless because by default BLIS stays
  initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
  bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
  bli_ind.c. We don't need to track initialization at the sub-API level,
  especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
  level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
  bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
  name. These functions had no use cases within BLIS and likely none
  outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
  main() function, and likewise for standalone test drivers in 'test'
  directory, so that self-initialization is exercised by default.
2017-12-11 17:18:43 -06:00
Field G. Van Zee
42dcd589c3 Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
95adc43d80 Moved 'family' field from cntx_t to cntl_t.
Details:
- Removed the family field inside the cntx_t struct and re-added it to the
  cntl_t struct. Updated all accessor functions/macros accordingly, as well
  as all consumers and intermediaries of the family parameter (such as
  bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
  change was motivated by the desire to keep the context limited, as much
  as possible, to information about the computing environment. (The family
  field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
  that operate only on a single struct to contain the "_node" suffix to
  differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
  They weren't being used and probably never will be.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
1d728ccb23 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2017-12-11 11:55:31 +05:30
Field G. Van Zee
270c65985d Modified bli_getopt() for thread-safety.
Details:
- Changed the interface of bli_getopt() to take a new argument, a getopt_t
  struct, that stores the values of optarg, optind, opterr, and optopt,
  and updated the implementation accordingly. (Previously,  these
  variables were assumed to be global.)
- Added a function for initializing a getopt_t struct.
- Changed test_libblis.c--currently the only consumer of bli_getopt()--to
  utilize the new getopt_t state object.
2017-12-08 15:21:18 -06:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
f5962a1aae Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-09-26 17:00:04 -05:00
Field G. Van Zee
c63980f4ca Moved 'family' field from cntx_t to cntl_t.
Details:
- Removed the family field inside the cntx_t struct and re-added it to the
  cntl_t struct. Updated all accessor functions/macros accordingly, as well
  as all consumers and intermediaries of the family parameter (such as
  bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
  change was motivated by the desire to keep the context limited, as much
  as possible, to information about the computing environment. (The family
  field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
  that operate only on a single struct to contain the "_node" suffix to
  differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
  They weren't being used and probably never will be.
2017-07-29 14:53:39 -05:00
Field G. Van Zee
126482a3b6 Implemented the 1m method.
Details:
- Implemented the 1m method for inducing complex domain matrix
  multiplication. 1m support has been added to all level-3 operations,
  including trsm, and is now the default induced method when native
  complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
  needed for the corresponding function for 1m (because 1m requires us
  to choose between column-oriented or row-oriented execution, which
  requires us to query the context for the storage preference of the
  gemm microkernel, which requires knowing the datatype) but I decided
  that it made sense for consistency to add the parameter to all other
  cntx initialization functions as well, even though those functions
  don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
  a second scalar for each blocksize entry. The semantic meaning of the
  two scalars now is that the first will scale the default blocksize
  while the second will scale the maximum blocksize. This allows scaling
  the two independently, and was needed to support 1m, which requires
  scaling for a register blocksize but not the register storage
  blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
  bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
  default and maximum blocksizes to some desired blocksize multiple.
  These functions are needed in the updated definitions of
  bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
  1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
  certain circumstances (specifically, real domain beta and row- or
  column-stored matrix C), the real domain macrokernel and microkernel
  to be called directly, rather than using the virtual microkernel
  via the complex domain macrokernel, which carries a slight additional
  amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
  some code in test_gemm.c driver.
2016-11-25 18:29:49 -06:00