Commit Graph

32 Commits

Author SHA1 Message Date
kdevraje
13806ba3b0 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
Minh Quan Ho
579145039d Initialize error messages at compile time (#289)
* Initialize error messages at compile time

- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.

* Retired bli_error_init(), _finalize().

Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
  bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
  the former two in bli_init.c.

* Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
  'configure --enable-cblas auto'.
2019-01-07 16:00:15 -06:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
06c23954e6 Defined unified bli_pthreads_*() API for all OSes.
Details:
- Expanded the bli_pthread_*() -> pthread_*() wrappers in
  frame/thread/bli_pthread.c to include cases for Windows taken from
  frame/base/bli_pthread_wrap.c. Now, bli_thread_*() is always defined
  and always used by BLIS and the BLIS testsuite (in lieu of calling
  pthreads directly, as before). The implementation used in this new
  API depends on whether we are building for Windows, and to a lesser
  extent, whether we are building on OS X. For the core API, Windows
  uses Windows threads, non-Windows (Linux, OS X) uses pthreads.
  OS X and Windows get barriers implemented in terms of other
  bli_pthread_*() functions, and Linux gets barriers implemented in
  terms of pthread_barrier*(). This commit addresses issue #273.
- Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(),
  which was erroneously calling pthread_mutex_lock().
- Minor changes to configure so that the auto-detection executable
  can be built given the above changes (most notably, turning on
  POSIX extensions via -D_GNU_SOURCE).
- Removed temporary play-test code for shiftd that accidentally got
  committed into test/3m4m/test_gemm.c.
2018-10-23 19:16:54 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
10d07357af Better thread safety; added threading to testsuite.
Details:
- Replaced critical sections that were conditional upon multithreading
  being enabled (via pthreads or OpenMP) with unconditional use of
  pthreads mutexes. (Why pthreads? Because BLIS already requires it
  for its initialization mechanism: pthread_once().) This was done in
  bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
  mtx_t object and bli_mutex_*() API with pthread mutexes in
  bli_thread.c. The previous status quo could result in a race condition
  if the application called BLIS from more than one thread. The new
  pthread-based code should be completely agnostic to the application's
  threading configuration. Thanks to AMD for bringing to our attention
  the need for a thread-safety review.
- Added an option to the testsuite to simulate application-level
  multithreading. Specifically, each thread maintains a counter that is
  incremented after each experiment. The thread only executes the
  experiment if: counter % n_threads == thread_id. In other words, the
  threads simply take turns executing each problem experiment. Also,
  POSIX guarantees that fprintf() will not intermingle output, so
  output was switched to fprintf() instead of libblis_test_fprintf().
- Changed membrk_t objects to use pthread_mutex_t intead of mtx_t and
  replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
  wrappers to pthread_mutex_init()/_destroy().
- Changed the implementation of bli_l3_ind_oper_enable_only() to fix
  a race condition; specifically, two threads calling the function with
  the same parameters could lead to a non-deterministic outcome.
- Added #include <pthread.h> to bli_cpuid.c and moved the same in
  bli_arch.c.
- Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
- Added #include <pthread.h> to bli_system.h.
- Added add-copyright.py script to automate adding new copyright lines
  to (and updating existing lines of) source files.
2018-08-26 20:34:30 -05:00
Field G. Van Zee
a7e3a5f975 Fixed uncallable bli_finalize().
Details:
- Previously, bli_finalize_once()--which, like bli_init_once(), was
  implemented in terms of pthread_once()--was using the same
  pthread_once_t control object being used by bli_init(), thus
  guaranteeing that it would never be called as long as BLIS had already
  been initialized. This could manifest as a rather large memory leak to
  any application that attempted to finalize BLIS midway through its
  execution (since BLIS reserves several megabytes of storage for
  packing buffers per thread used). The fix entailed giving each
  function its own pthread_once_t object. Thanks to Devangi Parikh for
  helping track down this very quiet bug.
2018-08-24 14:51:11 -05:00
Field G. Van Zee
16813335bd Merge branch 'amd' into rt
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
  Special thanks to AMD for their contributions to-date, especially with
  regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
  bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
  the extra cost of transposing the microtile in registers, this is
  much faster than using the general storage case when the underlying
  matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
  column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
  kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
  to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
  family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
  - only one vzeroupper instruction is called prior to returning
  - movapd/movupd are used in leiu of movaps/movups for double-real
    microkernels. (Note that single-real microkernels still use
    movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
  now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
  in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
  behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
  sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
  comments.
- Updated LICENSE file to explicitly mention that parts are copyright
  UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.

Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
  s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
  and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
  implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
  sum of the squares via dotv(). If there is a floating-point exception
  (FE_OVERFLOW), then the previous (numerically conservative) code is
  used; otherwise, the result of dotv() is square-rooted and stored as
  the result. This new implementation is only enabled when FE_OVERFLOW
  is #defined. If the macro is not #defined, then the previous
  implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
2018-02-21 17:43:32 -06:00
Field G. Van Zee
83316485ce Simplified/fixed self-initialization.
Details:
- Fixed a race condition in self-initialization whereby the bli_is_init
  static variable could be erroneously read as TRUE by thread 1 while
  thread 0 is still executing bli_init_apis(), thus allowing thread 1 to
  use the library before it is actually ready. Thanks to to Minh Quan Ho
  and Devin Matthews for pointing out this issue.
- Part of the solution to the aforementioned race condition was involved
  replacing the runtime initialization of the global scalar constants
  (e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static
  initialization of those same constants. This eliminates the need for
  bli_const_init() altogether. (The static initialization is made concise
  via preprocess macros.)
- Defined bli_gks_query_cntx_noinit(), which behaves just like
  bli_gks_query_cntx(), except that it does not call bli_init_once(). This
  function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and
  bli_memsys_init() so as to not result in any recursion into
  bli_init_once().
- Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants.
  They have no use in BLIS or its test products, and we have little reason
  to believe they are used by others.
- Removed testsuite/out file, which was accidentally committed as part
  of 70640a3.
2017-12-13 14:14:50 -06:00
Field G. Van Zee
70640a3710 Implemented library self-initialization.
Details:
- Defined two new functions in bli_init.c: bli_init_once() and
  bli_finalize_once(). Each is implemented with pthread_once(), which
  guarantees that, among the threads that pass in the same pthread_once_t
  data structure, exactly one thread will execute a user-defined function.
  (Thus, there is now a runtime dependency against libpthread even when
  multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
  computational operations as well as many other functions in BLIS to
  all but guarantee that BLIS will self-initialize through the normal
  use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
  functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
  BLAS compatibility layer to take and return no arguments. (The
  previous API that tracked whether BLIS was initialized, and then
  only finalized if it was initialized in the same function, was too
  cute by half and borderline useless because by default BLIS stays
  initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
  bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
  bli_ind.c. We don't need to track initialization at the sub-API level,
  especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
  level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
  bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
  name. These functions had no use cases within BLIS and likely none
  outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
  main() function, and likewise for standalone test drivers in 'test'
  directory, so that self-initialization is exercised by default.
2017-12-11 17:18:43 -06:00
Nisanth M P
3a44118398 Added AMD copyright line to the changed files in last 3 commits
Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66
2017-12-11 12:41:02 +05:30
Nisanth M P
9c0a3c4c02 Thread Safety: Move bli_init() before and bli_finalize() after main()
BLIS provides APIs to initialize and finalize its global context.
One application thread can finalize BLIS, while other threads
in the application are stil using BLIS.

This issue can be solved by removing bli_finalize() from API.
One way to do this is by getting bli_finalize() to execute by default
after application exits from main().

GCC supports this behaviour with the help of __attribute__((destructor))
added to the function that need to be executed after main exits.

Similarly bli_init() can be made to run before application enters main()
so that application need not call it.

Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac
2017-12-11 12:12:29 +05:30
Field G. Van Zee
42dcd589c3 Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-12-11 12:12:29 +05:30
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
f5962a1aae Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
  into a k x k triangular matrix for the purposes of obtaining an mr x k
  micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
  very large k (depending on the product of mr x kc on that architecture).
  The bug arose from the fact that the test module was triggering the
  allocation of blocks from the internal memory pools, which are limited in
  size. This allocation imposes an implicit assumption that the micro-
  panel being tested with will fit inside, and this assumption is violated
  for large values of k. Arbitrarily large k may now be tested for both
  operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
  statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
  issues.
2017-09-26 17:00:04 -05:00
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Field G. Van Zee
7cd01b71b5 Implemented dynamic allocation for packing buffers.
Details:
- Replaced the old memory allocator, which was based on statically-
  allocated arrays, with one based on a new internal pool_t type, which,
  combined with a new bli_pool_*() API, provides a new abstract data
  type that implements the same memory pool functionality but with blocks
  from the heap (ie: malloc() or equivalent). Hiding the details of the
  pool in a separate API also allows for a much simpler bli_mem.c family
  of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
  sane defaults for the values previously found in bli_config. Those
  values can be overridden by #defining them in bli_config.h the same
  way kernel defaults can be overridden in bli_kernel.h. This file most
  resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
  defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
  blocks in the memory pool. Also added a corresponding query routine to
  the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
  reflection, it seems that the goal of more predictable L1 cache
  replacement behavior is outweighed by the harm caused by non-contiguous
  micro-panels when k % kc != 0. I honestly don't think anyone will even
  miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
  bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
  given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
  which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
  modules. Thanks to Devangi Parikh for pointing out these
  miscalculations.
- Comment, whitespace changes.
2015-06-19 11:31:53 -05:00
Field G. Van Zee
5f93cbe870 Introduced API-level initialization.
Details:
- Added API-level initialization state to _const, _error, _mem, _thread,
  _ind, and _cntl APIs. While this functionality will mostly go unused,
  adding miniscule overhead at init-time, there will be at least once
  instance in the near future where, in order to avoid an infinite loop,
  a certain portion of the initialization will call a query function that
  itself attempts to call bli_init(). API-level initialization will allow
  this later stage to verify that an earlier stage of initialization has
  completed, even if the overall call to bli_init() has not yet returned.
- Added _is_initialized() functions for each API, setting the underlying
  bool_t during _init() and unsetting it during _finalize().
- Comment, whitespace changes.
2015-06-11 18:52:12 -05:00
Field G. Van Zee
b6ee82a3d4 Minor cleanup to bli_init() and friends.
Details:
- Spun-off initialization of global scalar constants to bli_const_init()
  and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing
  to do.
2015-06-03 12:14:23 -05:00
Field G. Van Zee
1213f5ceba POSIX thread bugfixes/edits to bli_init.c, _mem.c.
Details:
- Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
  was used to lock access to initialization/finalization actions.
  But everything worked out okay as long as bli_init() was called by
  single-threaded code.
- Changed to static initialization for memory allocator mutex in
  bli_mem.c, and moved mutex to that file (from bli_init.c).
- Fixed some type mismatches in bli_threading_pthreads.c that resulted
  in compiler warnings.
- Fixed a small memory leak with allocated-but-never-freed (and unused)
  pthread_attr_t objects.
- Whitespace changes to bli_init.c and bli_mem.c.
2015-06-02 13:27:47 -05:00
Field G. Van Zee
f1a6b7d028 Reorganized code for induced complex methods.
Details:
- Consolidated most of the code relating to induced complex methods
  (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
  are now enabled on a per-operation basis. The current "available"
  (enabled and implemented) implementation can then be queried on
  an operation basis. Micro-kernel func_t objects as well as blksz_t
  objects can also be queried in a similar maner.
- Redefined several micro-kernel and operation-related functions in
  bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
  and nr blksz_t objects for each cache blocksize (and are NULL for
  register blocksizes). Renamed the sub-blocksize field "sub" to
  "mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
  trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
  identifies an operation. For now, only level-3 id values are defined,
  along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
  are tested (one at a time) rather than only testing the first
  available method.
- Reformated summary at the beginning of testsuite output so that
  blocksize and micro-kernel info is shown for each induced method
  that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
  testsuite output (from approx. 90 to 80).
2015-03-18 15:37:10 -05:00
Tyler Michael Smith
f3d729e504 Added static mutex to bli_init and bli_finalize 2014-11-26 22:25:24 -06:00
Tyler Michael Smith
d71cc79786 Refactored bli_threading files and added support for pthreads 2014-11-26 21:36:39 -06:00
Field G. Van Zee
a41e68e09e Reimplemented BLIS initialization/finalization.
Details:
- Rewrote bli_init() and bli_finalize() with OpenMP critical sections
  for thread-safety. Also added lots of explanatory comments.
- Renamed bli_init_safe() and bli_finalize_safe() with the _auto()
  suffix, and reimplemented for simplicity. Updated all invocations
  in BLAS compatibility layer to use _auto() suffix.
2014-07-17 13:25:56 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Tyler Smith
c51d011083 Initial multithreading support for HERK 2014-03-17 15:00:47 -05:00
Tyler Smith
8d8f4352a4 Added single threaded thread info data structures specifically for gemm and packm 2014-03-10 15:47:28 -05:00
Tyler Smith
2e727a025a Modifying the thread info data structures
This change makes each operation have its own thread info type,
allowing more fine control of threading in operations that have different types of suboperations
2014-03-10 15:14:33 -05:00
Tyler Smith
01b125e815 First pass at adding parallelism to BLIS.
Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
Currently, gemm blocked variants 1f and 2f, and packm variant blocked variant 1 is parallelized.
2014-02-27 11:55:45 -06:00
Field G. Van Zee
2cb13600f9 Updated year in copyright headers to 2014. 2014-01-03 12:29:13 -06:00
Field G. Van Zee
b65cdc57d9 Migrated 'bl2' prefix to 'bli'.
Details:
- Changed all filename and function prefixes from 'bl2' to 'bli'.
- Changed the "blis2.h" header filename to "blis.h" and changed all
  corresponding #include statements accordingly.
- Fixed incorrect association for Fran in CREDITS file.
2013-03-24 20:01:49 -05:00