Commit Graph

24 Commits

Author SHA1 Message Date
Field G. Van Zee
00e14cb6d8 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-07-29 14:24:34 -05:00
Field G. Van Zee
72f6ed0637 Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
  BLIS_INLINE instead of the static keyword. This allows blis.h to
  use a different keyword (inline) to define these functions when
  compiling with C++, which might otherwise trigger "defined but
  not used" warning messages. Thanks to Giorgos Margaritis for
  reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
  hardware auto-detection facility, to unconditionally #define
  BLIS_INLINE to the static keyword (since we know BLIS will be
  compiled with C, not C++):
    build/detect/config/config_detect.c
    frame/base/bli_arch.c
    frame/base/bli_cpuid.c
- CREDITS file update.
2020-07-03 17:55:54 -05:00
kdevraje
13806ba3b0 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019
Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041
2019-05-27 16:24:43 +05:30
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
375eb30b0a Added mixed-precision support to 1m method.
Details:
- Lifted the constraint that 1m only be used when all operands' storage
  datatypes (along with the computation datatype) are equal. Now, 1m may
  be used as long as all operands are stored in the complex domain. This
  change largely consisted of adding the ability to pack to 1e and 1r
  formats from one precision to another. It also required adding logic
  for handling complex values of alpha to bli_packm_blk_var1_md()
  (similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
  bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
  ukernel output preference field being read. Previously, the preference
  for the native complex ukernel was being read instead of the pref for
  the native real domain ukernel. This bug would not manifest if the
  preference for the native complex ukernel happened to be equal to that
  of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
  module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
  schemas are always read from the context, rather than trying to
  sometimes embed them directly to the A and B objects. (They are still
  embedded, but now uniformly only after reading the schemas from the
  context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
  and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
  consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
  bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
  gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
  scal2ris, and xpbyris (and their conjugating counterparts) to
  explicitly support three operand types and updated invocations to
  xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
  storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
  frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.
2018-12-03 17:49:52 -06:00
Field G. Van Zee
c92762ecdc Added option of slab or rr partitioning in jr/ir.
Details:
- Updated existing macrokernel function names and definitions to
  explicitly use slab assignment of micropanels to threads, then created
  duplicate versions of macrokernels that explicitly use round-robin
  assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
  were not substantially updated in this commit because they are
  currently disabled in bli_trsm_front.c.
- Updated existing packing function (in blk_packm_blk_var1.c) to
  explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
  and packm function pointers depending on which method (slab or rr) was
  enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
  option (-m [slab|rr] for short), which allows the user to explicitly
  request either slab or round-robin assignment (partitioning) of
  micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
2018-10-07 20:30:32 -05:00
Field G. Van Zee
ac18949a4b Multithreading optimizations for l3 macrokernels.
Details:
- Adjusted the method by which micropanels are assigned to threads in
  the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
  employ contiguous "slab" partitioning rather than interleaved (round
  robin) partitioning. The new partitioning schemes and related details
  for specific families of operations are listed below:
  - gemm: slab partitioning.
  - herk: slab partitioning for region corresponding to non-triangular
          region of C; round robin partitioning for triangular region.
  - trmm: slab partitioning for region corresponding to non-triangular
          region of B; round robin partitioning for triangular region.
          (NOTE: This affects both left- and right-side macrokernels:
          trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
  - trsm: slab partitioning.
          (NOTE: This only affects only left-side macrokernels trsm_ll,
          trsm_lu; right-side macrokernels were not touched.)
  Also note that the previous macrokernels were preserved inside of
  the 'other' directory of each operation family directory (e.g.
  frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
  and fixed a stale function pointer type in blx_gemm_int.c
  (gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
  and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
  and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().
2018-09-30 18:54:56 -05:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
ecbebe7c2e Defined rntm_t to relocate cntx_t.thrloop (#235).
Details:
- Defined a new struct datatype, rntm_t (runtime), to house the thrloop
  field of the cntx_t (context). The thrloop array holds the number of
  ways of parallelism (thread "splits") to extract per level-3
  algorithmic loop until those values can be used to create a
  corresponding node in the thread control tree (thrinfo_t structure),
  which (for any given level-3 invocation) usually happens by the time
  the macrokernel is called for the first time.
- Relocating the thrloop from the cntx_t remedies a thread-safety issue
  when invoking level-3 operations from two or more application threads.
  The race condition existed because the cntx_t, a pointer to which is
  usually queried from the global kernel structure (gks), is supposed to
  be a read-only. However, the previous code would write to the cntx_t's
  thrloop field *after* it had been queried, thus violating its read-only
  status. In practice, this would not cause a problem when a sequential
  application made a multithreaded call to BLIS, nor when two or more
  application threads used the same parallelization scheme when calling
  BLIS, because in either case all application theads would be using
  the same ways of parallelism for each loop. The true effects of the
  race condition were limited to situations where two or more application
  theads used *different* parallelization schemes for any given level-3
  call.
- In remedying the above race condition, the application or calling
  library can now specify the parallelization scheme on a per-call basis.
  All that is required is that the thread encode its request for
  parallelism into the rntm_t struct prior to passing the address of the
  rntm_t to one of the expert interfaces of either the typed or object
  APIs. This allows, for example, one application thread to extract 4-way
  parallelism from a call to gemm while another application thread
  requests 2-way parallelism. Or, two threads could each request 4-way
  parallelism, but from different loops.
- A rntm_t* parameter has been added to the function signatures of most
  of the level-3 implementation stack (with the most notable exception
  being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
  APIs. (A few internal functions gained the rntm_t* parameter even
  though they currently have no use for it, such as bli_l3_packm().)
  This required some internal calls to some of those functions to
  be updated since BLIS was already using those operations internally
  via the expert interfaces. For situations where a rntm_t object is
  not available, such as within packm/unpackm implementations, NULL is
  passed in to the relevant expert interfaces. This is acceptable for
  now since parallelism is not obtained for non-level-3 operations.
- Revamped how global parallelism is encoded. First, the conventional
  environment variables such as BLIS_NUM_THREADS and BLIS_*_NT  are only
  read once, at library initialization. (Thanks to Nathaniel Smith for
  suggesting this to avoid repeated calls getenv(), which can be slow.)
  Those values are recorded to a global rntm_t object. Public APIs, in
  bli_thread.c, are still available to get/set these values from the
  global rntm_t, though now the "set" functions have additional logic
  to ensure that the values are set in a synchronous manner via a mutex.
  If/when NULL is passed into an expert API (meaning the user opted to
  not provide a custom rntm_t), the values from the global rntm_t are
  copied to a local rntm_t, which is then passed down the function stack.
  Calling a basic API is equivalent to calling the expert APIs with NULL
  for the cntx and rntm parameters, which means the semantic behavior of
  these basic APIs (vis-a-vis multithreading) is unchanged from before.
- Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
  and reimplemented, with the function now being able to treat the
  incoming rntm_t in a manner agnostic to its origin--whether it came
  from the application or is an internal copy of the global rntm_t.
- Removed various global runtime APIs for setting the number of ways of
  parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
  as the corresponding "get" functions. The new model simplifies these
  interfaces so that one must either set the total number of threads, OR
  set all of the ways of parallelism for each loop simultaneously (in a
  single function call).
- Updated sandbox/ref99 according to above changes.
- Rewrote/augmented docs/Multithreading.md to document the three methods
  (and two specific ways within each method) of requesting parallelism
  in BLIS.
- Removed old, disabled code from bli_l3_thrinfo.c.
- Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
2018-07-17 18:37:32 -05:00
Field G. Van Zee
87db5c048e Changed usage of virtual microkernel slots in cntx.
Details:
- Changed the way virtual microkernels are handled in the context.
  Previously, there were query routines such as bli_cntx_get_l3_ukr_dt()
  which returned the native ukernel for a datatype if the method was
  equal to BLIS_NAT, or the virtual ukernel for that datatype if the
  method was some other value. Going forward, the context native and
  virtual ukernel slots will both be initialized to native ukernel
  function pointers for native execution, and for non-native execution
  the virtual ukernel pointer will be something else. This allows us
  to always query the virtual ukernel slot (from within, say, the
  macrokernel) without needing any logic in the query routine to decide
  which function pointer (native or virtual) to return. (Essentially,
  the logic has been shifted to init-time instead of compute-time.)
  This scheme will also allow generalized virtual ukernels as a way
  to insert extra logic in between the macrokernel and the native
  microkernel.
- Initialize native contexts (in bli_cntx_ref.c) with native ukernel
  function addresses stored to the virtual ukernel slots pursuant to
  the above policy change.
- Renamed all static functions that were native/virtual-ambiguous, such
  as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt()
  pursuant to the above polilcy change. Those routines now use the
  substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All
  of these functions were static functions defined in bli_cntx.h, and
  most uses were in level-3 front-ends and macrokernels.
- Deprecated anti_pref bool_t in context, along with related functions
  such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's
  panel-block execution is disabled.
2018-06-12 19:38:37 -05:00
Field G. Van Zee
9588625c43 Renamed "next micropanel" macros in _l3_thrinfo.h.
Details:
- Renamed several macros defined in bli_l3_thrinfo.h designed to compute
  the values of a_next and b_next to insert into an auxinfo_t struct in
  level-3 macrokernels. (Previously, the macros did not use a bli_
  prefix.)
- Updated instances of above macro usage within various macrokernels.
2018-05-30 15:19:53 -05:00
Field G. Van Zee
4b36e85be9 Converted function-like macros to static functions.
Details:
- Converted most C preprocessor macros in bli_param_macro_defs.h and
  bli_obj_macro_defs.h to static functions.
- Reshuffled some functions/macros to bli_misc_macro_defs.h and also
  between bli_param_macro_defs.h and bli_obj_macro_defs.h.
- Changed obj_t-initializing macros in bli_type_defs.h to static
  functions.
- Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
  bli_constants.h.
- Whitespace changes in select files (four spaces to single tab).
2018-05-08 14:26:30 -05:00
Field G. Van Zee
75d0d1057d Renamed various datatype-related macros/functions.
Details:
- Renamed the following macros in bli_obj_macro_defs.h and
  bli_param_macro_defs.h:
  - bli_obj_datatype()                 -> bli_obj_dt()
  - bli_obj_target_datatype()          -> bli_obj_target_dt()
  - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
  - bli_obj_set_datatype()             -> bli_obj_set_dt()
  - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
  - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
  - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
  - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
  - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
  - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
- Renamed the following functions in bli_obj.c:
  - bli_datatype_size()                -> bli_dt_size()
  - bli_datatype_string()              -> bli_dt_string()
  - bli_datatype_union()               -> bli_dt_union()
- Removed a pair of old level-1f penryn intrinsics kernels that were no
  longer in use.
2018-04-30 14:57:33 -05:00
Field G. Van Zee
39be59f2a8 Replaced several macros with static function APIs.
Details:
- Reimplemented several sets of get/set-style preprocessor macros with
  static functions, including those in the following frame/base headers:
  auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in
  frame/thread were touched as well: mutex_*, thrcomm, and thrinfo.
2017-12-07 17:35:20 -06:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
bdc0a264d2 Adjusted stride selection of ct in macrokernels.
Details:
- Updated the changes introduced in 618f433 so that the strides of the
  temporary microtile ct used in the macrokernels is determined based
  on the storage preference of the microkernel (via the new functions
  below), rather than the strides of c. In almost all cases, presently,
  this change results in no net effect, as a high-level optimization
  in the _front() functions aligns the storage of c to that of the
  microkernel's preference. However, I encountered some cases where
  this is not always the case in some development code that has yet
  to be committed, and therefore I'm generalizing the framework code
  in advance.
- Defined two new functions in bli_cntx.c:
    bli_cntx_l3_ukr_prefers_rows_dt()
    bli_cntx_l3_ukr_prefers_cols_dt()
  which return bool_t's based on the current micro-kernel's storage
  preferences. For induced methods, the preference of the underlying
  real domain microkernel is returned.
- Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
  by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
  the above functions, rather than querying the preferences of the
  native microkernel directly (which did the wrong thing for induced
  methods).
2016-11-16 14:13:08 -06:00
Field G. Van Zee
35509818cb Added, moved some thread barriers.
Details:
- Removed thread barriers from the end of the loop bodies of
  bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
  and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
  end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
  in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.
2016-08-31 17:34:15 -05:00
Field G. Van Zee
701b9aa3ff Redesigned control tree infrastructure.
Details:
- Altered control tree node struct definitions so that all nodes have the
  same struct definition, whose primary fields consist of a blocksize id,
  a variant function pointer, a pointer to an optional parameter struct,
  and a pointer to a (single) sub-node. This unified control tree type is
  now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
  they represent, such that, for example, packing operations are now
  associated with nodes that are "inline" in the tree, rather than off-
  shoot braches. The original tree for the classic Goto gemm algorithm was
  expressed (roughly) as:

    blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                         |           |
                         -> packb    -> packa

  and now, the same tree would look like:

    blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

  Specifically, the packb and packa nodes perform their respective packing
  operations and then recurse (without any loop) to a subproblem. This means
  there are now two kinds of level-3 control tree nodes: partitioning and
  non-partitioning. The blocked variants are members of the former, because
  they iteratively partition off submatrices and perform suboperations on
  those partitions, while the packing variants belong to the latter group.
  (This change has the effect of allowing greatly simplified initialization
  of the nodes, which previously involved setting many unused node fields to
  NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
  connective structure of control trees. That is, packm nodes are no longer
  off-shoot branches of the main algorithmic nodes, but rather connected
  "inline".
- Simplified control tree creation functions. Partitioning nodes are created
  concisely with just a few fields needing initialization. By contrast, the
  packing nodes require additional parameters, which are stored in a
  packm-specific struct that is tracked via the optional parameters pointer
  within the control tree struct. (This parameter struct must always begin
  with a uint64_t that contains the byte size of the struct. This allows
  us to use a generic function to recursively copy control trees.) gemm,
  herk, and trmm control tree creation continues to be consolidated into
  a single function, with the operation family being used to select
  among the parameter-agnostic macro-kernel wrappers. A single routine,
  bli_cntl_free(), is provided to free control trees recursively, whereby
  the chief thread within a groups release the blocks associated with
  mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
  function pointer stored in the current control tree node (rather than
  index into a local function pointer array). Before being invoked, these
  function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
  families) or trsm_voft (for trsm family) type, which is defined in
  frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
  through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
  from routines as a function of the variant's matrix operands. gemm and
  herk always move forward, while trmm and trsm move in a direction that
  is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
  each of which takes additional arguments and hides complexity in managing
  the difference between the way ranges are computed for the four families
  of operations.
- Simplified level-3 blocked variants according to the above changes, so that
  the only steps taken are:
  1. Query partitioning direction (forwards or backwards).
  2. Prune unreferenced regions, if they exist.
  3. Determine the thread partitioning sub-ranges.
  <begin loop>
    4. Determine the partitioning blocksize (passing in the partitioning
       direction)
    5. Acquire the curren iteration's partitions for the matrices affected
       by the current variants's partitioning dimension (m, k, n).
    6. Call the subproblem.
  <end loop>
- Instantiate control trees once per thread, per operation invocation.
  (This is a change from the previous regime in which control trees were
  treated as stateless objects, initialized with the library, and shared
  as read-only objects between threads.) This once-per-thread allocation
  is done primarily to allow threads to use the control tree as as place
  to cache certain data for use in subsequent loop iterations. Presently,
  the only application of this caching is a mem_t entry for the packing
  blocks checked out from the memory broker (allocator). If a non-NULL
  control tree is passed in by the (expert) user, then the tree is copied
  by each thread. This is done in bli_l3_thread_decorator(), in
  bli_thrcomm_*.c.
- Added a new field to the context, and opid_t which tracks the "family"
  of the operation being executed. For example, gemm, hemm, and symm are
  all part of the gemm family, while herk, syrk, her2k, and syr2k are
  all part of the herk family. Knowing the operation's family is necessary
  when conditionally executing the internal (beta) scalar reset on on
  C in blocked variant 3, which is needed for gemm and herk families,
  but must not be performed for the trmm family (because beta has only
  been applied to the current row-panel of C after the first rank-kc
  iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
  to comform with the new control tree design, and renamed the macro-
  kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
  bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
  frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
  optimization in the various level-3 front-ends was not being applied
  properly when the context indicated that execution would be via an
  induced method. (Before, we always checked the native micro-kernel
  corresponding to the datatype being executed, whereas now we check
  the native micro-kernel corresponding to the datatype's real projection,
  since that is the micro-kernel that is actually used by induced methods.
- Added an option to the testsuite to skip the testing of native level-3
  complex implementations. Previously, it was always tested, provided that
  the c/z datatypes were enabled. However, some configurations use
  reference micro-kernels for complex datatypes, and testing these
  implementations can slow down the testsuite considerably.
2016-08-26 19:04:45 -05:00
Field G. Van Zee
31def12e26 First phase of control tree redesign.
Details:
- These changes constitute the first set of changes in preparation to
  revamping the structure and use of control trees in BLIS. Modifications
  in this commit don't affect the control tree code yet, but rather lay
  the groundwork.
- Defined wrappers for the following functions, where the the wrappers
  each take a direction parameter of a new enumerated type (BLIS_BWD or
  BLIS_FWD), dir_t, and executes the correct underlying function.
  - bli_acquire_mpart_*() and _vpart_*()
  - bli_*_determine_kc_[fb]()
  - bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
- Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
  blocked variants for trmm and trsm, and renamed gemm and herk variants
  accordingly. The direction is now queried via routines such as
  bli_trmm_direct(), which deterines the direction from the implied side
  and uplo parameters. For gemm and herk, it is uncondtionally BLIS_FWD.
- Defined wrappers to parameter-specific macrokernels for herk, trmm, and
  trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
  macrokernel based on the implied parameters. The same logic used to
  choose the dir_t in _direct() functions is used here.
- Simplified the function pointer arrays in _int() functions given the
  consolidation and dir_t querying mentioned above.
- Function signature (whitespace) reformatting for various functions.
- Removed old code in various 'old' directories.
2016-06-30 15:19:20 -05:00
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
e2e9d64a63 Load balance thread ranges for arbitrary diagonals.
Details:
- Expanded/updated interface for bli_get_range_weighted() and
  bli_get_range() so that the direction of movement is specified in the
  function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
  and also so that the object being partitioned is passed instead of an
  uplo parameter. Updated invocations in level-3 blocked variants, as
  appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
  carefully take into account the location of the diagonal when computing
  ranges so that the area of each subpartition (which, in all present
  level-3 operations, is proportional to the amount of computation
  engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
  variants:
    bli_<oper>_prune_unref_mparts_[mnk]()
  where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
  dimension is being partitioned. These routines call a more basic
  routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
  regions from matrices and simultaneously adjust other matrices which
  share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of more the
  new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
  bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
  bli_get_range_*() and bli_get_range_weighted_*() under a range of
  conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
  array field so that dimensions could be queried via index constant
  (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
  macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
  and bli_round_to_mult(), which rounds a value to the nearest multiple
  of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
  bli_obj_row_off(), bli_obj_col_off().
2015-09-24 12:14:03 -05:00
Field G. Van Zee
ee129c6b02 Fixed bugs in _get_range(), _get_range_weighted().
Details:
- Fixed some bugs that only manifested in multithreaded instances of
  some (non-gemm) level-3 operations. The bugs were related to invalid
  allocation of "edge" cases to thread subpartitions. (Here, we define
  an "edge" case to be one where the dimension being partitioned for
  parallelism is not a whole multiple of whatever register blocksize
  is needed in that dimension.) In BLIS, we always require edge cases
  to be part of the bottom, right, or bottom-right subpartitions.
  (This is so that zero-padding only has to happen at the bottom, right,
  or bottom-right edges of micro-panels.) The previous implementations
  of bli_get_range() and _get_range_weighted() did not adhere to this
  implicit policy and thus produced bad ranges for some combinations of
  operation, parameter cases, problem sizes, and n-way parallelism.
- As part of the above fix, the functions bli_get_range() and
  _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
  and _b2t suffixes, similar to the partitioning functions. This is
  an easy way to make sure that the variants are calling the right
  version of each function. The function signatures have also been
  changed slightly.
- Comment/whitespace updates.
- Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
2015-06-10 12:53:28 -05:00
Field G. Van Zee
26a4b8f6f9 Implemented 3m2, 3m3 induced algorithms (gemm only).
Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
  support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
  as an argument instead of computing it locally. Exception: for trmm,
  is_p must be computed locally, since it changes for triangular
  packed matrices. Also exposed is_p in interface to dt-specific
  packm_blk_var2 (and _var1, even though it does not use imaginary
  stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
  they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
  rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
  and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
  kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
  frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
  are obtained from the func_t objects rather than the cpp macros that
  define the function names.
- Updated test/3m4m driver, Makefile, and run script.
2015-04-01 10:44:54 -05:00