Commit Graph

89 Commits

Author SHA1 Message Date
Field G. Van Zee
0bd4169ea7 Fixed context-broken dunnington/penryn kernels.
Details:
- Added missing context parameters to several instances where simpler
  kernels, or reference kernels, are called instead of executing the
  main body code contained in the kernel function in question.
- Renamed axpyv and dotv kernel files to use "opt" instead of "int"
  substring, for consistency with level-1f kernels.
2016-04-11 18:08:32 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
d1f8e5d9b2 Merge pull request #60 from esauvage/master
sgemm µkernel for bulldozer : bug correction for k%4 != 0
2016-04-05 12:21:27 -05:00
Etienne Sauvage
c11d28eed8 cgemm µkernel for bulldozer : bug correction for k%4 != 0 2016-04-02 21:15:48 +02:00
Field G. Van Zee
36c3abb05f Merge pull request #58 from esauvage/master
cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…
2016-03-31 10:26:17 -05:00
Etienne Sauvage
917ce75482 cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel 2016-03-30 22:03:09 +02:00
Field G. Van Zee
3090fff64c Merge pull request #44 from esauvage/master
sgemm micro-kernel for FMA4 instruction set
2016-03-28 12:36:25 -05:00
figual
af92773f4f Updated and improved ARMv8 micro-kernels. 2016-03-23 22:07:02 +01:00
Etienne Sauvage
4ca5d5b1fd sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel 2016-03-01 21:33:01 +01:00
Field G. Van Zee
f0a4f41b5a Fixed unimplemented case in core2 sgemm ukernel.
Details:
- Implemented the "beta == 0" case for general stride output for the
  dunnington sgemm micro-kernel. This case had been, up until now,
  identical to the "beta != 0" case, which does not work when the
  output matrix has nan's and inf's. It had manifested as nan residuals
  in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
  Matthews for reporting this bug.
2015-11-12 15:22:50 -06:00
Francisco Igual
a0a7b85ac3 Fixed incomplete code in the double precision ARMv8 microkernel. 2015-10-27 08:59:15 +00:00
Field G. Van Zee
b489152e11 Use vzeroall in haswell micro-kernels. 2015-10-21 14:53:17 -05:00
Field G. Van Zee
fdfe14f1e1 Added support for Intel Haswell/Broadwell.
Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
  and FMA instructions. (Complex support is currently provided by default
  induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
  build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
  configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.
2015-07-09 13:52:39 -05:00
Francisco D. Igual
483e4d6a3f Adding armv8a configuration and micro-kernels.
Only sgemm micro-kernel is fully functional at this point.
2014-12-07 20:27:49 +01:00
Field G. Van Zee
7bbc95a54f Added new piledriver micro-kernels.
Details:
- Added new micro-kernels for the AMD piledriver architecture (one
  for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
  acknowledging the influence of the corresponding kernel code in
  OpenBLAS.
2014-10-29 10:52:23 -05:00
Tyler Smith
95cdae65d6 Fixed bug in KNC microkernel where k=0 and beta != 1 2014-10-22 16:30:16 -05:00
Field G. Van Zee
d1e86e1876 More minor tweaks to sandybridge/avx micro-kernel.
Details:
- Re-enabled use of b_next for dgemm and cgemm micro-kernels.
2014-10-12 13:43:47 -05:00
Field G. Van Zee
7b6fe4cae5 Minor tweaks to sandybridge/avx micro-kernels.
Details:
- Changed the MC blocksize for zgemm micro-kernel from 128 to 64.
- Removed usage of b_next in all x86_64/avx gemm micro-kernels.
2014-10-12 12:01:51 -05:00
Field G. Van Zee
a6a156e9fe Added cgemm ukernel for avx/sandybridge.
Details:
- Implemented AVX-based cgemm micro-kernel (via GNU extended inline
  assembly syntax).
- Updated sandybridge configuration accordingly.
2014-10-10 14:26:41 -05:00
Field G. Van Zee
6f8575ab25 Added zgemm ukernel for avx/sandybridge.
Details:
- Implemented AVX-based zgemm micro-kernel (via GNU extended inline
  assembly syntax).
- Updated sandybridge configuration accordingly.
2014-10-10 10:01:45 -05:00
Tyler Smith
7a8ad47fb2 Minor changes to knc configuration, including preference row major storage
Also fixed a bug in the knc micro-kernel where it would fail if k == 0
2014-10-08 15:52:13 -05:00
Field G. Van Zee
e80a453784 Fixed bug introduced by bugfix in 25b258d.
Details:
- We actually need to check alignment of lda*sizeof(double) and NOT
  a+lda because in the latter case, alignment could cancel out and
  still allow the optimized code to run when it shouldn't. Thanks
  to Devin for pointing this out.
2014-09-18 10:24:20 -05:00
Field G. Van Zee
25b258d61f Fixed a non-fatal problem with bugfix in a68b316c.
Details:
- The bugfix in a68b316c was inadvertantly checkin alignment of the
  leading dimension itself, rather than the byte size of the leading
  dimension. Now, we simply check alignment of a+lda.
2014-09-18 10:10:49 -05:00
Field G. Van Zee
a68b316ca4 Fixed alignment bugs in level-1f kernels.
Details:
- Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels
  were attempting to compute problems with unaligned leading dimensions
  with optimized code, rather than (correctly) using the reference
  implementations. Thanks to Devin Matthews for reporting this bug.
2014-09-17 11:10:07 -05:00
Tyler Smith
86fc7e4076 Added bulldozer configuration and updated piledriver micro-kernel 2014-09-15 10:35:46 -05:00
Field G. Van Zee
bc1d86b2d4 Sandy Bridge configuration, micro-kernel update.
Details:
- Minor updates to bli_config and bli_kernel.h for sandybridge
  configuration.
- Renamed existing AVX intrinsic-based micro-kernel file to
  bli_gemm_int_d8x4.c.
- Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based
  gemm micro-kernels for single- and double-precision real.
2014-08-07 19:01:20 -05:00
Field G. Van Zee
45692e3ad4 Reverted some accidental changes.
Details:
- Reverted some changes that were unintentionally included in the
  previous commit (9526ce98). Thanks to Tony Kelman for pointing
  this out. (Note: a few select changes were not reverted.)
2014-08-07 13:21:15 -05:00
Field G. Van Zee
9526ce9881 Updated copyright headers of emscripten configuration files. 2014-08-06 14:15:34 -05:00
Field G. Van Zee
c73261f17e More minor cleanups post-copyright update. 2014-07-14 16:23:51 -05:00
Field G. Van Zee
2a09d24463 Reverted power7 symlinks destroyed by sed script.
Details:
- Reverted two symlinks, in kernels/power7/3/test, back to being symlinks
  after recursive-sed.sh mistakenly replaced them with copies of the
  actual files to which they referred. Meant to include this in previous
  commit.
2014-07-14 16:17:09 -05:00
Field G. Van Zee
7ed415824d Updated copyright headers (continued).
Details:
- Inserted "at Austin" into third clause of license declarations.
  Meant to include this change in previous commit.
2014-07-14 16:14:33 -05:00
Field G. Van Zee
5c2c6c8561 Updated copyright headers to contain "at Austin".
Details:
- Updated copyright headers to include "at Austin" in the name of the
  University of Texas.
- Updated the copyright years of a few headers to 2014 (from 2011 and
  2012).
2014-07-14 16:05:03 -05:00
Marat Dukhan
b693b0cddc [SC]AXPY kernels for PNaCl 2014-06-22 13:44:25 -07:00
Marat Dukhan
020a831bc5 Code clean-up in PNaCl port 2014-06-19 00:58:26 -07:00
Marat Dukhan
491be4f91e Optimized dot product kernels for PNaCl 2014-06-19 00:45:44 -07:00
Marat Dukhan
b2ffb4de8b Reformatted PNaCl GEMM kernels 2014-06-15 18:41:30 -04:00
Marat Dukhan
6de2d472d9 CGEMM and ZGEMM kernels for PNaCl 2014-06-15 08:44:31 -04:00
Marat Dukhan
f064711a5e SGEMM and DGEMM kernels for PNaCl 2014-06-15 06:27:37 -04:00
Tyler Smith
00f232f8ed Added single-precision micro-kernel for Knights Corner aka MIC aka Xeon Phi 2014-06-02 13:40:57 -05:00
Field G. Van Zee
3fc60e4914 Fixed ldim alignment bug in core2 gemm ukernel.
Details:
- Fixed a bug in the dunnington/core2 gemm micro-kernels that resulted in
  a segmentation fault if a column-stored matrix's starting address was
  aligned, but its leading dimension was such that its second column was
  unaligned. Basically, the micro-kernel was assuming that aligned load
  instructions were safe when they actually were not. An extra condition
  that checks the alignment of cs_c (ie: the leading dimension in the
  column storage case) has now been added. Thanks to Michael Lehn for
  reporting this bug.
2014-05-21 11:34:42 -05:00
Tyler Michael Smith
20e24430a7 Some fixes for the bgq kernels 2014-04-08 17:50:44 +00:00
Tyler Smith
2b6848b239 Merge http://github.com/flame/blis
Conflicts:
	kernels/bgq/1/bli_axpyv_opt_var1.c
	kernels/bgq/1/bli_dotv_opt_var1.c
2014-04-04 09:54:54 -05:00
Tyler Michael Smith
4e3eb39aca Some fixes to the bgq config
MR and NR for double complex were wrong
Default fusing factor for double precision was wrong as well
2014-04-04 14:50:03 +00:00
Field G. Van Zee
21a0efb33d Fixed follow-up to issue #6. 2014-04-03 16:38:44 -05:00
Field G. Van Zee
c318157a9b Fixed issue #6 (incorrect 'restrict' usage).
Details:
- Fixed improper usage of restrict keyword in axpyv and dotv bgq kernels.
  (However, there may be other instances of similar misuse elsewhere in
  BLIS.) Thanks to Jeff Hammond for reporting this issue.
2014-04-03 16:24:34 -05:00
Field G. Van Zee
b5150a1bf3 Added #include "arm_neon.h" to ARM gemm ukernel.
Details:
- Inserted #include "arm_neon.h" into gemm ukernel source file for
  arm/neon. Thanks to Jean-Michel Hautbois for suggesting this fix.
2014-04-03 12:25:45 -05:00
Field G. Van Zee
d27b4f690c Use generic paths for toolchain in POWER7.
Details:
- Fixed issue #4. Thanks to Jeff Hammond for contributing changes.
2014-04-01 12:57:24 -05:00
Tyler Michael Smith
73b3db5948 Some fixes for the bgq configuration 2014-03-26 15:39:05 +00:00
Field G. Van Zee
fde5f1fdec Added extensive support for configuration defaults.
Details:
- Standard names for reference kernels (levels-1v, -1f and 3) are now
  macro constants. Examples:
    BLIS_SAXPYV_KERNEL_REF
    BLIS_DDOTXF_KERNEL_REF
    BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel
  with a common base name; [sdcz] datatype flavors of each kernel or
  micro-kernel (level-1v, -1f, or 3) may now be named independently.
  This means you can now, if you wish, encode the datatype-specific
  register blocksizes in the name of the micro-kernel functions.
- Any datatype instances of any kernel (1v, 1f, or 3) that is left
  undefined in bli_kernel.h will default to the corresponding reference
  implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
  it will be defined to be BLIS_DGEMM_UKERNEL_REF.
- Developers no longer need to name level-1v/-1f kernels with multiple
  datatype chars to match the number of types the kernel WOULD take in
  a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
  sufficient, as in bli_daxpyv_opt().
- There is no longer a need to define an obj_t wrapper to go along with
  your level-1v/-1f kernels. The framework now prvides a _kernel()
  function which serves as the obj_t wrapper for whatever kernels are
  specified (or defaulted to) via bli_kernel.h
- Developers no longer need to prototype their kernels, and thus no
  longer need to include any prototyping headers from within
  bli_kernel.h. The framework now generates kernel prototypes, with the
  proper type signature, based on the kernel names defined (or defaulted
  to) via bli_kernel.h.
- If the complex datatype x (of [cz]) implementation of the gemm micro-
  kernel is left undefined by bli_kernel.h, but its same-precision real
  domain equivalent IS defined, BLIS will use a 4m-based implementation
  for the datatype x implementations of all level-3 operations, using
  only the real gemm micro-kernel.
2014-02-25 13:34:56 -06:00
Field G. Van Zee
6363a9f658 Added level-3 support for complex via 4m-/3m.
Details:
- Added the ability to induce complex domain level-3 operations via new
  virtual complex micro-kernels which are implemented via only real
  domain micro-kernels. Two new implementations are provided: 4m and 3m.
  4m implements complex matrix multiplication in terms of four real
  matrix multiplications, where as 3m uses only three and thus is
  capable of even higher (than peak) performance. However, the 3m method
  has somewhat weaker numerical properties, making it less desirable
  in general.
- Further refined packing routines, which were recently revamped, and
  added packing functionality for 4m and 3m.
- Some modifications to trmm and trsm macro-kernels to facilitate indexing
  into micro-panels which were packed for 4m/3m virtual kernels.
- Added 4m and 3m interfaces for each level-3 operation.
- Various other minor changes to facilitate 4m/3m methods.
2014-02-19 17:00:52 -06:00