Commit Graph

157 Commits

Author SHA1 Message Date
praveeng
f2e7ea113a conflicts merge for bli_kernel.h
Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0
2016-10-06 12:35:30 +05:30
sthangar
133983c36f code clean up in bli_kernel.h
Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d
2016-10-06 11:26:22 +05:30
Field G. Van Zee
b922d75634 Avoid compiling BLAS/CBLAS files when disabled.
Details:
- Updated the top-level Makefile, build/config.mk.in template, and
  configure script so that object files corresponding to source files
  belonging to the BLAS compatibility layer are not compiled (or archived)
  when the compatibility layer is disabled. (Same for CBLAS.) Thanks
  to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
  of converting (overwriting) some, such as enable_blas2blis and
  enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
  now stored in new variables that live alongside the originals (with the
  suffix "_01").  This is convenient since some values need to be
  sed-substituted into the config.mk.in template, which requires "yes" or
  "no", while some need to be written to the bli_config.h.in template,
  which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels perfer
  row storage, (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1
2016-09-15 12:24:07 +05:30
Pradeep Rao
69826110ba Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging 2016-09-14 03:26:25 -04:00
Field G. Van Zee
121c39d455 Added complex gemm micro-kernels for haswell.
Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
  architectures. As with their real domain brethren, these kernels perfer
  row storage, (though this doesn't affect most users due to high-level
  optimizations in most level-3 operations that induce a transpose to
  whatever storage preference the kernel may have).
2016-09-05 13:11:42 -05:00
sthangar
64598ee4cf fixed the symlink issue
Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955
2016-08-31 12:54:50 +05:30
sthangar
fdc6639023 Placed 1 and 1f AMD optimized AVX routines under zen folder
Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328
2016-08-29 10:43:38 +05:30
Kiran Varaganti
a58dd35ed7 Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision
Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9
2016-08-26 14:55:12 +05:30
praveeng
1aa77dfc1d Merge master code as on 2016_07_21 to amd-staging branch by praveeng
Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344
2016-07-21 14:23:41 +05:30
sthangar
9101a9c880 Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels
Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b
2016-07-13 16:51:14 +05:30
Field G. Van Zee
096895c5d5 Reorganized code, APIs related to multithreading.
Details:
- Reorganized code and renamed files defining APIs related to multithreading.
  All code that is not specific to a particular operation is now located in a
  new directory: frame/thread. Code is now organized, roughly, by the
  namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
  thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
  also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
  functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
  We now have the following API namespaces:
  - bli_thrinfo_*(): functions related to thrinfo_t objects
  - bli_thrcomm_*(): functions related to thrcomm_t objects.
  - bli_thread_*(): general-purpose functions, such as initialization,
    finalization, and computing ranges. (For now, some macros, such as
    bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
    bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
    appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
  consistent with the thread-related macros that were also renamed).
- Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
  #undef was a temporary fix to some macro defaults which were being applied
  in the wrong order, which was recently fixed.
2016-06-06 13:32:04 -05:00
Tyler Smith
4dcd37eb1b fixing knc simd align size 2016-05-10 16:28:59 -05:00
Field G. Van Zee
c3a4d39d03 Updates to haswell gemm micro-kernels.
Details:
- Added two new sets of [sd]gemm micro-kernels for haswell architectures,
  one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
- Changed the haswell configuration to use the 6x16/6x8 micro-kernels
  by default.
- Updated various Makefiles, in test, test/3m4m, and testsuite.
2016-05-04 17:22:56 -05:00
Field G. Van Zee
0b01d355ae Miscellaneous cleanups, fixes to recent commits.
Details:
- Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only
  manifested when non-reference level-1f kernels were used.
- Added an #undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
  configuration to prevent a compile-time warning until I can figure out
  the proper permanent fix.
- Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
  path (into 'other' directory). _ref_var2 is used by default, which is
  the variant that is built on axpyf and dotxf instead of dotaxpyv.
- Removed section of frame/include/bli_config_macro_defs.h pertaining to
  mixed datatype support.
2016-04-27 15:21:10 -05:00
Devin Matthews
0e1a9821d8 Add configure options and generate bli_config.h automatically.
Options to configure have been added for:
- Setting the internal BLIS and BLAS/CBLAS integer sizes.
- Enabling and disabling the BLAS and CBLAS layers.

Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.

Lastly, support for OMP in clang has been added (closes #56).
2016-04-19 11:44:37 -05:00
Field G. Van Zee
dd62080cea Compile-time fix to bgq l1f kernels.
Details:
- Fixed an old reference to bli_daxpyf_fusefac, which no longer exists,
  by replacing it with the axpyf fusing factor (8), and cleaned up the
  relevant section of config/bgq/bli_kernel.h.
- Removed most of the details of the level-3 kernels from the template
  kernel code in config/template/kernels/3 and replaced it with a
  reference to the relevant kernel wiki maintained on the BLIS github
  website.
2016-04-15 11:15:41 -05:00
Field G. Van Zee
f756dbfa0d Removed stale #include from bgq configuration.
Details:
- Removed an old #include statement ("bli_gemm_8x8.h") from the
  bli_kernel.h file in the bgq configuration. It turns out this
  file was no longer needed even prior to 537a1f4.
2016-04-13 11:25:33 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
Field G. Van Zee
36c3abb05f Merge pull request #58 from esauvage/master
cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…
2016-03-31 10:26:17 -05:00
Etienne Sauvage
917ce75482 cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel 2016-03-30 22:03:09 +02:00
Field G. Van Zee
64b41fa554 Merge pull request #54 from devinamatthews/more_config_opts
More config opts
2016-03-29 15:19:41 -05:00
Devin Matthews
0171ad5899 Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW. 2016-03-28 13:55:06 -05:00
Field G. Van Zee
3090fff64c Merge pull request #44 from esauvage/master
sgemm micro-kernel for FMA4 instruction set
2016-03-28 12:36:25 -05:00
Devin Matthews
8442d65c9e Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures. 2016-03-25 20:06:48 -05:00
Devin Matthews
76099f20be Add threading option to configure. 2016-03-25 17:22:58 -05:00
Devin Matthews
2bd036f1f9 Fix configuration issue where instruction set flags are not specified for debug builds. 2016-03-25 12:16:49 -05:00
figual
af92773f4f Updated and improved ARMv8 micro-kernels. 2016-03-23 22:07:02 +01:00
Devin Matthews
d226dfa051 Add several changes to the build system.
1) Add -- options.
2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
4) Add make V=[0,1] option to control build verbosity.
2016-03-05 16:18:14 -06:00
Etienne Sauvage
4ca5d5b1fd sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel 2016-03-01 21:33:01 +01:00
Etienne Sauvage
627d59b5ba symbolic link for bulldozer configuration to kernels 2016-02-29 21:53:12 +01:00
Tony Kelman
3d0fae810d Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer
to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI
2016-02-25 23:24:03 -08:00
Francisco Igual
a0a7b85ac3 Fixed incomplete code in the double precision ARMv8 microkernel. 2015-10-27 08:59:15 +00:00
Field G. Van Zee
ef0fbbbdb6 Merge branch 'master' of github.com:flame/blis 2015-07-09 13:54:54 -05:00
Field G. Van Zee
fdfe14f1e1 Added support for Intel Haswell/Broadwell.
Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
  and FMA instructions. (Complex support is currently provided by default
  induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
  build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
  configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.
2015-07-09 13:52:39 -05:00
Field G. Van Zee
d4b891369c Added 'carrizo' configuration.
Details:
- Added a new configuration for AMD Excavator-based hardware also known
  as Carrizo when referring to the entire APU. This configuration uses
  the same micro-kernels as the piledriver, but with different
  cache blocksizes.
2015-07-07 10:06:53 -05:00
Field G. Van Zee
7cd01b71b5 Implemented dynamic allocation for packing buffers.
Details:
- Replaced the old memory allocator, which was based on statically-
  allocated arrays, with one based on a new internal pool_t type, which,
  combined with a new bli_pool_*() API, provides a new abstract data
  type that implements the same memory pool functionality but with blocks
  from the heap (ie: malloc() or equivalent). Hiding the details of the
  pool in a separate API also allows for a much simpler bli_mem.c family
  of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
  sane defaults for the values previously found in bli_config. Those
  values can be overridden by #defining them in bli_config.h the same
  way kernel defaults can be overridden in bli_kernel.h. This file most
  resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
  defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
  blocks in the memory pool. Also added a corresponding query routine to
  the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
  reflection, it seems that the goal of more predictable L1 cache
  replacement behavior is outweighed by the harm caused by non-contiguous
  micro-panels when k % kc != 0. I honestly don't think anyone will even
  miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
  bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
  given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
  which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
  modules. Thanks to Devangi Parikh for pointing out these
  miscalculations.
- Comment, whitespace changes.
2015-06-19 11:31:53 -05:00
Field G. Van Zee
309717c8eb More tweaks to test/3m4m, configurations.
Details:
- Fixed incorrect number of mc_x_kc memory blocks in
  sandybridge/bli_config.h.
- Enabled OpenMP multithreding in piledriver/bli_config.h.
- More updates to test/3m4m driver files.
2015-04-03 19:28:49 -05:00
Field G. Van Zee
349e075ad6 Tweaks to sandybridge config, test/3m4m driver.
Details:
- Enable OpenMP support by default in sandybridge's bli_config.h.
- Reorganized sandybridge's bli_kernel.h.
- Updated 3m4m Makefile, runme.sh to also test MKL implementation.
2015-04-02 18:12:28 -05:00
Tyler Michael Smith
36a9b7b743 reduced the default number of MC by KC blocks for bgq 2014-12-17 21:55:50 +00:00
Francisco D. Igual
483e4d6a3f Adding armv8a configuration and micro-kernels.
Only sgemm micro-kernel is fully functional at this point.
2014-12-07 20:27:49 +01:00
Field G. Van Zee
e56e61438f Minor cleanups to bli_threading.h and friends.
Details:
- No longer need to define BLIS_ENABLE_MULTITHREADING manually in
  bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or
  BLIS_ENABLE_PTHREADS is defined.
- Added sanity check to prevent both BLIS__ENABLE_OPENMP and
  BLIS_ENABLE_PTHREADS from being enabled simultaneously.
- Reorganization of bli_threading*.h header files, which led to
  simplification of threading-related part of blis.h.
- added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk
  file.
2014-11-26 17:20:35 -06:00
Field G. Van Zee
3be2744cbe Update to template gemm ukernel comments.
Details:
- Updated comments on alignment of a1 and b1 to match wiki.
2014-11-21 12:28:08 -06:00
Timmy
694029d9d7 #define PASTEF773 required by cblas compatiility layer 2014-11-19 15:25:14 -06:00
Field G. Van Zee
58796abda6 Removed KC constraint comments from _kernel.h files.
Details:
- Since 4674ca8c, the constraint that KC be a multiple of both MR and
  NR have been relaxed, and thus it was time to remove the comments
  from the top of the bli_kernel.h files of all configurations.
2014-11-06 14:31:52 -06:00
Field G. Van Zee
7bbc95a54f Added new piledriver micro-kernels.
Details:
- Added new micro-kernels for the AMD piledriver architecture (one
  for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
  acknowledging the influence of the corresponding kernel code in
  OpenBLAS.
2014-10-29 10:52:23 -05:00
Field G. Van Zee
7b6fe4cae5 Minor tweaks to sandybridge/avx micro-kernels.
Details:
- Changed the MC blocksize for zgemm micro-kernel from 128 to 64.
- Removed usage of b_next in all x86_64/avx gemm micro-kernels.
2014-10-12 12:01:51 -05:00
Field G. Van Zee
a6a156e9fe Added cgemm ukernel for avx/sandybridge.
Details:
- Implemented AVX-based cgemm micro-kernel (via GNU extended inline
  assembly syntax).
- Updated sandybridge configuration accordingly.
2014-10-10 14:26:41 -05:00
Field G. Van Zee
6f8575ab25 Added zgemm ukernel for avx/sandybridge.
Details:
- Implemented AVX-based zgemm micro-kernel (via GNU extended inline
  assembly syntax).
- Updated sandybridge configuration accordingly.
2014-10-10 10:01:45 -05:00
Tyler Smith
7a8ad47fb2 Minor changes to knc configuration, including preference row major storage
Also fixed a bug in the knc micro-kernel where it would fail if k == 0
2014-10-08 15:52:13 -05:00
Tyler Smith
a2b59a37f1 Fixed make defs so that they actually compile for bulldozer 2014-09-15 10:44:44 -05:00