Commit Graph

48 Commits

Author SHA1 Message Date
RuQing Xu
f44149f787 Armv8 Trash New Bulk Kernels
- They didn't make much improvements.
- Can't register row-preferral and column-preferral ukrs at the same time.
  Will break 1m.
2021-10-08 02:35:58 +09:00
RuQing Xu
d7a3372247 Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo 2021-10-07 02:25:14 +09:00
RuQing Xu
2920dde5ac Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo 2021-10-07 02:01:45 +09:00
RuQing Xu
b9da6d55fe Armv8 GEMMSUP Edge Cases Require Signed Ints
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
For safety upon similar strategies in the future,
 change all [mn]_[iter/left] into signed ints.
2021-10-06 12:25:54 +09:00
RuQing Xu
40baf83f0e Armv8 Handle *beta == 0 for GEMMSUP ??r Case. 2021-10-06 01:00:52 +09:00
RuQing Xu
f5c03e9fe8 Armv8 Handle *beta == 0 for GEMMSUP ?rc Case. 2021-10-03 16:51:51 +09:00
RuQing Xu
abc648352c Armv8 Fix 6x8 Row-Maj Ukr
- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
   Old:
blis_dgemm_ukr_c                     6     8   320    36.87   2.43e-17   PASS
blis_dgemm_ukr_c                     6     8   352    40.55   1.04e-17   PASS
blis_dgemm_ukr_c                     6     8   384    44.24   5.68e-17   PASS
blis_dgemm_ukr_c                     6     8   416    41.67   3.51e-17   PASS
blis_dgemm_ukr_c                     6     8   448    34.41   2.94e-17   PASS
blis_dgemm_ukr_c                     6     8   480    42.53   2.35e-17   PASS

   New:
blis_dgemm_ukr_r                     6     8   352    50.69   1.59e-17   PASS
blis_dgemm_ukr_r                     6     8   384    49.15   5.55e-17   PASS
blis_dgemm_ukr_r                     6     8   416    50.44   2.86e-17   PASS
blis_dgemm_ukr_r                     6     8   448    46.92   3.12e-17   PASS
blis_dgemm_ukr_r                     6     8   480    48.08   4.08e-17   PASS
2021-10-03 13:14:19 +09:00
RuQing Xu
820f11a469 Arm Whole GEMMSUP Call Route is Asm/Int Optimized
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
  it's not called by any upper routine.
2021-08-27 13:40:26 +09:00
RuQing Xu
7e2951e61f Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Ref cannot handle panel strides (packed cases) thus cannot be called
from the beginning of `gemmsup` (i.e. cannot be dispatch target of
gemmsup to other sizes.)
2021-08-23 17:06:44 +09:00
RuQing Xu
4fd82b0e93 Header Typo 2021-08-23 05:18:32 +09:00
RuQing Xu
35409ebe67 Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Plus some fix at edges.

TODO: Should ensure that no ref kernel appear in beginning of gemmsup
kernels. As ref does not recognise panel stride.
2021-08-23 04:51:47 +09:00
RuQing Xu
a361492c24 Arm: DGEMMSUP ?rc(rd) Invoke Edge Size 2021-08-23 01:13:39 +09:00
RuQing Xu
e6799b26a6 Arm: Implement GEMMSUP Fallback Method
bli_dgemmsup_rv_armv8a_int_6x4mn
2021-08-21 02:39:38 +09:00
RuQing Xu
7d5903d8d7 Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
2021-08-21 01:55:50 +09:00
RuQing Xu
3df0e9b653 Arm64 8x4 Kernel Use Less Regs 2021-08-13 02:40:06 +09:00
RuQing Xu
4e7e225057 Armv8-A Supplimentary GEMMSUP Sizes for RD 2021-08-13 02:40:06 +09:00
RuQing Xu
c792d506ba Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Suffixed NEON opcode is not supported by GNU assembler
2021-08-13 02:40:06 +09:00
RuQing Xu
ce44735209 Armv8-A Adjust Types for PACKM Kernels
GCC does not have full NEON intrinsics support.
2021-08-13 02:40:06 +09:00
RuQing Xu
8a32d19af8 Armv8-A GEMMSUP-RD 6x8m
Armv8-A now has a complete set of GEMMSUP kernels..
2021-08-13 02:40:06 +09:00
RuQing Xu
afd0fa6ad1 Armv8-A GEMMSUP-RD 6x8n 2021-08-13 02:40:06 +09:00
RuQing Xu
3c5f740514 Armv8-A s/d Packing Kernels Fix Typo
For GCC.
2021-08-13 02:40:06 +09:00
RuQing Xu
49b05df792 Armv8-A Introduced s/d Packing Kernels
Sizes according to the 2014 kernels.
2021-08-13 02:40:06 +09:00
RuQing Xu
c3faf93168 Armv8-A DGEMMSUP 6x8m Kernel
Recommended kernels set:
  ...
  BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  ...
  bli_blksz_init     ( &blkszs[ BLIS_MR ],    -1,     6,    -1,    -1,
                                              -1,     8,    -1,    -1 );
  bli_blksz_init_easy( &blkszs[ BLIS_NR ],    -1,     8,    -1,    -1 );
  ...
2021-08-13 02:40:06 +09:00
RuQing Xu
3efe707b55 Armv8-A DGEMMSUP Adjustments 2021-08-13 02:40:06 +09:00
RuQing Xu
8ed8f5e625 Armv8-A Add More DGEMMSUP
- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's disability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.
2021-08-13 02:40:06 +09:00
RuQing Xu
a9ba79ea14 Armv8-A Add GEMMSUP 4x8n Kernel
- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.
2021-08-13 02:40:06 +09:00
RuQing Xu
df40efe8fb Armv8-A Add Part of GEMMSUP 8x4m Kernel
- Compile w/ both GCC & Clang
- Only block part is implement. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead
2021-08-13 02:40:06 +09:00
RuQing Xu
6639999288 Armv8A DGEMM 4x4 Kernel WIP. Slow
Quite slow.
2021-08-13 02:40:06 +09:00
RuQing Xu
a29c16394c Armv8-A Add 8x4 Kernel WIP
Test result: a bit lower GFlOps than 6x8.
2021-08-13 02:40:04 +09:00
RuQing Xu
5fc93e2806 Armv8A Rename Regs for Safe Darwin Compile
Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop pert.

FP64 does not require changing since x18 is not used there.
2021-05-29 18:44:47 +09:00
RuQing Xu
9f4a4a3cfb Armv8A Rename Regs for Clang Compile: FP32 Part
Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.

Compilation w/ Clang shows warning about x18 reservation, but
compilation itself is OK and all tests got passed.
2021-05-29 17:21:28 +09:00
RuQing Xu
916e1fa8be Armv8A Rename Regs for Clang Compile: FP64 Part
- x7, x8: Used to store address for Alpha and Beta.
  As Alpha & Beta was not used in k-loops, use x0, x1 to load
  Alpha & Beta's addresses after k-loops are completed, since A & B's
  addresses are no longer needed there.
  This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
  drawback since it is done outside k-loops and there are plenty of
  instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
  any longer. Directly loading cs_c and into x10 and scale by 8 spares
  x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is
  also used in a conditional branch so that "cmp x13, #1" needs to be
  modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
  these addresses into x0 and x1 after Alpha & Beta are both loaded,
  since then neigher address of A/B nor address of Alpha/Beta is needed.
2021-05-29 16:46:52 +09:00
RuQing Xu
7fabd896af Asm Flag Mingling for Darwin_Aarch64
Apple+Arm64 requires additional "tagging" of local symbols.
2021-05-29 16:28:03 +09:00
Guodong Xu
28be1a4265 avoid loading twice in armv8a gemm kernel (#403)
This bug happens at a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.

In current design, first row/col. of a and b are loaded twice.

The fix is to rearrange a and b (first row/col.) loading instructions.

Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
2020-05-20 13:22:22 -05:00
Field G. Van Zee
ffce3d632b Renamed armv8a gemm kernel filename.
Details:
- Renamed
    kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c
  to
    kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c.
  This follows the naming convention used by other kernel sets, most
  notably haswell.
2019-04-02 14:40:50 -05:00
Field G. Van Zee
0645f239fb Remove UT-Austin from copyright headers' clause 3.
Details:
- Removed explicit reference to The University of Texas at Austin in the
  third clause of the license comment blocks of all relevant files and
  replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
  comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
  with format of all other comment blocks.
2018-12-04 14:31:06 -06:00
Field G. Van Zee
4fa4cb0734 Trivial comment header updates.
Details:
- Removed four trailing spaces after "BLIS" that occurs in most files'
  commented-out license headers.
- Added UT copyright lines to some files. (These files previously had
  only AMD copyright lines but were contributed to by both UT and AMD.)
- In some files' copyright lines, expanded 'The University of Texas' to
  'The University of Texas at Austin'.
- Fixed various typos/misspellings in some license headers.
2018-08-29 18:06:41 -05:00
Field G. Van Zee
5140ee3424 Updated types of bli_is_[un]aligned_to() functions.
Details:
- Changed the void* arguments of the following static functions:
    bli_is_aligned_to()
    bli_is_unaligned_to()
    bli_offset_past_alignment()
  to siz_t, and the return type of bli_offset_past_alignment() from
  guint_t to siz_t. This allows for more versatile usage of these
  functions (e.g. when aligning both pointers and leading dimension).
- Updated all invocations of these functions, mostly in kernels/penryn
  but also in kernels/bgq, to include explicit typecasts to siz_t when
  pointer arguments are passed in.
- Thanks to Devin Matthews for pointing out this potential bug (via issue
  #211).
- Deleted a few trailing spaces in various penryn kernels.
- Removed duplicate instances of the words "derived" and "THEORY" from
  various kernel license headers, likely from a malformed recursive sed
  performed long ago.
2018-05-23 16:56:14 -05:00
Francisco Igual
f28a152938 Fixed clobber list bug in ARMv8 ukernel 2018-05-17 09:26:14 +00:00
Field G. Van Zee
2e31dd7852 Inserted missing integer typecasting into ukernels.
Details:
- Inserted missing safeguards into most microkernels to ensure that the
  integers read by the microkernel's assembly instructions are of the
  appropriate size. In many cases, this bug was going undetected likely
  because the compiler was inserting zero padding before the integers
  in the calling function, allowing the assembly code to read 64-bits
  in a way that did not corrupt the "lower" 32 integer bits with garbage
  in the higher bits. Thanks to Francisco Igual and Devangi Parikh for
  finding this issue.
2018-05-16 17:28:33 -05:00
Field G. Van Zee
453deb2906 Implemented runtime kernel management.
Details:
- Reworked the build system around a configuration registry file, named
  config_registry', that identifies valid configuration targets, their
  constituent sub-configurations, and the kernel sets that are needed by
  those sub-configurations. The build system now facilitates the building
  of a single library that can contains kernels and cache/register
  blocksizes for multiple configurations (microarchitectures). Reference
  kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
  config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
  in determining which sub-configurations (CONFIG_LIST) and kernel sets
  (KERNEL_LIST) are included in the library, and which make_defs.mk files'
  CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
  functions into a standard format that includes the kernel set name
  (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
  kernels sub-directory. These files exist to provide prototypes for the
  kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
  This directory includes a new source file, bli_cntx_ref.c (compiled on
  a per-configuration basis), that defines the code needed to initialize
  a reference context and a context for induced methods for the
  microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
  variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
  basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
  #defines for the config (family) name, the sub-configurations that are
  associated with the family, and the kernel sets needed by those
  sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
  what remains to new header files named "bli_arch_<configname>.h", which
  are conditionally #included from a new header bli_arch.h. These files
  are still needed to set library-wide parameters such as custom
  malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
  The files contain a function, named the same as the file, that initializes
  a "native" context for a particular configuration (microarchitecture). The
  idea is that optimized kernels, if available, will be initialized into
  these contexts. Other fields will retain pointers to reference functions,
  which will be compiled on a per-configuration basis. These bli_cntx_init_*()
  functions will be called during the initialization of the global kernel
  structure. They are thought of as initializing for "native" execution, but
  they also form the basis for contexts that use induced methods. These
  functions are prototyped, along with their _ref() and _ind() brethren, by
  prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
  identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
  structures (pointers to cntx_t, actually). The first dimension is indexed
  over arch_t and the inner dimension is the ind_t (induced method) for
  each microarchitecture. When a microarchitecture (configuration) is
  "registered" at init-time, the inner array for that configuration in the
  2D array is initialized (and allocated, if it hasn't been already). The
  cntx_t slot for BLIS_NAT is initialized immediately and those for other
  induced method types are initialized and cached on-demand, as needed. At
  cntx_t registration, we also store function pointers to cntx_init functions
  that will initialize (a) "reference" contexts and (b) contexts for use with
  induced methods. We don't cache the full contexts for reference contexts
  since they are rarely needed. The functions that initialize these two kinds
  of contexts are generated automatically for each targeted sub-configuration
  from cpp-templatized code at compile-time. Induced method contexts that
  need "stage" adjustments can still obtain them via functions in
  bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
  the level-1f, level-1v, and packm kernels, and for converting a native
  context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
  macros in bli_kernel_macro_defs.h to being runtime checks defined in
  bli_check.c and called from bli_gks_register_cntx() at the time that the
  global kernel structure's internal context is initialized for a given
  microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
  the previous operation-level cntx_t_init()/_finalize() invocations.
  Instead, we now query the gks for a suitable context, usually via
  bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
  hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
  into a single kernel that will branch on the schema and support packing
  to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
  and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
  bli_malloc_intl(). Useful when we wish to allocate and initialize to
  zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
  bli_cntx.h into static functions.
2017-10-18 13:29:32 -05:00
Field G. Van Zee
f484c6cd43 Whitespace reformatting to armv8a kernels file.
Details:
- Updated formatting of function signature/header in
  kernels/armv8a/3/bli_gemm_opt_4x4.c.
2017-03-17 12:07:27 -05:00
Francisco Igual
7f31a6307b Fixed missing cntx argument in ARMv8 microkernels. 2016-11-27 14:40:47 +01:00
Devin Matthews
08f1d6b6fa Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit. 2016-07-22 13:44:37 -05:00
Field G. Van Zee
537a1f4f85 Implemented runtime contexts and reorganized code.
Details:
- Retrofitted a new data structure, known as a context, into virtually
  all internal APIs for computational operations in BLIS. The structure
  is now present within the type-aware APIs, as well as many supporting
  utility functions that require information stored in the context. User-
  level object APIs were unaffected and continue to be "context-free,"
  however, these APIs were duplicated/mirrored so that "context-aware"
  APIs now also exist, differentiated with an "_ex" suffix (for "expert").
  These new context-aware object APIs (along with the lower-level, type-
  aware, BLAS-like APIs) contain the the address of a context as a last
  parameter, after all other operands. Contexts, or specifically, cntx_t
  object pointers, are passed all the way down the function stack into
  the kernels and allow the code at any level to query information about
  the runtime, such as kernel addresses and blocksizes, in a thread-
  friendly manner--that is, one that allows thread-safety, even if the
  original source of the information stored in the context changes at
  run-time; see next bullet for more on this "original source" of info).
  (Special thanks go to Lee Killough for suggesting the use of this kind
  of data structure in discussions that transpired during the early
  planning stages of BLIS, and also for suggesting such a perfectly
  appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
  structure" (gks). This data structure and API will allow the caller to
  initialize a context with the kernel addresses, blocksizes, and other
  information associated with the currently active kernel configuration.
  The currently active kernel configuration within the gks cannot be
  changed (for now), and is initialized with the traditional cpp macros
  that define kernel function names, blocksizes, and the like. However,
  in the future, the gks API will be expanded to allow runtime management
  of kernels and runtime parameters. The most obvious application of this
  new infrastructure is the runtime detection of hardware (and the
  implied selection of appropriate kernels). With contexts in place,
  kernels may even be "hot swapped" at runtime within the gks. Once
  execution enters a level-3 _front() function, the memory allocator will
  be reinitialized on-the-fly, if necessary, to accommodate the new
  kernels' blocksizes. If another application thread is executing with
  another (previously loaded) kernel, it will finish in a deterministic
  fashion because its kernel information was loaded into its context
  before computation began, and also because the blocks it checked out
  from the internal memory pools will be unaffected by the newer threads'
  reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
  the code enabling use of induced methods for complex domain matrix
  multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
  those APIs' functionality is now mostly subsumed within the global
  kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
  that will reinitialize a memory pool if the necessary pool block size
  has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
  bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
  usage of contexts where appropriate to communicate cache and register
  blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
  the context and/or the global kernel structure:
  - Removed blocksize object pointers (blksz_t*) fields from all control
    tree node definitions and replaced them with blocksize id (bszid_t)
    values instead, which may be passed into a context query routine in
    order to extract the corresponding blocksize from the given context.
  - Removed micro-kernel function pointers (func_t*) fields from all
    control tree node definitions. Now, any code that needs these function
    pointers can query them from the local context, as identified by a
    level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
    level-1v kernel id (l1vkr_t).
  - Removed blksz_t object creation and initialization, as well as kernel
    function object creation and initialization, from all operation-
    specific control tree initialization files (bli_*_cntl.c), since this
    information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
  blocksize multiples for each blocksize id (bszid_t) in the context
  object.
- Removed the bool_t's that were required when a func_t was initialized.
  These bools are meant to allow one to track the micro-kernel's storage
  preferences (by rows or columns). This preference is now tracked
  separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
  files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
  util directories, but has the most obvious effect of allowing BLIS
  to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
  in an attempt to reduce overhead for memory-bound operations. This
  includes removal of default use of object-based variants for level-2
  operations. Now, by default, level-2 operations will directly call a
  low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in blk_blksz.c (renamed from
  bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
  respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
  heterogeneous bool_t's (one for each floating-point datatype), in the
  same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
  and BLIS_SIMD_SIZE. These values are needed in order to compute a third
  new parameter, which may be set indirectly via the aforementioned
  macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
  statically allocate memory in macro-kernels and the induced methods'
  virtual kernels to be used as temporary space to hold a single
  micro-tile. These values are now output by the testsuite. The default
  value of BLIS_STACK_BUF_MAX_SIZE is computed as
  "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
  embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
  and "haswell," respectively, and gave more consistent and meaningful
  names to many kernel files (as well as updating their interfaces to
  conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
  context for test modules that need those values: axpyf, dotxf,
  dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
  more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
  for level-1m-like operations on small matrices) in frame/include/level0
  to use more obscure local variable names in an effort to avoid variable
  shaddowing. (Thanks to Devin Matthews for pointing these gcc warnings,
  which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
  of scalm. The semantic meaning of the conj argument is to optionally
  allow implicit conjugation of the scalar prior to being populated into
  the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
  that this does not preclude supporting mixed types via the object APIs,
  where it produces absolutely zero API code bloat.
2016-04-11 17:21:28 -05:00
figual
af92773f4f Updated and improved ARMv8 micro-kernels. 2016-03-23 22:07:02 +01:00
Francisco Igual
a0a7b85ac3 Fixed incomplete code in the double precision ARMv8 microkernel. 2015-10-27 08:59:15 +00:00
Francisco D. Igual
483e4d6a3f Adding armv8a configuration and micro-kernels.
Only sgemm micro-kernel is fully functional at this point.
2014-12-07 20:27:49 +01:00