Commit Graph

2182 Commits

Author SHA1 Message Date
RuQing Xu
08174a2f6e Evict <arm_sve.h> Requirement for SVE GEMM
For 8 <= GCC < 10 compatibility.
2022-01-01 09:29:11 -06:00
Devin Matthews
54fa28bd84 Move edge cases to gemm ukr; more user-custom mods. (#583)
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels' 
  directory to take m and n dimensions, and implemented edge-case 
  handling within those microkernels via a collection of new C 
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride 
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of 
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather than one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params 
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
2021-12-24 08:00:33 -06:00
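The edge-case handling described in the commit above can be pictured with a small reference-style sketch. This is not the BLIS microkernel or its API: the function name `ref_gemm_ukr`, the blocksizes `MR` and `NR`, and the argument order are all assumptions made for illustration. The idea shown is that the kernel always computes a full MR x NR product into a temporary tile and then writes back only the leading m x n part, using general row and column strides for C.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* hypothetical register blocksizes */

/* Sketch: C(m x n) := beta * C + alpha * A * B, where A is a packed
   MR x k micropanel (column-stored) and B is a packed k x NR micropanel
   (row-stored). Only the leading m x n part of the tile is stored. */
static void ref_gemm_ukr(size_t m, size_t n, size_t k,
                         double alpha, const double *a, const double *b,
                         double beta, double *c, size_t rs_c, size_t cs_c)
{
    double ab[MR * NR] = { 0.0 };  /* full microtile accumulator */

    assert(m <= MR && n <= NR);

    /* Always compute the full MR x NR product of the packed panels. */
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                ab[i + j * MR] += a[i + p * MR] * b[j + p * NR];

    /* Store only m x n elements; the same loop covers general stride. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double *cij = &c[i * rs_c + j * cs_c];
            double old = (beta == 0.0) ? 0.0 : beta * (*cij);
            *cij = old + alpha * ab[i + j * MR];
        }
}
```

A caller would pass m = MR and n = NR for interior microtiles and smaller values along the bottom and right edges of C.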
Kiran
961d9d509d Re-add BLIS_ENABLE_ZEN_BLOCK_SIZES macro for 'zen'.
Details:
- Added previously-deleted cpp macro block to bli_cntx_init_zen.c 
  targeting the Naples microarchitecture that enabled different cache 
  blocksizes when the number of threads exceeds 16. This commit 
  represents PR #573.
2021-12-07 15:30:38 -06:00
Devin Matthews
cf7d616a2f Enable user-customized packm ukernel/variant. (#549)
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while  
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to 
  replace the diagonal offset with panel offsets, and changed strides 
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the 
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a 
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to 
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the 
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from 
  its original pointer. This function also sets the object's root field 
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static 
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
2021-12-02 17:10:03 -06:00
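Two ideas from the commit above, byte (char*) pointers in place of typed pointers and packing B in a transposed state, can be sketched together. The function `pack_panel_bytes` and its parameters are hypothetical and far simpler than bli_packm_blk_var1(); the sketch only assumes that an element can be copied by size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: copy an m x n matrix of elements of size elem_size from a
   source with row stride rs and column stride cs (both in elements)
   into a column-stored destination with leading dimension ldp.
   Because all addressing is done in bytes, one function serves every
   datatype. */
static void pack_panel_bytes(size_t m, size_t n, size_t elem_size,
                             const void *src, size_t rs, size_t cs,
                             void *dst, size_t ldp)
{
    const char *s = src;
    char *d = dst;

    assert(ldp >= m);

    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            memcpy(d + (i + j * ldp) * elem_size,
                   s + (i * rs + j * cs) * elem_size,
                   elem_size);
}
```

To pack a k x n matrix B as its transpose, a caller would swap the dimensions and the strides, that is, pass m = n, n = k, rs = cs_b, and cs = rs_b, so that A and B go through the same code path.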
Field G. Van Zee
e229e049ca Added recu-sed.sh script to 'build' directory.
Details:
- Added a recursive sed script to the 'build' directory.
2021-12-01 17:36:22 -06:00
Field G. Van Zee
12c66a4acc Minor updates to README.md, docs/Addons.md.
Details:
- Add additional mentions of addons to README.md, including in the
  "What's New" section.
- Removed mention of sandboxes from the long list of advantages
  provided by BLIS.
- Very minor description update to opening line of Addons.md.
2021-11-19 14:43:53 -06:00
Field G. Van Zee
a4bc03b990 Brief mention/link to Addons.md in README.md.
Details:
- Add a blurb about the new addons feature to the "Documentation for
  BLIS developers" section of the README.md, which also links to the
  Addons.md document.
2021-11-19 13:29:00 -06:00
Field G. Van Zee
b727645eb7 Merge branch 'dev' 2021-11-19 13:22:09 -06:00
Madan mohan Manokar
9be97c150e Support all four dts in test/test_her[2][k].c (#578)
Details:
- Replaced the hard-coded calls to double-precision real syr, syr2, 
  syrk, and syr2k in the corresponding standalone test drivers in the 
  'test' directory with conditional branches that will call the 
  appropriate BLAS interface depending on which datatype is enabled. 
  Thanks to Madan mohan Manokar for this improvement.
- CREDITS file update.
2021-11-17 13:16:46 -06:00
Dipal M Zambare
26e4b6b293 Added support for AMD's Zen3 microarchitecture.
Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
  microarchitecture (#561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
  make_defs.mk files. The clang and AOCC version detection now happens
  in configure, not in the subconfigurations' makefile fragments. That
  is, we've added logic to configure that detects the version of
  clang/AOCC, outputs an appropriate variable to config.mk
  (ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the
  makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
  substitution anchor) to communicate whether the gcc version is older
  than 10.1.0, and use this variable to check for recent enough versions
  of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
  make_defs.mk so that the files are self-contained, harmonizing the
  format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
  reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
  previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
  completely disjoint from the models checked by bli_cpuid_is_zen2()
  (0x30 ~ 0xff). This is normally necessary because Zen and Zen2
  microarchitectures share the same family (23, or 0x17), and so the
  model code is the only way to differentiate the two. But in our case,
  fixing the model range for zen *wasn't* actually necessary since we
  checked for zen2 first, and therefore the wide zen range acted like
  the 'else' of an 'if-else' statement. That said, the change helps
  improve clarity for the reader by encoding useful knowledge, which
  was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
  Note that support for zen, zen2, and zen3 is now present, and while
  all three microarchitectures have identical instruction sets from
  the perspective of BLIS microkernels, they each correspond to
  different subconfigurations and therefore merit separate testing.
  Thanks to Devin Matthews for his guidance in hacking these files as
  slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
  Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
  builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
  repository on GitHub rather than on Intel's website. This change was
  made in an attempt to circumvent recent troubles with Travis CI not
  being able to download the SDE directly from Intel's website via curl.
  Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
  Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
  which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
  older (bulldozer, piledriver, steamroller, and excavator)
  microarchitectures and moved those same subconfigs out of the amd64
  umbrella family. However, x86_64 retains amd64_legacy as a constituent
  member.
- Fixed a bug in configure related to the building of the so-called
  config list. When processing the contents of config_registry,
  configure creates a series of structures and lists that allow for
  various mappings related to configuration families, subconfigs, and
  kernel sets. Two of those lists are built via substitution of
  umbrella families with their subconfig members, and one of those
  lists was improperly performing the substitution in a way that would
  erroneously match on partial umbrella family names. That code was
  changed to match the code that was already doing the substitution
  properly, via substitute_words(). Also added comments noting the
  importance of using substitute_words() in both instances.
- Comment updates.
2021-11-17 13:02:00 -06:00
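The disjoint model ranges mentioned in the commit above can be written down as a short sketch. The family value (0x17) and the model ranges (0x00 to 0x2f for Zen, 0x30 to 0xff for Zen2) come from the commit message; the function names and signatures are assumptions and do not match bli_cpuid_is_zen() or bli_cpuid_is_zen2(), which also inspect CPU feature flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: inclusive range test on the CPUID model number. */
static bool model_in_range(uint32_t model, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    return lo <= model && model <= hi;
}

/* Zen and Zen2 share family 0x17, so only the model separates them. */
static bool sketch_is_zen(uint32_t family, uint32_t model)
{
    return family == 0x17 && model_in_range(model, 0x00, 0x2f);
}

static bool sketch_is_zen2(uint32_t family, uint32_t model)
{
    return family == 0x17 && model_in_range(model, 0x30, 0xff);
}
```

With disjoint ranges, the result no longer depends on the order in which the two checks are made.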
Field G. Van Zee
74c0c62221 Reverted cbc88fe.
Details:
- Reverted the annotation of some markdown code blocks with 'bash'
  after realizing that the in-browser syntax highlighting was not
  worthwhile.
2021-11-16 16:06:33 -06:00
Field G. Van Zee
cbc88feb51 Marked some markdown shell code blocks as 'bash'.
Details:
- Annotated the code blocks that represent shell commands and output as
  'bash' in README.md and BuildSystem.md.
2021-11-16 16:02:39 -06:00
Field G. Van Zee
78cd1b0451 Added 'Example Code' section to README.md.
Details:
- Inserted a new 'Example Code' section into the README.md immediately
  after the 'Getting Started' section. Thanks to Devin Matthews for
  recommending this addition.
- Moved the 'Performance' section of the README down slightly so that it
  appears after the 'Documentation' section.
2021-11-16 15:53:40 -06:00
Field G. Van Zee
7bde468c6f Added support for addons.
Details:
- Implemented a new feature called addons, which are similar to
  sandboxes except that there is no requirement to define gemm or any
  other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
  for requesting an addon be included within a BLIS build. configure now
  outputs the list of enabled addons into config.mk. It also outputs the
  corresponding #include directives for the addons' headers to a new
  companion to the bli_config.h header file named bli_addon.h. Because
  addons may wish to make use of existing BLIS types within their own
  definitions, the addons' headers must be included sometime after that
  of bli_config.h (which currently is #included before bli_type_defs.h).
  This is why the #include directives needed to go into a new top-level
  header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
  build with them, and what assumptions their authors should keep in
  mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
  as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
  functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.
2021-11-13 16:39:37 -06:00
Meghana-vankadari
7bc8ab485e Added BLAS/CBLAS APIs for axpby, gemm_batch. (#566)
Details:
- Expanded the BLAS compatibility layer to include support for 
  ?axpby_() and ?gemm_batch_(). The former is a straightforward
  BLAS-like interface into the axpbyv operation while the latter
  implements a batched gemm via loops over bli_?gemm(). Also
  expanded the CBLAS compatibility layer to include support for
  cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to 
  the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
  for submitting these new APIs via #566.
- Fixed a long-standing bug in common.mk that for some reason never
  manifested until now. Previously, CBLAS source files were compiled
  *without* the location of cblas.h being specified via a -I flag.
  I'm not sure why this worked, but it may be due to the fact that
  the cblas.h file resided in the same directory as all of the CBLAS
  source, and perhaps compilers implicitly add a -I flag for the
  directory that corresponds to the location of the source file being
  compiled. This bug only showed up because some CBLAS-like source code
  was moved into an 'extra' subdirectory of that frame/compat/cblas/src
  directory. After moving the code, compilation for those files failed
  (because the cblas.h header file, presumably, could not be found in
  the same location). This bug was fixed within common.mk by explicitly
  adding the cblas.h directory to the list of -I flags passed to the
  compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
  and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
2021-11-11 16:46:14 -06:00
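For readers unfamiliar with the operation behind ?axpby_(), a minimal sketch follows. It assumes the usual definition y := alpha * x + beta * y for real double-precision vectors with positive strides; the name `axpby_sketch` and its signature are illustrative and are not the BLAS-compatible interface added by the commit above.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: y := alpha * x + beta * y over n elements, where incx and
   incy are positive element strides. */
static void axpby_sketch(size_t n, double alpha, const double *x, size_t incx,
                         double beta, double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);

    for (size_t i = 0; i < n; ++i)
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy];
}
```

A batched gemm in the spirit of the commit message would similarly be a plain loop that calls a gemm routine once per problem in the batch.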
Devin Matthews
28b0982ea7 Refactored her[2]k/syr[2]k in terms of gemmt. (#531)
Details:
- Renamed herk macrokernels and supporting files and functions to gemmt, 
  which is possible since at the macrokernel level they are identical. 
  Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
  level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
  functions rather than cpp macros that instantiate multiple functions.
  Thanks to Devin Matthews for his efforts on this issue (#531).
- Check that the maximum stack buffer size is sufficiently large
  relative to the register blocksizes for each datatype, and do so when
  the context is initialized rather than when an operation is called.
  Note that with this change, users who pass in their own contexts into
  the expert interfaces currently will *not* have any checks performed.
  Thanks to Devin Matthews for suggesting this change.
2021-11-10 12:34:50 -06:00
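The recasting of syrk in terms of gemmt can be illustrated with a naive sketch. It assumes gemmt means a general product A * B in which only one triangle of C is updated; the names `tri_t`, `gemmt_ref`, and `syrk_via_gemmt` are hypothetical, and the loops bear no resemblance to the BLIS macrokernels.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TRI_LOWER, TRI_UPPER } tri_t;  /* hypothetical type */

/* Sketch: C := beta * C + alpha * A * B for n x k A and k x n B,
   touching only the chosen triangle of the n x n matrix C. */
static void gemmt_ref(tri_t tri, size_t n, size_t k, double alpha,
                      const double *a, size_t rs_a, size_t cs_a,
                      const double *b, size_t rs_b, size_t cs_b,
                      double beta, double *c, size_t rs_c, size_t cs_c)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            /* Skip elements outside the requested triangle. */
            if ((tri == TRI_LOWER && i < j) || (tri == TRI_UPPER && i > j))
                continue;
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i * rs_a + p * cs_a] * b[p * rs_b + j * cs_b];
            double *cij = &c[i * rs_c + j * cs_c];
            *cij = beta * (*cij) + alpha * sum;
        }
}

/* syrk as gemmt: B is A viewed with its strides swapped, i.e. A^T. */
static void syrk_via_gemmt(tri_t tri, size_t n, size_t k, double alpha,
                           const double *a, size_t rs_a, size_t cs_a,
                           double beta, double *c, size_t rs_c, size_t cs_c)
{
    assert(a != NULL && c != NULL);
    gemmt_ref(tri, n, k, alpha, a, rs_a, cs_a, a, cs_a, rs_a,
              beta, c, rs_c, cs_c);
}
```

The transpose costs nothing here because it is expressed purely by exchanging the row and column strides of A.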
Field G. Van Zee
cfa3db3f34 Fixed bug in mixed-dt gemm introduced in e9da642.
Details:
- Fixed a bug that broke certain mixed-datatype gemm behavior. This
  bug was introduced recently in e9da642 when the code that performs
  the operation transposition (for microkernel IO preference purposes)
  was moved up so that it occurred sooner. However, when I moved that
  code, I failed to notice that there was a cpp-protected "if"
  conditional that applied to the entire code block that was moved. Once
  the code block was relocated, the orphaned if-statement was now
  (erroneously) glomming on to the next thing that happened to be in the
  function, which happened to be the call to bli_rntm_set_ways_for_op(),
  causing a rather odd memory exhaustion error in the sba due to the
  num_threads field of the rntm_t still being -1 (because the rntm_t
  fields were never processed as they should have been). Thanks to
  @ArcadioN09 (Snehith) for reporting this error and helpfully including
  relevant memory trace output.
2021-11-03 18:13:56 -05:00
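The class of bug described in the commit above, an unbraced "if" left behind after its body was moved, can be reproduced in a few lines. Everything in this sketch is hypothetical (the function names, the constants, the condition); it only mirrors the shape of the problem, in which a value that should always be set stays at -1 unless the leftover condition happens to hold.

```c
#include <assert.h>
#include <stdbool.h>

enum { UNSET = -1, DEFAULT_THREADS = 4 };  /* hypothetical values */

/* Buggy shape: the block this `if` used to guard was moved elsewhere,
   so the condition now guards whatever statement follows it. */
static int process_threads_buggy(bool needs_transpose)
{
    int num_threads = UNSET;
    if (needs_transpose)
        /* (the guarded block was relocated; nothing is left here) */
        num_threads = DEFAULT_THREADS;  /* accidentally conditional */
    return num_threads;
}

/* Fixed shape: the leftover condition is removed along with its block. */
static int process_threads_fixed(bool needs_transpose)
{
    int num_threads = UNSET;
    (void)needs_transpose;  /* handled earlier, where the block now lives */
    num_threads = DEFAULT_THREADS;
    assert(num_threads != UNSET);
    return num_threads;
}
```

Bracing every conditional body, even a single statement, makes this kind of orphaning visible when code is moved.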
Field G. Van Zee
f065a8070f Removed support for 3m, 4m induced methods.
Details:
- Removed support for all induced methods except for 1m. This included
  removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any
  code that existed only to support those implementations. These
  implementations were rarely used and posed code maintenance challenges
  for BLIS's maintainers going forward.
- Removed reference kernels for packm that pack 3m and 4m micropanels,
  and removed 3m/4m-related code from bli_cntx_ref.c.
- Removed support for 3m/4m from the code in frame/ind, then reorganized
  and streamlined the remaining code in that directory. The *ind(),
  *nat(), and *1m() APIs were all removed. (These additional API layers
  no longer made as much sense with only one induced method (1m) being
  supported.) The bli_ind.c file (and header) were moved to frame/base
  and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to
  frame/3.
- Removed 3m/4m support from the code in frame/1m/packm.
- Removed 3m/4m support from trmm/trsm macrokernels and simplified some
  pointer arithmetic that was previously expressed in terms of the
  bli_ptr_inc_by_frac() static inline function (whose definition was
  also removed).
- Removed the following subdirectories of level-0 macro headers from
  frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros
  defined in these directories were used exclusively for 3m and 4m
  method codes.
- Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in
  light of 1m being the only induced method left within BLIS.
- Removed dt_on_output field within auxinfo_t and its associated
  accessor functions.
- Re-indexed the 1e/1r pack schemas after removing those associated with
  variants of the 3m and 4m methods. This leaves two bits unused within
  the pack format portion of the schema bitfield. (See bli_type_defs.h
  for more info.)
- Spun off the basic and expert interfaces to the object and typed APIs
  into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c
  and bli_l3_tapi_ex.c.
- Moved the level-3 operation-specific _check function calls from the
  operations' _front() functions to the corresponding _ex() function of
  the object API. (This change roughly maintains where the _check()
  functions are called in the call stack but lays the groundwork for
  future changes that may come to the level-3 object APIs.) Minor
  modifications to bli_l3_check.c to allow the check() functions to be
  called from the expert interface APIs.
- Removed support within the testsuite for testing the aforementioned
  induced methods, and updated the standalone test drivers in the 'test'
  directory to reflect the retirement of those induced methods.
- Modified the sandbox contract so that the user is obliged to define
  bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light
  of the *nat() functions no longer existing.) Also updated the existing
  'power10' and 'gemmlike' sandboxes to come into compliance with the
  new sandbox rules.
- Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation
  to reflect the retirement of 3m/4m, and also modified Sandboxes.md to
  bring the document into alignment with new conventions.
- Updated various comments; removed segments of commented-out code.
2021-10-28 16:05:43 -05:00
Field G. Van Zee
e8caf200a9 Updated do_sde.sh to get SDE from GitHub.
Details:
- Updated travis/do_sde.sh so that the script downloads the SDE tarball
  from a new ci-utils repository on GitHub rather than from Intel's
  website. This change is being made in an attempt to circumvent Travis
  CI's recent troubles with downloading the SDE from Intel's website via
  curl. Thanks to Devin Matthews for suggesting the idea.
2021-10-18 13:04:15 -05:00
Field G. Van Zee
290ff4b1c2 Disable SDE testing of old AMD microarchitectures.
Details:
- Skip testing on piledriver, steamroller, and excavator platforms
  in travis/do_sde.sh.
2021-10-14 16:09:43 -05:00
Field G. Van Zee
514fd10174 Fixed substitution bug in configure.
Details:
- Fixed a bug in configure related to the building of the so-called
  config list. When processing the contents of config_registry,
  configure creates a series of structures and lists that allow for
  various mappings related to configuration families, subconfigs,
  and kernel sets. Two of those lists are built via substitution
  of umbrella families with their subconfig members, and one of
  those lists was improperly performing the substitution in a way
  that would erroneously match on partial umbrella family names.
  That code was changed to match the code that was already doing
  the substitution properly, via substitute_words().
- Added comments noting the importance of using substitute_words()
  in both instances.
2021-10-14 13:50:28 -05:00
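The difference between matching whole words and matching partial names can be sketched as follows. The real substitute_words() is a shell function in configure; this C version, including its name, signature, and return convention, is an assumption written only to show the technique. The family and subconfig names used in the usage note are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: copy the space-separated `list` into `out`, replacing each
   token that equals `word` exactly with `repl`. Tokens that merely
   start with `word` are left alone. Returns 0 on success, or -1 if
   `out` is too small. */
static int substitute_words_sketch(const char *list, const char *word,
                                   const char *repl,
                                   char *out, size_t out_size)
{
    const size_t word_len = strlen(word);
    size_t used = 0;
    const char *p = list;

    assert(out_size > 0);
    out[0] = '\0';

    while (*p != '\0') {
        size_t len = strcspn(p, " ");  /* length of the current token */
        if (len > 0) {
            const char *tok = p;
            size_t tok_len = len;
            if (len == word_len && strncmp(p, word, len) == 0) {
                tok = repl;            /* whole-word match only */
                tok_len = strlen(repl);
            }
            size_t need = tok_len + (used > 0 ? 1 : 0);
            if (used + need + 1 > out_size)
                return -1;
            if (used > 0)
                out[used++] = ' ';
            memcpy(out + used, tok, tok_len);
            used += tok_len;
            out[used] = '\0';
        }
        p += len;
        while (*p == ' ')
            ++p;
    }
    return 0;
}
```

Given the list "amd64 amd64_legacy", substituting "amd64" with "zen zen2" yields "zen zen2 amd64_legacy"; a plain substring replacement would also have rewritten the start of "amd64_legacy".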
Field G. Van Zee
e9da6425e2 Allow use of 1m with mixing of row/col-pref ukrs.
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
  precision real and double-precision real ukernels had opposing I/O
  preferences (row-preferential sgemm ukernel + column-preferential
  dgemm ukernel, or vice versa). The fix involved adjusting the API
  to bli_cntx_set_ind_blkszs() so that the induced method context init
  function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
  function for only one datatype at a time. This allowed the blocksize
  scaling (which varies depending on whether we're doing 1m_r or 1m_c)
  to happen on a per-datatype basis. This fixes issue #557. Thanks to
  Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
  bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
  called from each level-3 _front() function. The pack_t schemas in the
  cntx_t were also removed entirely, along with the associated accessor
  functions. This in turn required updating the trsm1m-related virtual
  ukernels to read the pack schema for B from the auxinfo_t struct
  rather than the context. This also required slight tweaks to
  bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
  the microkernel IO preference. This mostly only affects gemm. Thanks
  to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
  querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
  in bli_gks.c. The context initialization functions for induced methods
  were previously passed a dt argument, but I can no longer figure out
  *why* they were passed this value. To reduce confusion, I've removed
  the dt argument (including also from the function definition +
  prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
  breaks high-level implementations of 3m and 4m, but this is okay since
  those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
2021-10-13 14:15:38 -05:00
Minh Quan Ho
81e1034632 Alloc at least 1 elem in pool_t block_ptrs. (#560)
Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
  initialized as any unsigned integer, including 0. However, a length of
  0 could be problematic given that the result of malloc(0) is
  implementation-defined and therefore variable across implementations.
  As a safety measure, we check for
  block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
2021-10-13 13:28:02 -05:00
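The safety measure in the commit above amounts to clamping a requested array length before allocating. The sketch below uses hypothetical helper names and is not the code in bli_pool.c; it assumes only that the array holds void* block pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch: never request a zero-length pointer array. */
static size_t clamp_block_ptrs_len(size_t requested)
{
    return requested == 0 ? 1 : requested;
}

/* Allocate the array and report the length actually used.
   The caller owns the returned memory and must free() it. */
static void **alloc_block_ptrs(size_t requested, size_t *actual_len)
{
    size_t len = clamp_block_ptrs_len(requested);

    assert(actual_len != NULL);
    *actual_len = len;
    return malloc(len * sizeof(void *));
}
```

Clamping avoids relying on whether a given C library returns NULL or a unique pointer for a zero-size request.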
Minh Quan Ho
327481a4b0 Fix insufficient pool-growing logic in bli_pool.c. (#559)
Details:
- The current mechanism for growing a pool_t doubles the length of the
  block_ptrs array every time the array length needs to be increased
  due to new blocks being added. However, that logic did not take into
  account the new total number of blocks, and the fact that the caller
  may be requesting more blocks than would fit even after doubling the
  current length of block_ptrs. The code comments now contain two
  illustrative examples that show why, even after doubling, we must
  always have at least enough room to fit all of the old blocks plus
  the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
  from growing any pool_t that is initialized with a block_ptrs length
  of 0. (Previously, the memory pool for packed buffers of C was 
  initialized with a block_ptrs length of 0, but because it is unused 
  this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>
2021-10-12 12:53:04 -05:00
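The growth rule described in the commit above can be stated as a one-line computation: take the larger of the doubled length and the number of slots actually needed. The function name and parameters below are assumptions for illustration and are not taken from bli_pool.c.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: new length for the block_ptrs array when num_new blocks are
   added to a pool that currently holds num_blocks blocks in an array
   of length cur_len. */
static size_t grown_block_ptrs_len(size_t cur_len, size_t num_blocks,
                                   size_t num_new)
{
    size_t needed = num_blocks + num_new;  /* old blocks plus new ones */
    size_t new_len = 2 * cur_len;          /* the usual doubling */

    assert(num_blocks <= cur_len);

    if (new_len < needed)
        new_len = needed;                  /* doubling alone was too small */
    return new_len;
}
```

For example, a full array of length 4 that must accept 10 more blocks needs length 14, not 8, and an array of length 0 cannot grow by doubling at all.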
Devin Matthews
32a6d93ef6 Merge pull request #543 from xrq-phys/armsve-packm-fix
ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9
2021-10-09 15:53:54 -05:00
Devin Matthews
408906fdd8 Merge pull request #542 from xrq-phys/armsve-zgemm
Arm SVE CGEMM / ZGEMM Natural Kernels
2021-10-09 15:50:25 -05:00
RuQing Xu
ccf16289d2 Arm SVE C/ZGEMM Fix FMOV 0 Mistake
FMOV [hsd]M, #imm does not allow zero immediate.
Use wzr, xzr instead.
2021-10-08 12:34:14 +09:00
RuQing Xu
82b61283b2 SH Kernel Unused Either 2021-10-08 12:17:29 +09:00
RuQing Xu
1749dfa493 Arm SVE C/ZGEMM Support *beta==0 2021-10-08 12:13:08 +09:00
RuQing Xu
4b648e47da Arm SVE Config armsve Use ZGEMM/CGEMM 2021-10-08 12:13:08 +09:00
RuQing Xu
f76ea905e2 Arm SVE: Update Perf. Graph
Pic. size seems a bit different from upstream.
Generated w/ MATLAB. Open to any change.
2021-10-08 12:13:08 +09:00
RuQing Xu
66a018e6ad Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
9e1e781cb5 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 2021-10-08 12:13:08 +09:00
RuQing Xu
f7c6c2b119 A64FX Config Use ZGEMM/CGEMM 2021-10-08 12:13:08 +09:00
RuQing Xu
e4cabb977d Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg 2021-10-08 12:13:08 +09:00
RuQing Xu
b677e0d61b Arm SVE Add SGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
3f68e8309f Arm SVE ZGEMM Support Gather Load / Scatt. St. 2021-10-08 12:13:07 +09:00
RuQing Xu
c19db2ff82 Arm SVE Add ZGEMM 2Vx10 Unindexed 2021-10-08 12:13:07 +09:00
RuQing Xu
e13abde30b Arm SVE Add ZGEMM 2Vx7 Unindexed 2021-10-08 12:13:06 +09:00
RuQing Xu
49b9d7998e Arm SVE Add ZGEMM 2Vx8 Unindexed 2021-10-08 12:12:48 +09:00
Devin Matthews
4277fec0d0 Merge pull request #533 from xrq-phys/arm64-hi-bw
ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig
2021-10-07 13:47:22 -05:00
Devin Matthews
2329d99016 Update Travis CI badge
[ci skip]
2021-10-07 12:37:58 -05:00
RuQing Xu
f44149f787 Armv8 Trash New Bulk Kernels
- They didn't make much improvement.
- Can't register row-preferential and column-preferential ukrs at the same time.
  Will break 1m.
2021-10-08 02:35:58 +09:00
Devin Matthews
70b52cadc5 Enable testing 1m in make check. 2021-10-07 12:34:35 -05:00
RuQing Xu
2604f40713 Config ArmSVE Unregister 12xk. Move 12xk to Old 2021-10-07 02:39:00 +09:00
RuQing Xu
1e3200326b Revert __has_include(). Distinguish w/ BLIS_FAMILY_** 2021-10-07 02:37:14 +09:00
RuQing Xu
a4066f278a Register firestorm into arm64 Metaconfig 2021-10-07 02:26:05 +09:00
RuQing Xu
d7a3372247 Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo 2021-10-07 02:25:14 +09:00
RuQing Xu
2920dde5ac Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo 2021-10-07 02:01:45 +09:00
Devin Matthews
14b13583f1 Add test for Apple M1 (firestorm)
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
2021-10-06 10:22:34 -05:00