263 Commits

Author SHA1 Message Date
Chandrashekara KR
d7377f967c Merge "AOCL-Windows: Update BLIS build system" into amd-staging-milan-3.1 2021-06-17 08:49:55 -04:00
Chandrashekara K R
f94e3ad237 AOCL-Windows: Update BLIS build system
1. Added support in the CMake scripts for linking libomp in BLIS multithreaded builds.
2. Added the ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement to the blis\kernels\zen\1f CMake file to build the newly added file.
3. Added new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for the gemm_batch and axpby APIs.
4. Changed the file open mode from binary to text in blis/testsuite/src/test_libblis.c to avoid line-ending issues across operating systems.
5. Added the definition of the macro BLIS_DISABLE_TRSM_PREINVERSION to the main CMakeLists.txt file.

AMD Internal : [CPUPL-1630]

Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
2021-06-15 16:49:08 +05:30
Nageshwar Singh
6ca50e1b72 Added bench utility for copyv API
AOCL-Internal: [CPUPL-1591]
Change-Id: I00ddad565cb87cd9371d7b1df2b57394fef437e0
2021-06-09 12:29:49 +05:30
Nageshwar Singh
6842c2a30e Bench trsv logging error
Details:
  - Passing enum rather than char for uplo, transa, and diaga
  - Deleting the log file and other temporary files that were merged into the codebase from amax

AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
2021-06-08 11:54:55 +05:30
Nageshwar Singh
61b7584580 Bench addition for amaxv API
AOCL-Internal: [CPUPL-1591]
Change-Id: Ia9754dfed1a7302d5c267858f9005c8f64e28b46
2021-06-04 17:45:04 +05:30
Dipal Madhukar Zambare
821fa267c9 Merge "Updated makefiles to fix issues introduced in merge" into amd-staging-milan-3.1 2021-05-05 23:42:15 -04:00
Dipal M Zambare
7454cca9e7 Updated makefiles to fix issues introduced in merge
- Updated Makefile to include DTL files in library build
- Updated Makefile to include cpp header file installation
- Updated test/makefile to include extra API added by AMD team.

AMD-Internal: [CPUPL-1559]
Change-Id: I249c6935d5ff5fb645f9deec7e0218575484be13
2021-05-05 14:59:15 +05:30
Nallani Bhaskar
f917d826b5 Updated test application to work with row major cblas
Details:
1. Fixed reading of leading dimensions in test_gemm.c based on row/col major
2. Reduced redundant code and adjusted alignment
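
As an illustration of why the leading dimension depends on storage order, here is a minimal sketch. The helper names are hypothetical and are not taken from test_gemm.c; the sketch assumes an unpadded m x n matrix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: smallest valid leading dimension of an unpadded
   m x n matrix. Row-major storage strides over a full row (n elements),
   column-major storage strides over a full column (m elements). */
size_t min_leading_dim(size_t m, size_t n, bool row_major)
{
    return row_major ? n : m;
}

/* Hypothetical helper: linear offset of element (i, j) for a given
   storage order and leading dimension ld. */
size_t elem_index(size_t i, size_t j, size_t ld, bool row_major)
{
    return row_major ? i * ld + j : i + j * ld;
}
```

For a 3 x 5 matrix this gives a leading dimension of 5 in row-major order and 3 in column-major order, so a driver that reads the wrong one addresses the wrong elements.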

Change-Id: I8ca8c81223386fc21c6cc7c1d8f8a2109c9f5343
2021-05-02 23:09:13 +05:30
lcpu
7401effc03 BLIS:merge:
Fixed merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
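
The rule described above can be sketched roughly as follows. This is not the BLIS implementation: the function names are hypothetical, NT_MAX_PRIME stands in for BLIS_NT_MAX_PRIME, and reducing the count by exactly one thread is an assumption, since the message does not say how the reduction is performed.

```c
#include <stdbool.h>

/* Stand-in for BLIS_NT_MAX_PRIME; 11 is the default stated in the message. */
#define NT_MAX_PRIME 11

/* Trial-division primality test; adequate for realistic thread counts. */
static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Hypothetical helper: if the requested thread count is prime and exceeds
   the threshold, drop one thread (assumption) so that the count can be
   factorized across loops. */
long adjust_num_threads(long nt)
{
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

Under these assumptions a request for 13 threads becomes 12, while 11, 7, and 16 are left unchanged.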

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-Internal: [CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
satish kumar nuggu
c0d16f70e4 Added CBLAS API interface and memory alignment check
Details:
1. Added CBLAS API in test_gemm.c and test_trsm.c
2. Creating matrices with memory-aligned leading dimensions irrespective of dimension.
3. Shifting power-of-2-aligned memory to non-powers of 2 by adding an extra cache line at the end.
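
One possible reading of items 2 and 3 is sketched below. It is an interpretation, not the code from test_gemm.c or test_trsm.c: the helper names, the 64-byte cache line, and the requirement that the element size divides the cache line size are all assumptions.

```c
#include <stddef.h>

/* Assumed cache line size in bytes. */
#define CACHE_LINE_BYTES 64

static int is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: returns a padded leading dimension, in elements.
   The row/column length is rounded up to a whole number of cache lines;
   if the result in bytes is a power of two, one more cache line is added
   so that consecutive rows/columns do not sit a power of two apart.
   Assumes elem_size divides CACHE_LINE_BYTES. */
size_t padded_ld(size_t dim, size_t elem_size)
{
    size_t bytes = dim * elem_size;
    bytes = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
    if (is_pow2(bytes))
        bytes += CACHE_LINE_BYTES;
    return bytes / elem_size;
}
```

With double-precision elements, a dimension of 100 pads to 104, while a dimension of 128 (1024 bytes, a power of two) pads to 136.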

AMD-Internal: [CPUPL-1500] [CPUPL-1450]

Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a
2021-04-26 13:44:30 +05:30
Field G. Van Zee
ba3ba8da83 Minor updates and fixes to test/3/octave scripts.
Details:
- Fixed an issue where the wrong string was being passed in for the
  vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
2021-04-06 18:39:58 -05:00
Kiran Varaganti
657f58b82b Fix test_dotv.c when complex-return=intel
When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototypes change.
The return value becomes the first argument of these functions, and the functions themselves return void.
This fix addresses the change in the function declarations when --complex-return=intel is enabled.
Missing Trace and Log statements for this configuration are now added.
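
The two calling conventions can be illustrated with the sketch below. The function names are hypothetical and only mimic the Fortran-style pointer arguments; the real zdotu_ and cdotu_ prototypes are not shown in this message. Positive increments are assumed.

```c
#include <complex.h>

/* Return-by-value convention: the complex result is the function's
   return value. Hypothetical name, not the real zdotu_. */
double complex sketch_zdotu_gnu(const int *n,
                                const double complex *x, const int *incx,
                                const double complex *y, const int *incy)
{
    double complex sum = 0.0;
    for (int i = 0; i < *n; ++i)
        sum += x[i * *incx] * y[i * *incy];  /* unconjugated dot product */
    return sum;
}

/* "intel"-style convention as described above: the result is written
   through a leading pointer argument and the function returns void. */
void sketch_zdotu_intel(double complex *result, const int *n,
                        const double complex *x, const int *incx,
                        const double complex *y, const int *incy)
{
    *result = sketch_zdotu_gnu(n, x, incx, y, incy);
}
```

A caller compiled against the wrong convention would pass the arguments one slot off, which is why the test driver has to select the declaration that matches the build.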
[CPUPL-1376]

Change-Id: Ib420989da71839211c16088bf431a2ad775a3978
2021-04-06 15:38:12 +05:30
Field G. Van Zee
159ca6f01a Made test/3/octave scripts robust to missing data.
Details:
- Modified the octave scripts in test/3 so that the script does not
  choke when one or more of the expected OpenBLAS, Eigen, or vendor data
  files is missing. (The BLIS data set, however, must be complete.) When
  a file is missing, that data series is simply not included on that
  particular graph. Also factored out a lot of the redundant logic from
  plot_panel_4x5.m into a separate function in read_data.m.
2021-03-24 15:57:32 -05:00
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Kumar, Phani
477fc41fff Cmake script changes and blis.h changes for amd-staging-milan-3.0
AMD Internal : [CPUPL-1083]

Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580
2020-11-24 06:12:25 -05:00
Dipal M Zambare
0a3d94c9a2 Updated test drivers for dotv, scalv and swapv.
Added traces in the CBLAS layer for these APIs.
The test drivers previously had no calls for complex data
types; they are now updated to support them.

AMD-Internal : [CPUPL-1315]

Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
2020-11-24 10:26:32 +05:30
Madan mohan Manokar
1d8fab0996 Test driver fix for her and her2
Fixed test driver code for her, her2
Support added to handle complex and double complex data type in test driver.

Change-Id: If65939e99d8cf77e0fb70561166d84bf67d0321d
AMD-Internal: [CPUPL-1326]
2020-11-23 04:10:43 -05:00
Madan Mohan Manokar
35d33bab6a Merge "Test driver fix for her, her2, herk and her2k" into amd-staging-milan-3.0 2020-11-19 05:47:31 -05:00
Madan mohan Manokar
38698f0dfd Test driver fix for her, her2, herk and her2k
Fixed test driver code for the her, her2, herk and her2k functions.
These functions support only the complex and double complex data types; the test code is updated accordingly.

Change-Id: Iee7b79abda4a2959a265c420d23879bf47f2c38d
AMD-Internal: [CPUPL-1313]
2020-11-19 12:58:21 +05:30
satish kumar nuggu
17f994bd15 Added Blas interface for ?imatcopy, ?omatcopy, ?omatadd, ?omatcopy2
AMD-Internal: [CPUPL-1116]
Original review was in this commit http://gerrit-git.amd.com/c/cpulibraries/er/blis/+/428165.
Added new commit for transpose APIs

Change-Id: I322389cc0be0aaccf82d1d0bb4476beea8694cd8
2020-11-18 12:55:36 +05:30
Field G. Van Zee
88ad841434 Squash-merge 'pr' into 'squash'. (#457)
Merged contributions from AMD's AOCL BLIS (#448).
  
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
  only the lower or upper triangle of a square matrix C. For now, only
  the conventional/large code path will be supported (in vanilla BLIS).
  This was accomplished by leveraging the existing variant logic for
  herk. However, some of the infrastructure to support a gemmtsup is
  included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
  wrapper to a set of polymorphic CBLAS-like function wrappers defined
  in another header (cblas.hh). These two headers are installed if
  running the 'install' target with INSTALL_HH is set to 'yes'. (Also
  added a set of unit tests that exercise blis.hh, although they are
  disabled for now because they aren't compatible with out-of-tree
  builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
  within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
  various minor updates to dotv and scalv kernels. Also added various
  sup kernels contributed by AMD to kernels/zen/3. However, these
  kernels are (for now) not yet used, in part because they caused
  AppVeyor clang failures, and also because I have not found time to
  review and vet them.
- Output the python found during configure into the definition of PYTHON
  in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
  to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
  bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
  bug surfaced because the gemmt module verifies its computation using
  gemm with its beta parameter set to zero, which, on a cortexa15 system
  caused the gemm kernel code to unconditionally multiply the
  uninitialized C data by beta. The C matrix likely contained
  non-numeric values such as NaN, which then would have resulted in a
  false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
  in bli_l3_blocksize.c, was inadvertently being defined in terms of
  helper functions meant for trmm. This bug was probably harmless since
  the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
  kernels/zen/3/bli_gemm_small.c since those macros are not used in
  vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
  accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
  Windows systems.
- Various whitespace changes.
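
The beta = 0 item above rests on a floating-point detail: multiplying uninitialized data by zero is not safe, because NaN * 0 is still NaN. The sketch below shows the general pattern with a hypothetical scalar update routine; it is not the armv7a microkernel.

```c
#include <stddef.h>

/* Hypothetical element-wise update c := beta*c + alpha*ab over n entries.
   When beta is exactly zero, c is overwritten without being read, so any
   NaN or Inf in the uninitialized output cannot leak into the result. */
void update_c(size_t n, double alpha, const double *ab, double beta, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        if (beta == 0.0)
            c[i] = alpha * ab[i];                /* overwrite; never read c */
        else
            c[i] = beta * c[i] + alpha * ab[i];  /* general case */
    }
}
```

Without the explicit branch, a NaN in c[i] would survive the multiplication by zero and show up as a false test failure, which matches the symptom described above.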
2020-11-14 09:39:48 -06:00
managalv
aae48c2221 Optimised AXPYF routine for complex float and complex double
Details:
    - Added SIMD code
    - Processing 5 rows at a time in SIMD loop to improve performance

AMD-Internal: [CPUPL-1054]

Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
2020-11-06 18:42:13 +05:30
managalv
68b4ff976b Added Function trace and Input logging for dotv and gemv
Change-Id: I992bd80b2322d6c387f609ecd70c1109c13f6254
AMD-Internal: [CPUPL-1274]
2020-11-06 03:53:43 +05:30
Meghana Vankadari
0775f09b41 Added debug trace and log support for copy and ger routines
Change-Id: Id7fb64c0a626b2f8f53e89ee7df4391693eb4f4c
2020-11-02 22:56:58 -05:00
Kiran Varaganti
65daaab6ac Fix Bug in DTL
When the library is built single-threaded with tracing enabled, the test
applications in the test folder fail to compile. In the file aoclos.c, the function
AOCL_gettid() uses omp_get_thread_num() to get the thread id, which is only
available when the OpenMP-based parallel BLIS library is generated. To fix this,
the single-threaded case now returns zero for the thread id, and the OpenMP function is used
only when the BLIS_ENABLE_OPENMP macro is defined. However, this is not a complete
fix: if the library is built with pthreads, AOCL_gettid() always returns 0, which is
not the intended behaviour.
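
The guard described above might look roughly like the sketch below. The function name is hypothetical and the body is an illustration, not the contents of aoclos.c; only the BLIS_ENABLE_OPENMP macro and omp_get_thread_num() come from the message.

```c
#ifdef BLIS_ENABLE_OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for AOCL_gettid(). The OpenMP call is compiled in
   only when BLIS_ENABLE_OPENMP is defined; otherwise the thread id is
   reported as 0, which is also what a pthreads build would see. */
int sketch_gettid(void)
{
#ifdef BLIS_ENABLE_OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}
```

Because the header and the call sit behind the same macro, a single-threaded build no longer references any OpenMP symbol.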

Change-Id: I5b79ed57d27d0022d3dcab0e2a3a557c8e4ff8ee
2020-11-02 12:05:09 +05:30
bhaskarn
376ac5856b Added BLAS Extension API's: CBLAS_?GEMM3M
AMD-Internal: [CPUPL-1151]

Induced 3M1 method is enabled for CGEMM3M and ZGEMM3M

Change-Id: I8276c5018340d0a45694551f48aad5b735819eae
2020-10-29 17:06:30 +05:30
Nageshwar Singh
dd5b38d221 Added BLIS, BLAS, and CBLAS interface for cblas?amin
Details:
      - The amin API returns the index of the element with the minimum absolute value in a vector.
      - Added amin reference BLIS kernel.
      - Added BLAS and CBLAS interfaces for amin.
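
A minimal sketch of the amin semantics for real double-precision data follows. The function name, the zero-based index, the -1 result for an empty vector, and the choice of the first occurrence on ties are assumptions made for illustration; they are not taken from the BLIS kernel.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first element with the smallest
   absolute value among n elements of x taken with stride incx.
   Returns -1 for an empty vector or a zero stride (assumption). */
long amin_sketch(size_t n, const double *x, size_t incx)
{
    if (n == 0 || incx == 0)
        return -1;

    long best = 0;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];

    for (size_t i = 1; i < n; ++i) {
        double v = x[i * incx];
        double a = v < 0.0 ? -v : v;
        if (a < best_abs) {   /* strict '<' keeps the first minimum */
            best_abs = a;
            best = (long)i;
        }
    }
    return best;
}
```

For the vector {3, -1, 2, 1} this returns 1, since |-1| is the first occurrence of the smallest magnitude.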

AMD-Internal: [CPUPL-1155]

Change-Id: I89c1e37e86950a4582bba70a5d8fc70ac915bd3c
2020-10-28 17:50:27 +05:30
Nageshwar Singh
dbd7b28373 Development of AVX2 axpyv kernels for c and z datatypes.
Details
    - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
    - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.

AMD-Internal: [CPUPL-1231]

Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
2020-10-23 09:33:35 +05:30
Nageshwar Singh
b245ea9c65 cblas_?cabs1 test cblas header file bug
Details:
    - Added BLIS_ENABLE_CBLAS around cblas header file.

AMD-Internal: [CPUPL-1129]

Change-Id: I3baacd26aa96c8eeb753d95210817ffe9b2a3f85
2020-10-13 01:43:55 -04:00
Field G. Van Zee
7677e9ba60 Merge branch 'dev' of github.com:flame/blis into dev 2020-10-09 15:41:25 -05:00
Field G. Van Zee
addcd46b05 Added Epyc 7742 Zen2 ("Rome") sup perf results.
Details:
- Added single-threaded and multithreaded sup performance results to
  docs/PerformanceSmall.md for both sgemm and dgemm. These results were
  gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
  microarchitecture. Special thanks to Jeff Diamond for facilitating
  access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
  and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
  consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
  I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
  microarchitecture section.
2020-10-09 15:41:09 -05:00
Field G. Van Zee
a0849d390d Register l3 sup kernels in zen2 subconfig.
Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
  and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
  system.
2020-10-09 20:22:17 +00:00
Meghana Vankadari
016885348c Added CBLAS interface and test file for gemm_batch API
AMD-Internal: [CPUPL-1184]
Change-Id: Icc5c41429b0d92f1a66a955769cc0518ca4706ee
2020-10-07 06:55:08 -04:00
Kiran Varaganti
aa56c36b82 Fixed Logs & code cleanup
Fixed AOCL DTL logs printing incorrect alpha and beta values for single
precision. Added missing information such as data types, the lower or upper triangular
setting, and the side parameter in the case of TRSM.
Cleaned up and reformatted the files test_cabs1.c, test_axpbyv.c and
test_gemm.c. When dumping trsm parameters, replaced 'side' with bli_is_right(side).

Change-Id: Ic81503ae696956eb074ec208f7109d1a394183d7
2020-10-04 22:17:31 +05:30
Field G. Van Zee
bc4a213a2c Updated matlab (now octave) plot code in test/3.
Details:
- Renamed test/3/matlab to test/3/octave.
- Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
  files for use with octave (which is free and doesn't crash on me
  mid-way through my use of subplot).
- Updated runthese.m scratchpad for zen2 invocations.
- Added Nikolay S.'s subplot_tight() function, along with its license.
2020-09-30 15:28:20 -05:00
Field G. Van Zee
c77ddc4181 Added optional numactl usage to test/3/runme.sh. 2020-09-30 20:15:43 +00:00
Field G. Van Zee
4fd8d9fec2 Tweaked zen2 subconfig's MC cache blocksizes.
Details:
- Updated the MC cache blocksizes registered by the 'zen2' subconfig.
- Minor updates to test/3/Makefile and test/3/runme.sh.
2020-09-28 23:39:05 +00:00
Nageshwar Singh
5243da5cec BLIS: CBLAS Extensions. cblas_?cabs1 : Absolute value of a complex number cabs1
Details:
   - Added cblas extension cblas_?cabs1.
   - Functionality: res = |Re(z)| + |Im(z)|, where z is a complex number and res is a real value: the sum of the absolute values of the real and imaginary parts of z.
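
The formula above can be written out as a short sketch. The struct and function names are hypothetical and do not reproduce the cblas_?cabs1 prototypes.

```c
/* Hypothetical complex type with separate real and imaginary parts. */
typedef struct {
    double re;
    double im;
} zpair_t;

/* res = |Re(z)| + |Im(z)|. Note that this is not the Euclidean modulus
   sqrt(Re(z)^2 + Im(z)^2); it avoids the square root. */
double cabs1_sketch(zpair_t z)
{
    double re = z.re < 0.0 ? -z.re : z.re;
    double im = z.im < 0.0 ? -z.im : z.im;
    return re + im;
}
```

For z = 3 - 4i the result is 7, whereas the Euclidean modulus would be 5.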

AMD-Internal: [CPUPL-1129]

Change-Id: I4a3c265c89527c8fd3060c5d2ed38b1953ce6343
2020-09-23 19:01:02 +05:30
Meghana Vankadari
80828f6fda Added few changes in test_axpbyv.c file
Details:
- Corrected "#if" directive in line 89
- Commented out "#define print" to disable printing the vectors

Change-Id: I9ec3cbfb716540dd3e2264f5c3925d9e0c0c294a
2020-09-16 23:48:01 -04:00
Kiran Varaganti
5a8bd9f41c Enable BLAS interface in BLIS
Modified Makefile in test folder to enable calling BLAS interfaces for BLIS as
well. This is possible by replacing -DBLIS with -DBLAS=\"aocl\" in the
makefile. Also added linking to multi-threaded MKL library.

Change-Id: Iccf2ec99b48bb35da985b69218bc680f678ff7c9
2020-09-16 23:26:44 +05:30
Meghana Vankadari
6c9bf36424 Added BLAS and CBLAS interfaces for axpby API
Details:
- The axpby routines perform a vector-vector operation defined as
  y = a*x + b*y where a, b are scalars and x, y are vectors.
- This API is part of BLAS-like extension APIs
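
The operation y = a*x + b*y can be sketched for real double-precision vectors with unit stride as follows. The function name and signature are hypothetical and are not the BLAS or CBLAS axpby prototypes.

```c
#include <stddef.h>

/* Hypothetical helper: y := a*x + b*y for n contiguous elements. */
void axpby_sketch(size_t n, double a, const double *x, double b, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}
```

With a = 2, b = 0.5, x = {1, 2, 3} and y = {10, 20, 30}, y becomes {7, 14, 21}. Setting b = 1 reduces the operation to the standard axpy.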

Change-Id: I17a53b03bba97de7ae1995a9f086084bd241bcdc
AMD-Internal: [CPUPL-1118]
2020-09-16 09:49:31 +05:30
Field G. Van Zee
e293cae2d1 Implemented sgemmsup assembly kernels.
Details:
- Created a set of single-precision real millikernels and microkernels
  comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
  bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
  file in test/sup. This included edits that allow for separate "small"
  dimensions for single- and double-precision as well as for single-
  vs. multithreaded execution.
2020-09-15 16:09:11 -05:00
dzambare
267a959af1 Rebased amd-staging-milan-3.0 branch on master
-- Rebased on top of master commit # 6e522e5823
-- Updated merged code to remove duplicated code added by auto-merging
-- Updated merged code to rename bool_t type
-- Updated merged code to rename bli_thread_obarrier
-- Updated merged code to rename bli_thread_obroadcast

Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c
AMD-Internal: [CPUPL-1067]
2020-08-06 10:09:29 +05:30
phakumar
c7a914411f BLIS library porting on to Windows:
GEMMT changes porting on to Windows

AMD Internal : [CPUPL-1061]

Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a
2020-08-06 10:09:29 +05:30
Field G. Van Zee
26cd966af7 Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.

Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
2020-08-06 10:09:28 +05:30
Field G. Van Zee
2839b88f00 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-08-06 10:09:28 +05:30
Field G. Van Zee
068072b6b4 Updated BLASFEO results in PerformanceSmall.md.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
  using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
  accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
  containing the mpnpkp results do not need to be preprocessed in order
  to plot half the problem size range (ie: up to 400 instead of the
  800 range of the other shape cases).
- Trivial updates to runme.m.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
50c0ffe390 Added BLASFEO results to docs/PerformanceSmall.md.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
  and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
  docs/Performance.md.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
89105695cc Added sup performance graphs/document to 'docs'.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
  publishes new performance graphs for Kaby Lake and Epyc showcasing
  the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
  For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
  graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
  in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
  document.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
889b90888f Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
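
The acceptance rule described above for the default sup handler (accept the problem if at least one dimension is at or below its threshold) can be sketched as follows. The type and function names are hypothetical and do not correspond to the cntx_t fields or BLIS functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three dimension thresholds. */
typedef struct {
    size_t mt;  /* threshold for m */
    size_t nt;  /* threshold for n */
    size_t kt;  /* threshold for k */
} sup_thresh_t;

/* Returns true if the sup path would accept the problem: at least one
   dimension is smaller than or equal to its threshold. Otherwise the
   problem falls through to the conventional implementation. */
bool sup_accepts(size_t m, size_t n, size_t k, sup_thresh_t t)
{
    return m <= t.mt || n <= t.nt || k <= t.kt;
}
```

With all thresholds at zero, no problem with nonzero dimensions is accepted, which is why the default thresholds effectively disable the sup path until a subconfiguration registers real values.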
2020-08-03 11:48:42 +05:30