Commit Graph

2437 Commits

Author SHA1 Message Date
Nallani Bhaskar
b239a5aee7 Bug fix in sgemmsup 1x16n kernel
Details:

Address increment was missing in bli_sgemmsup_rv_zen_asm_1x16 kernel
while storing output in column major order in beta zero case

JIRA: CPUPL-1548

Change-Id: I36269cd28de6fbef2256451e399f90f0437b0ce1
2021-04-28 21:33:30 +05:30
lcpu
7401effc03 BLIS:merge:
Merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch have been fixed.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
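
The automatic reduction described above can be sketched roughly as follows. This is a minimal illustration, not the BLIS implementation: the names DEMO_NT_MAX_PRIME, demo_is_prime(), and demo_adjust_num_threads() are hypothetical, and reducing the count by exactly one thread is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BLIS_NT_MAX_PRIME (default 11 per the text). */
#define DEMO_NT_MAX_PRIME 11

/* Trial-division primality test; adequate for thread counts. */
static bool demo_is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Assumption: a prime count above the threshold is reduced by one so
   that the result can be factorized across loops. */
long demo_adjust_num_threads(long nt)
{
    assert(nt >= 1);
    if (nt > DEMO_NT_MAX_PRIME && demo_is_prime(nt))
        return nt - 1;
    return nt;
}
```

Under these assumptions, a request for 13 threads would become 12, while 11 (not above the threshold) and 16 (not prime) would pass through unchanged.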

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
Satish Kumar Nuggu
743732c939 Merge "Added CBLAS API interface and memory alignment check" into amd-staging-milan-3.1 2021-04-26 04:27:21 -04:00
satish kumar nuggu
c0d16f70e4 Added CBLAS API interface and memory alignment check
Details:
1. Added CBLAS API in test_gemm.c and test_trsm.c
2. Creating matrices with memory-aligned leading dimensions irrespective of dimension.
3. Shifting power-of-2-aligned memory to non-powers of 2 by adding an extra cache line size at the end.

   AMD-Internal: [CPUPL-1500] [CPUPL-1450]

Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a
2021-04-26 13:44:30 +05:30
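
One possible reading of items 2 and 3 in the commit above is sketched below. It is an interpretation, not the code from test_gemm.c or test_trsm.c: the function demo_padded_ld(), the 64-byte cache line, and the requirement that the element size divides the cache line are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes. */
#define DEMO_CACHE_LINE 64

static int demo_is_pow2(size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Returns a leading dimension (in elements) whose byte length is a
   multiple of the cache line but not a power of two. */
size_t demo_padded_ld(size_t ld, size_t elem_size)
{
    assert(elem_size > 0 && DEMO_CACHE_LINE % elem_size == 0);

    size_t bytes = ld * elem_size;

    /* Round the leading dimension up to a whole number of cache lines. */
    bytes = (bytes + DEMO_CACHE_LINE - 1) / DEMO_CACHE_LINE * DEMO_CACHE_LINE;

    /* Shift power-of-two sizes by one extra cache line. */
    if (demo_is_pow2(bytes))
        bytes += DEMO_CACHE_LINE;

    return bytes / elem_size;
}
```

With these assumptions, a leading dimension of 512 doubles (4096 bytes) would be padded to 520 elements, while 100 doubles would only be rounded up to 104.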
Field G. Van Zee
6a4aa986ff Fixed typo in Table of Contents. 2021-04-23 13:10:01 -05:00
Field G. Van Zee
f6424b5b82 Added dedicated Performance section to README.md.
Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
  Documentation section into a new Performance section dedicated to
  those two links. (The previous entries remain redundantly listed
  within Documentation section.) Thanks to Robert van de Geijn for
  suggesting this change.
2021-04-23 13:08:06 -05:00
Madan mohan Manokar
f6088ac1cf Enabling 3m_sqp and 3m1 methods
1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.

AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810
2021-04-22 02:47:31 -04:00
Meghana Vankadari
d68f427ced Merge "Added bench app for gemmt - input is a log file generated from AOCL DTL" into amd-staging-milan-3.1 2021-04-22 00:20:16 -04:00
Devin Matthews
40ce5fd241 Merge pull request #493 from cassiersg/patch-1
Fix typo in FAQ.md
2021-04-21 09:54:25 -05:00
Gaëtan Cassiers
1f3461a5a5 Fix typo in FAQ.md 2021-04-21 16:49:05 +02:00
Mangala V
9e147912ee Merge "ZGEMM SUP: Removed unused assembly instructions" into amd-staging-milan-3.1 2021-04-19 03:08:31 -04:00
managalv
6c3741cd3e ZGEMM SUP: Removed unused assembly instructions
Removed memory operations that were unused
Modified labels to be unique to a file
Row stride update is done once to avoid multiple mul instructions

AMD Internal  : [CPUPL-1419]

Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103
2021-04-19 20:31:03 +05:30
Meghana Vankadari
713ca659b5 Added bench app for gemmt - input is a log file generated from AOCL DTL
Change-Id: Ia3390b529244f529d9741c86a6f8dc35a589f714
2021-04-19 09:40:24 +05:30
Manideep Kurumella
3c2c8157c9 Merge "SGEMV performance improvement." into amd-staging-milan-3.1 2021-04-12 01:22:48 -04:00
mkurumel
f8525a888e SGEMV performance improvement.
1. bli_sdotxf_zen_int_8:
   Used the hadd_ps intrinsic instead of dp_ps to
   add partial dot outputs.

AMD Internal  : [CPUPL-1512]

Change-Id: I6e8e71a9cf8c1f30a1710dd1c67f193a998beb03
2021-04-12 10:47:23 +05:30
Madan mohan Manokar
997133dc11 sup zgemm improvement
1. In zgemm, mkernel outperforms nkernel for both m > n, and n > m.
2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on analysis done.

Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]
2021-04-09 01:45:04 -04:00
Field G. Van Zee
6280757be3 Minor updates to a64fx section of Performance.md. 2021-04-07 13:03:56 -05:00
RuQing Xu
1e6ed823c6 Additional A64fx Comments (#490)
* Performance.md Update A64fx Comments

- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.

* Include Another Fix in armsve-cfg-vendor

A prototype was missing, so the void* pointer was not fully returned.
2021-04-07 12:59:26 -05:00
Field G. Van Zee
2688f21a5b Added Fujitsu A64fx (512-bit SVE) perf results.
Details:
- Added single-threaded and multithreaded performance results to
  docs/Performance.md. These results were gathered on the "Fugaku"
  Fujitsu A64fx supercomputer at the RIKEN Center for Computational
  Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
  Nassyr for their work in developing and optimizing A64fx support in
  BLIS and RuQing for gathering the performance data that is reflected
  in these new graphs.
2021-04-06 19:02:37 -05:00
Field G. Van Zee
ba3ba8da83 Minor updates and fixes to test/3/octave scripts.
Details:
- Fixed an issue where the wrong string was being passed in for the
  vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
2021-04-06 18:39:58 -05:00
Kiran Varaganti
a7b7fcae59 Merge "Fix test_dotv.c when complex-return=intel" into amd-staging-milan-3.1 2021-04-06 10:01:45 -04:00
Kiran Varaganti
657f58b82b Fix test_dotv.c when complex-return=intel
When BLIS is built with --complex-return=intel, the prototypes of the zdotu_ and cdotu_ functions change.
The "return parameter" becomes the first argument of these functions, and the functions return void.
This fix accounts for the changed function declarations when --complex-return=intel is enabled.
Missing Trace and Log statements for this configuration are now added.
[CPUPL-1376]

Change-Id: Ib420989da71839211c16088bf431a2ad775a3978
2021-04-06 15:38:12 +05:30
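
The two calling conventions mentioned in the commit above can be illustrated with a small sketch. The functions demo_zdotu_gnu() and demo_zdotu_intel() and the type demo_dcomplex are hypothetical stand-ins, not the BLIS or BLAS symbols; the sketch also assumes positive increments.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double real; double imag; } demo_dcomplex;

/* Default convention: the complex result is the function's return value. */
demo_dcomplex demo_zdotu_gnu(const int *n,
                             const demo_dcomplex *x, const int *incx,
                             const demo_dcomplex *y, const int *incy)
{
    demo_dcomplex rho = { 0.0, 0.0 };
    assert(n != NULL && incx != NULL && incy != NULL);

    for (int i = 0; i < *n; ++i) {
        const demo_dcomplex a = x[i * *incx];
        const demo_dcomplex b = y[i * *incy];
        /* Unconjugated product a * b accumulated into rho. */
        rho.real += a.real * b.real - a.imag * b.imag;
        rho.imag += a.real * b.imag + a.imag * b.real;
    }
    return rho;
}

/* "intel" convention: the result is written through a leading pointer
   argument and the function itself returns void. */
void demo_zdotu_intel(demo_dcomplex *rho, const int *n,
                      const demo_dcomplex *x, const int *incx,
                      const demo_dcomplex *y, const int *incy)
{
    assert(rho != NULL);
    *rho = demo_zdotu_gnu(n, x, incx, y, incy);
}
```

A caller compiled against the wrong prototype would pass its arguments in the wrong positions, which is why a test driver has to switch declarations together with the build option.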
Madan Mohan Manokar
2aad3fbe55 Merge "disabled zgemm induced and gemm sqp temporarily." into amd-staging-milan-3.1 2021-04-05 05:51:12 -04:00
Madan mohan Manokar
7112b73d0d disabled zgemm induced and gemm sqp temporarily.
1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.

AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2
2021-04-04 20:43:03 +05:30
Devin Matthews
90508192f2 Update do_sde.sh (#489)
Update to a newer version of SDE, and do a direct download as it seems you don't have to click-through the license anymore.
2021-03-30 21:16:44 -05:00
Nicholai Tukanov
22c6b5dc4c Fixed bug in power10 microkernel I/O. (#488)
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
  not store the microtile result correctly due to incorrect index
  calculations. (The error was introduced when I reorganized the
  'kernels/power10/3' directory.)
2021-03-30 19:07:42 -05:00
Dipal M Zambare
5562a27823 Added check for zero dimensions and early return in ?gemv and ?scal API.
If one of the passed dimensions is zero, these APIs return early,
which avoids crashes in case any other pointer inputs are null.

AMD-Internal: [SWLCSG-602]
Change-Id: Ibe8902beef286410707a2a88e94b933b49975c85
2021-03-26 21:08:22 +05:30
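
The early-return pattern from the commit above can be sketched as follows. demo_dscal() is a hypothetical simplified routine, not the BLIS ?scal implementation; its return value (the number of elements scaled) is added here purely so the behavior can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Scales n elements of x by alpha; returns the number of elements touched. */
int demo_dscal(int n, double alpha, double *x, int incx)
{
    /* Early return: with nothing to scale, x is never dereferenced,
       so a null x is harmless in this case. */
    if (n <= 0)
        return 0;

    /* Only a non-empty problem requires a valid pointer and increment. */
    assert(x != NULL && incx > 0);

    for (int i = 0; i < n; ++i)
        x[(size_t)i * (size_t)incx] *= alpha;

    return n;
}
```

Placing the dimension check before any pointer use is the design point: the zero-size call becomes a no-op regardless of what the other arguments contain.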
Field G. Van Zee
159ca6f01a Made test/3/octave scripts robust to missing data.
Details:
- Modified the octave scripts in test/3 so that the script does not
  choke when one or more of the expected OpenBLAS, Eigen, or vendor data
  files is missing. (The BLIS data set, however, must be complete.) When
  a file is missing, that data series is simply not included on that
  particular graph. Also factored out a lot of the redundant logic from
  plot_panel_4x5.m into a separate function in read_data.m.
2021-03-24 15:57:32 -05:00
Field G. Van Zee
545e6c2f6d CHANGELOG update (0.8.1) 2021-03-22 17:42:33 -05:00
Field G. Van Zee
8535b3e11d Version file update (0.8.1) 2021-03-22 17:42:33 -05:00
Field G. Van Zee
e56d9f2d94 ReleaseNotes.md update in advance of next version. 2021-03-22 17:40:50 -05:00
Field G. Van Zee
ca83f955d4 CREDITS file update. 2021-03-22 17:21:21 -05:00
Field G. Van Zee
57ef61f6cd Merge branch 'master' of github.com:flame/blis 2021-03-19 13:05:43 -05:00
Field G. Van Zee
bf1b578ea3 Reduced KC on skx from 384 to 256.
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
  from 384 to 256. The maximum (extended) KC was also reduced
  accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
  this change.
2021-03-19 13:03:17 -05:00
Nicholai Tukanov
e7a4a8edc9 Fix calculation of new pb size (#487)
Details:
- Added missing parentheses to the i8 and i4 instantiations of the
  GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
2021-03-17 19:43:31 -05:00
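
Missing parentheses in a macro are a common source of this kind of size miscalculation. The sketch below is a generic illustration only; the macros are hypothetical and do not reproduce the actual expression in sandbox/power10/generic_gemm.c.

```c
#include <assert.h>

/* Intended: number of groups of size f needed to cover k elements.
   The first form lacks parentheses, so only "1 / f" is divided. */
#define DEMO_CEIL_DIV_BAD(k, f) (k + f - 1 / f)
#define DEMO_CEIL_DIV(k, f)     (((k) + (f) - 1) / (f))

/* The parenthesized form can be checked at compile time. */
static_assert(DEMO_CEIL_DIV(10, 4) == 3, "ceil(10 / 4) should be 3");

int demo_groups_bad(int k, int f) { return DEMO_CEIL_DIV_BAD(k, f); }
int demo_groups(int k, int f)     { return DEMO_CEIL_DIV(k, f); }
```

For k = 10 and f = 4 the unparenthesized form evaluates 10 + 4 - (1 / 4) and yields 14, whereas the corrected form yields 3.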
Field G. Van Zee
4493cf516e Redefined BLIS_NUM_ARCHS to update automatically.
Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
  value in the arch_t enum. This means that it no longer needs to get
  updated manually whenever new subconfigurations are added to BLIS.
  Also removed the explicit initial index assignment of 0 from the
  first enum value, which was unnecessary due to how the C language
  standard mandates indexing of enum values. Thanks to Devin Matthews
  for originally submitting this as a PR in #446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
  change.
2021-03-15 13:12:49 -05:00
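
The enum technique described in the commit above can be sketched as follows. The type demo_arch_t and its enumerators are hypothetical and far shorter than the real arch_t; only the pattern (a trailing count enumerator, no explicit 0 on the first value) is taken from the commit message.

```c
#include <assert.h>

typedef enum {
    DEMO_ARCH_SKX,      /* first enumerator is 0 without an initializer */
    DEMO_ARCH_ZEN,
    DEMO_ARCH_GENERIC,
    DEMO_NUM_ARCHS      /* last enumerator: equals the number of entries above */
} demo_arch_t;

static const char *demo_arch_names[] = { "skx", "zen", "generic" };

/* Fails to compile if an enumerator is added without a matching name. */
static_assert(sizeof(demo_arch_names) / sizeof(demo_arch_names[0]) == DEMO_NUM_ARCHS,
              "name table out of sync with demo_arch_t");

const char *demo_arch_name(demo_arch_t id)
{
    if ((int)id >= 0 && (int)id < DEMO_NUM_ARCHS)
        return demo_arch_names[id];
    return "unknown";
}
```

Adding a new enumerator before DEMO_NUM_ARCHS bumps the count automatically, which is the maintenance benefit the commit refers to.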
Field G. Van Zee
a4b73de84c Disabled _self() and _equal() in bli_pthread API.
Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
  introduced in d479654. These functions were disabled after I realized
  that they aren't actually needed yet. Thanks to Devin Matthews for
  helping me reason through the appropriate consumer code that will
  appear in BLIS (eventually) in a future commit. (Also, I could never
  get the Windows branch to link properly in clang builds in AppVeyor.
  See the comment I left in the code, and #485, for more info.)
2021-03-12 19:47:39 -06:00
Field G. Van Zee
f9d604679d Added _self() and _equal() to bli_pthread API.
Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
  and pthread_equal(). Implemented these two functions for all three cpp
  branches present within bli_pthread.c: systemless, Windows, and
  Linux/BSD.
2021-03-12 19:47:39 -06:00
Field G. Van Zee
fa9b3c8f6b Shuffled code in Windows branch of bli_pthreads.c.
Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
  defines the bli_pthreads API in terms of Windows API calls. Also added
  missing comments that mark sections of the API, which brings the code
  into harmony with other cpp branches (as well as bli_pthread.h).
2021-03-11 15:13:51 -06:00
Field G. Van Zee
95d4f3934d Moved cpp macro redef of strerror_r to bli_env.c.
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
  (in terms of strerror_s) from bli_thread.h to bli_env.c. It was
  likely left behind in bli_thread.h in a previous commit, when code
  that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
  find any other instance of strerror_r being used in BLIS, so I moved
  the #define directly to bli_env.c rather than place it in bli_env.h.)
  The code that uses strerror_r is currently disabled, though, so this
  commit should have no effect on BLIS.
2021-03-11 13:50:40 -06:00
Madan Mohan Manokar
4f19ef8339 Merge "3m_sqp vectorization" into amd-staging-milan-3.1 2021-03-10 02:03:23 -05:00
Madan mohan Manokar
a424e8b426 3m_sqp vectorization
1. bli_malloc modified to normal malloc and address alignment within 3m_sqp.
2. function added to pack A real,imag and sum.
3. function added to pack B real,imag and sum.
4. function added to pack C real,imag and beta handling.
5. sum and sub vectorized.

AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
2021-03-10 11:54:50 +05:30
Field G. Van Zee
8a3066c315 Relocated gemmsup_ref general stride handling.
Details:
- Moved the logic that checks for general stridedness in any of the
  matrix operands in a gemmsup problem. The logic previously resided
  near the top of bli_gemmsup_int(), which is the thread entry point
  for the parallel region of the current gemmsup implementation. The
  problem with this setup was that the code would attempt to reject
  problems with any general-strided operands by returning BLIS_FAILURE,
  and that return value was then being ignored by the l3_sup thread
  decorator, which unconditionally returns BLIS_SUCCESS. To solve this
  issue, rather than try to manage n return values, one from each of n
  threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
  move it any higher (e.g. bli_gemmsup()) because I still want the
  logic to be part of the current gemmsup handler implementation. That
  is, perhaps someone else will create a different handler, and that
  author wants to handle general stride differently. (We don't want to
  force them into a particular way of handling general stride.)
- Removed the general stride handling from bli_gemmtsup_int(), even
  though this function is inoperative for now.
- This commit addresses issue #484. Thanks to RuQing Xu for reporting
  this issue.
2021-03-09 17:52:59 -06:00
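
The control-flow change described in the commit above can be sketched as follows: the stride check runs once, before the parallel region, so its result can actually reach the caller. All names here are hypothetical stand-ins (they are not bli_gemmsup_ref() or the l3_sup thread decorator), and treating "general stride" as "neither stride is 1" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_SUCCESS, DEMO_FAILURE } demo_err_t;
typedef struct { long rs; long cs; } demo_strides_t;

/* Assumption: general stride means neither rows nor columns are contiguous. */
static int demo_has_general_stride(const demo_strides_t *m)
{
    return m->rs != 1 && m->cs != 1;
}

/* Stand-in for the parallel region, which reports no per-thread status. */
static void demo_run_threads(void) { }

/* Handler: reject unsupported operands before any threads are launched. */
demo_err_t demo_sup_handler(const demo_strides_t *a,
                            const demo_strides_t *b,
                            const demo_strides_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    if (demo_has_general_stride(a) ||
        demo_has_general_stride(b) ||
        demo_has_general_stride(c))
        return DEMO_FAILURE;

    demo_run_threads();
    return DEMO_SUCCESS;
}

/* Caller: fall back to the conventional path when the handler declines. */
const char *demo_gemm_path(const demo_strides_t *a,
                           const demo_strides_t *b,
                           const demo_strides_t *c)
{
    return demo_sup_handler(a, b, c) == DEMO_SUCCESS ? "sup" : "conventional";
}
```

Keeping the check inside the handler rather than in the caller mirrors the stated motivation: a different handler remains free to treat general stride in its own way.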
nphaniku
e3cc577ec1 AOCL Windows: 3.1 BLIS changes
1. Incorporated code review comments.
2. Updated Copyright to 2021.

AMD Internal : [CPUPL-1422]

Change-Id: I722b0f71daae029a3dcc2cbd029524ea39ca78e6
2021-03-09 17:35:57 +05:30
nphaniku
d78defa0fc AOCL Windows: 3.1 BLIS changes
1. CMake script changes for adding new files to the build.
2. Added uppercase support for a couple of APIs.
3. bool is not supported in clang, so defined it.

AMD Internal : [CPUPL-1422]

Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
2021-03-08 22:32:13 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for building with the Clang compiler.
2. CMake script changes for building test and testsuite based on the library type (ST/MT).
3. CMake script changes for testcpp and blastest.
4. Added Python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
Kiran Varaganti
12d13629f9 Fix Debug Trace Log in dgemm_ and zgemm_
Replaced "*MKSTR(ch)" in the DTL call "AOCL_DTL_LOG_GEMM_INPUTS(AOCL_DTL_LEVEL_TRACE_1, *MKSTR(ch)...)" with "D" and "Z" for dgemm_ and zgemm_ respectively to prevent printing wrong data-type.

[CPUPL-1449]

Change-Id: Ic91537189352bdb164411799e127de990a5c9a08
2021-03-02 15:16:21 +05:30
RuQing Xu
b8dcc5bc75 Fixed typed API definition for gemmt (#476)
Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in 
  frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
  defined identically to gemm, which was wrong because it did not
  take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
  Specifically, the document erroneously listed only a single transab
  parameter instead of transa and transb.
2021-03-01 16:58:24 -06:00
Ilknur
a0e4fe2340 Fixed double free() in level1v example (#482)
Details:
- In examples/tapi/00level1v.c, pointer 'z' was being freed twice and
  pointer 'a' was not being freed at all. This commit correctly frees
  each pointer exactly once.
2021-03-01 16:06:56 -06:00
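
The cleanup pattern behind the fix above can be sketched as follows. demo_level1v_buffers() is a hypothetical function that only borrows the pointer names 'a' and 'z' from the commit message; it is not the code from the example file.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates two buffers, uses them, and frees each exactly once.
   Returns 0 on success and -1 on allocation failure. */
int demo_level1v_buffers(size_t n)
{
    assert(n > 0);

    double *a = malloc(n * sizeof *a);
    double *z = malloc(n * sizeof *z);
    int status = -1;

    if (a != NULL && z != NULL) {
        for (size_t i = 0; i < n; ++i) {
            a[i] = (double)i;
            z[i] = 2.0 * a[i];
        }
        status = (z[n - 1] == 2.0 * (double)(n - 1)) ? 0 : -1;
    }

    /* One free per allocation; free(NULL) is a no-op, so this single
       exit path is also safe after a failed allocation. */
    free(a);
    free(z);
    return status;
}
```

Pairing each malloc with one free on a single exit path makes it easier to spot both a duplicated free and a missing one.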