2471 Commits

Author SHA1 Message Date
Madan mohan Manokar
6d6f746190 3m1 turning OFF
since 3m1 is turned off in bla_gemm.c, setting FALSE for 3m1 in bli_l3_ind_oper_st

AMD-Internal: [CPUPL-1592]
Change-Id: I80dfe7c993f9edfbf752b7351cfdaa22a9e60035
2021-05-26 10:06:54 +05:30
Kiran Varaganti
ae6b6a7b7c Merge "Fix a bug in bench_gemm.c" into amd-staging-milan-3.1 2021-05-25 05:00:05 -04:00
Meghana Vankadari
4446395047 Redirecting dgemv to axpyf based implementation for smaller sizes.
AMD-Internal: [CPUPL-1403]
Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994
2021-05-25 01:39:12 -04:00
satish kumar nuggu
82087773a0 Optimized single threaded dtrsm small for right cases
Details:

1. Added optimized dtrsm kernels for all 8 right side cases
   Below are a few notable optimizations that improved performance:

   a. Loading, transposing (for transa cases), packing and reusing
      of a01 block required for GEMM operation. The block size
      increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
      from one end of A to other end of triangular A
   b. Packing of 6 diagonal elements in one location helped to utilize
      cache line efficiently

      AMD-Internal: [CPUPL-1563]

Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
2021-05-25 01:09:50 -04:00
Meghana Vankadari
8c9a7c21b4 Optimized axpyf kernel for scomplex datatype
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv reduces to scalv of Y with beta. Added code to
  return early after scaling the Y vector by beta.

AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
2021-05-24 14:40:17 +05:30
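The two shortcuts described in commit 8c9a7c21b4 (early return when alpha = 0, and a dotv call when y has a single element) can be illustrated with a minimal sketch. This is not the BLIS code: it uses real single precision instead of scomplex for brevity, assumes column-major storage with no transpose and positive increments, and every `sketch_*` name is hypothetical.

```c
#include <stddef.h>

/* Scale y in place by beta. beta == 0 overwrites y with zeros, following
   the usual BLAS convention of not reading y in that case. */
static void sketch_scalv(size_t n, float beta, float *y, size_t incy)
{
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = (beta == 0.0f) ? 0.0f : beta * y[i * incy];
}

/* Plain dot product of two strided vectors. */
static float sketch_dotv(size_t n, const float *x, size_t incx,
                         const float *y, size_t incy)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i * incx] * y[i * incy];
    return acc;
}

/* y := beta*y + alpha*A*x for a column-major m x n matrix A (no transpose).
   Hypothetical illustration only; increments are assumed positive. */
void sketch_sgemv(size_t m, size_t n, float alpha,
                  const float *a, size_t lda,
                  const float *x, size_t incx,
                  float beta, float *y, size_t incy)
{
    /* Every path begins by scaling y with beta. */
    sketch_scalv(m, beta, y, incy);

    /* Shortcut 1: alpha == 0 means the A*x term vanishes, so the scaled y
       is already the result. */
    if (alpha == 0.0f)
        return;

    /* Shortcut 2: a single output element is one dot product between the
       only row of A (elements lda apart) and x. */
    if (m == 1) {
        y[0] += alpha * sketch_dotv(n, a, lda, x, incx);
        return;
    }

    /* General case: accumulate one column of A at a time. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[i * incy] += alpha * a[i + j * lda] * x[j * incx];
}
```

Both shortcuts skip the general loop entirely, which is where the saving for degenerate inputs comes from.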
Kiran Varaganti
492f54fb5e Fix a bug in bench_gemm.c
When op(A) or op(B) was a transpose, the leading dimensions of these matrices were being altered.
Commented out the statements "if(transa) lda = ..." (and likewise for matrix B), correcting this
mistake for both column and row storage.
Added a provision to call BLIS interfaces when row-major inputs are used.

Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
2021-05-24 12:59:28 +05:30
Dipal M Zambare
5f53d14971 Added bench utility for dotv and scalv APIs.
- Added bench utility for dotv and scalv APIs
   - Corrected logging for scalv to handle complex types
   - Corrected logging to remove transpose field from dotv logs

AOCL-Internal: [CPUPL-1577]
Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07
2021-05-21 10:00:32 +05:30
Dipal Madhukar Zambare
dac15bdb3f Merge "Added bench utility for ger API." into amd-staging-milan-3.1 2021-05-19 08:17:09 -04:00
Dipal Madhukar Zambare
b2f7c7f019 Merge "Fixed crash issue in bench utility for gemv API" into amd-staging-milan-3.1 2021-05-19 08:15:40 -04:00
Dipal M Zambare
413814fe70 Fixed crash issue in bench utility for gemv API
- incx and incy were not considered while allocating
    memory for x and y vectors.
  - Updated test data set

AMD-Internal: [CPUPL-1578]

Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05
2021-05-19 14:21:09 +05:30
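The allocation issue fixed in commit 413814fe70 comes down to how much storage a strided vector needs. A minimal sketch, assuming the standard BLAS convention that a vector of n logical elements with increment inc spans 1 + (n - 1) * |inc| stored elements; the function names are hypothetical and are not taken from the bench utility.

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of stored elements spanned by a BLAS-style vector with n logical
   elements and increment inc (the sign of inc only sets traversal order). */
size_t sketch_vec_storage_elems(size_t n, int inc)
{
    if (n == 0)
        return 0;
    long long step = inc;
    if (step < 0)
        step = -step;
    return 1 + (n - 1) * (size_t)step;
}

/* Allocate zero-initialised storage for such a vector of doubles.
   Returns NULL on allocation failure; the caller frees the result. */
double *sketch_alloc_strided_vec(size_t n, int inc)
{
    size_t elems = sketch_vec_storage_elems(n, inc);
    return calloc(elems ? elems : 1, sizeof(double));
}
```

Allocating only n elements while indexing with a stride larger than 1 writes past the end of the buffer, which is consistent with the crash the commit describes.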
Dipal M Zambare
0e82783f1c Added bench utility for ger API.
AOCL-Internal: [CPUPL-1577]
Change-Id: Icc7a4590f605d7273077a7d2a42d4ecbafed2248
2021-05-19 14:05:01 +05:30
Nallani Bhaskar
b2e68b9812 Merge "Added optimized single threaded dtrsm small for left cases" into amd-staging-milan-3.1 2021-05-19 00:47:56 -04:00
Nallani Bhaskar
3a2e4c3db8 Added optimized single threaded dtrsm small for left cases
Details:

1. Added optimized dtrsm kernels for all 8 left side cases
   Below are a few notable optimizations that improved performance:

   a. Loading, transposing (for transa cases), packing and reusing
      of a10 block required for GEMM operation. The block size
      increases from 0 to 8X(m-8) in steps of 8x8 while solving TRSM
      from one end of A to other end of triangular A
   b. Performing in-register transpose whenever required
   c. Packing of 8 diagonal elements in one location helped to utilize
      cache line efficiently

2. Enabled calling dtrsm small for smaller sizes at cblas level itself
   to avoid framework overhead, which is significant for very small
   sizes

3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun
   and manideep.kurumella@amd.com for implementing lut kernels
   using intrinsics.

4. Removed all older implementations of strsm that were not
   developed as per the guidelines; they can be referred to in
   older releases if required.

Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6
2021-05-18 16:16:00 +05:30
Dipal Madhukar Zambare
1605fea83e Merge "Re-merged the gemmt testsuite file." into amd-staging-milan-3.1 2021-05-18 04:20:11 -04:00
Dipal Madhukar Zambare
6d8e2a36b3 Merge "Fixed blastest failure for amd64 configuration" into amd-staging-milan-3.1 2021-05-18 03:57:18 -04:00
Dipal M Zambare
9e27065c2b Fixed blastest failure for amd64 configuration
- When building for amd64 configuration, small matrix support
     for dgemm is not enabled (yet). Functions supporting small matrix
     implementation are called even when small matrix support is
     disabled. Update code to prevent this.

AMD-Internal: [CPUPL-1575]
Change-Id: I3a1692e965679cfde44938b1d26951145c790aa0
2021-05-18 03:24:06 -04:00
Nallani Bhaskar
a59796ef16 Updated leading dimensions for transpose case in gemm bench
1. Updated lda, ldb based on trans flags
2. Updated deriving storage type using leading dimension
3. Cleanup and alignment
4. Included transpose and row major cases in inputgemm.txt

Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6
2021-05-14 15:26:20 +05:30
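How the smallest valid leading dimension depends on the transpose flag and the storage order, as touched on in commit a59796ef16, can be sketched as follows. This reflects the general BLAS/CBLAS rule, not the bench code itself; the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Smallest valid leading dimension for a matrix whose op(X) is
   rows_op x cols_op.

   - With trans set, the stored matrix is cols_op x rows_op.
   - Column-major storage: ld must cover the stored row count.
   - Row-major storage:    ld must cover the stored column count.
   BLAS additionally requires ld >= 1. */
size_t sketch_min_ld(bool row_major, bool trans,
                     size_t rows_op, size_t cols_op)
{
    size_t stored_rows = trans ? cols_op : rows_op;
    size_t stored_cols = trans ? rows_op : cols_op;
    size_t ld = row_major ? stored_cols : stored_rows;
    return ld > 0 ? ld : 1;
}
```

For gemm with op(A) of size m x k, calling `sketch_min_ld(false, true, m, k)` gives k, the lower bound on lda for a transposed A in column-major storage. A benchmark that replays logged calls may still need to use the logged lda as-is, since callers are free to pass a larger value.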
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
    number of threads for the current problem size.
  - This feature can be enabled/disabled by modifying the source
    code
  - This change adds support to enable/disable this feature during
    configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
Meghana Vankadari
a3600d395d Added bench app for syrk - input is a log file generated from AOCL_DTL
Change-Id: I25dd695dea267a89a5c666d66abc4b91a57956c8
2021-05-11 14:57:51 +05:30
Dipal Madhukar Zambare
2b80e8824a Merge "Added bench utility for gemv API." into amd-staging-milan-3.1 2021-05-11 01:09:22 -04:00
Dipal M Zambare
08424e8896 Added bench utility for gemv API.
AMD-Internal: [CPUPL-1558]
Change-Id: Iaba1aa164fa589fa7f5047f314b26a24c4c2c3a7
2021-05-10 15:01:47 +05:30
Nageshwar Singh
a88cb82cec Revert "Adding trans h support in bench_gemm.c"
This reverts commit 791903b31c.

Change-Id: I24403cced67ea9e851adb58a8bf01a3e17bb4e85
2021-05-07 04:11:30 -04:00
Meghana Vankadari
dc2d6ee763 Moved dynamic threading function from GEMMT to GEMM
Details:
- Current tuning for choosing optimal number of threads is done for
  GEMM.
- Dynamic thread calculation function was placed in gemmt code flow
  instead of gemm by mistake. Fixing it with this commit.

AMD-Internal: [CPUPL-1376]
Change-Id: Iccb42a7a617b9b4cdb4c4af9be21aa82aaaabbcc
2021-05-07 12:10:53 +05:30
Meghana Vankadari
33ddf2e448 Fixed blastest failure for haswell configuration
Details:
- Placed optimized version of BLAS DGEMM, ZGEMM definitions under
  BLIS_CONFIG_EPYC as they use gemm small which are defined only
  for zen family configurations.
- Added code to query and set cntx in gemv and trsv framework before
  cntx is referred for any function pointers to avoid querying
   from NULL pointer.

AMD-Internal: [CPUPL-1562]
Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9
2021-05-07 01:49:54 -04:00
Meghana Vankadari
eea347b02e Added dynamic threading support for GEMM SUP code path
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
  is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
  where "num_threads" is number of threads set by the application.
- num_threads is derived from the environment variable "OMP_NUM_THREADS"
  or "BLIS_NUM_THREADS", or from the bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
  global_rntm data structure remains unchanged in order to keep track of
  original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
  number of threads recommended by the application.

AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
2021-05-07 09:52:51 +05:30
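The contract described in commit eea347b02e (pick a thread count in [1, num_threads] from M, N, K, and change only a local copy of the runtime object) can be sketched as below. The heuristic here is invented purely for illustration: the commit does not state the actual tuning rule, and `sketch_rntm_t`, `SKETCH_FLOPS_PER_THREAD` and both functions are hypothetical stand-ins, not BLIS types or APIs.

```c
#include <assert.h>

/* Hypothetical stand-in for the runtime object that carries the
   application's requested thread count. */
typedef struct {
    int num_threads;
} sketch_rntm_t;

/* Made-up tuning constant: amount of work assumed to justify one thread. */
#define SKETCH_FLOPS_PER_THREAD 1000000.0

/* Estimate a thread count from the problem size, clamped to
   [1, max_threads]. The formula is an assumption for illustration. */
int sketch_optimal_threads(long m, long n, long k, int max_threads)
{
    assert(max_threads >= 1);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    double ideal = flops / SKETCH_FLOPS_PER_THREAD;
    if (ideal < 1.0)
        return 1;
    if (ideal > (double)max_threads)
        return max_threads;
    return (int)ideal;
}

/* Copy the global settings and adjust only the copy, so the original
   request from the application stays available for later calls. */
sketch_rntm_t sketch_local_rntm(const sketch_rntm_t *global_rntm,
                                long m, long n, long k)
{
    sketch_rntm_t local = *global_rntm;
    local.num_threads =
        sketch_optimal_threads(m, n, k, global_rntm->num_threads);
    return local;
}
```

Taking the global object by const pointer and returning a copy makes it explicit that the application's setting is never overwritten, matching the behaviour the commit message describes.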
Dipal M Zambare
29bfedad30 Re-merged the gemmt testsuite file.
- Verified merge of all gemmt related files
  - Corrected testsuite/src/test_gemmt.c

AMD-Internal: [CPUPL-1561]
Change-Id: I5fe03b8e3754e4ed96c927ef7570be6f9d4f528b
2021-05-06 18:08:28 +05:30
Kiran Varaganti
433f17b6cd bench_gemmt Bug Fix
Fix reading input parameters

Interchange the reading of n and k, first n appears and then k appears in the logs.
Added comments to explain the format of the input gemmt log.

Change-Id: I44c6081d4449ba210728bc089c4215d5eef18834
2021-05-06 14:54:15 +05:30
managalv
c420bd63e2 Enabled optimised packed routines on zen3
Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338
2021-05-06 01:25:08 -04:00
Madan mohan Manokar
c1fa9abe32 zgemm native path tuning
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.

AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
2021-05-06 01:15:35 -04:00
Dipal Madhukar Zambare
821fa267c9 Merge "Updated makefiles to fix issues introduced in merge" into amd-staging-milan-3.1 2021-05-05 23:42:15 -04:00
Meghana Vankadari
dc71602895 Merge "Added sup functionality for SYRK" into amd-staging-milan-3.1 2021-05-05 06:26:03 -04:00
Dipal M Zambare
7454cca9e7 Updated makefiles to fix issues introduced in merge
- Updated Makefile to include DTL files in library build
   - Updated Makefile to include cpp header file installation
   - Updated test/makefile to include extra API added by AMD team.

AMD-Internal: [CPUPL-1559]
Change-Id: I249c6935d5ff5fb645f9deec7e0218575484be13
2021-05-05 14:59:15 +05:30
Nallani Bhaskar
f917d826b5 Updated test application to work with row major cblas
Details:
1. Fixed reading leading dimensions in test_gemm.c based on row/col major
2. Reduced redundant code and adjusted alignment

Change-Id: I8ca8c81223386fc21c6cc7c1d8f8a2109c9f5343
2021-05-02 23:09:13 +05:30
Meghana Vankadari
1303732e83 Added sup functionality for SYRK
Details:
- Added bli_syrksup function that internally uses gemmt implementation.
- Modified OAPI of syrk to call SUP before proceeding to the
  conventional implementation.
- Copied gemmsup threshold function for syrk temporarily. Thresholds are
  yet to be derived for syrk.

Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d
2021-04-29 12:35:30 +05:30
Nallani Bhaskar
b239a5aee7 Bug fix in sgemmsup 1x16n kernel
Details:

Address increment was missing in bli_sgemmsup_rv_zen_asm_1x16 kernel
while storing output in column major order in beta zero case

JIRA: CPUPL-1548

Change-Id: I36269cd28de6fbef2256451e399f90f0437b0ce1
2021-04-28 21:33:30 +05:30
lcpu
7401effc03 BLIS:merge:
Fixed the merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
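The prime-number thread reduction mentioned in commit 7401effc03 can be sketched as follows. The commit message gives the condition (the count is prime and exceeds a threshold that defaults to 11) but not the size of the reduction, so lowering the count by one is an assumption here; the `sketch_*` names and `SKETCH_NT_MAX_PRIME` are hypothetical and only mirror the BLIS_NT_MAX_PRIME macro named in the message.

```c
#include <stdbool.h>

/* Mirrors the default of BLIS_NT_MAX_PRIME given in the commit message. */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test; adequate for thread counts. */
static bool sketch_is_prime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

/* Return the thread count to use for automatic factorization.
   Assumption: a large prime count is reduced by exactly one thread. */
int sketch_adjust_num_threads(int nt)
{
    if (nt > SKETCH_NT_MAX_PRIME && sketch_is_prime(nt))
        return nt - 1;
    return nt;
}
```

A prime count can only be factored as 1 x nt, so giving up one thread (for example 13 becoming 12) allows a two-dimensional split such as 3 x 4 across loops.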
Satish Kumar Nuggu
743732c939 Merge "Added CBLAS API interface and memory alignment check" into amd-staging-milan-3.1 2021-04-26 04:27:21 -04:00
satish kumar nuggu
c0d16f70e4 Added CBLAS API interface and memory alignment check
Details:
1. Added CBLAS API in test_gemm.c and test_trsm.c
2. Creating matrices with memory-aligned leading dimensions irrespective of dimension.
3. Shifting powers of 2 aligned memory to non-powers of 2 by adding extra cache line size at the end.

   AMD-Internal: [CPUPL-1500] [CPUPL-1450]

Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a
2021-04-26 13:44:30 +05:30
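One possible reading of items 2 and 3 in commit c0d16f70e4 is sketched below: round each leading dimension up to a whole number of cache lines, then add an extra cache line whenever the resulting byte stride is a power of two. The 64-byte cache line, the function names and the exact padding rule are assumptions for illustration, not details taken from test_gemm.c or test_trsm.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed cache line size in bytes. */
#define SKETCH_CACHE_LINE 64

static bool sketch_is_pow2(size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Return a padded leading dimension, in elements, for a matrix whose
   leading extent is dim elements of elem_size bytes each.
   elem_size must divide the cache line size (e.g. 4, 8 or 16). */
size_t sketch_padded_ld(size_t dim, size_t elem_size)
{
    assert(elem_size != 0 && SKETCH_CACHE_LINE % elem_size == 0);

    /* Step 1: round the byte stride up to a multiple of the cache line. */
    size_t bytes = dim * elem_size;
    bytes = (bytes + SKETCH_CACHE_LINE - 1) / SKETCH_CACHE_LINE
            * SKETCH_CACHE_LINE;

    /* Step 2: move power-of-two strides off the power of two by appending
       cache lines (a loop, because 64 + 64 is again a power of two). */
    while (sketch_is_pow2(bytes))
        bytes += SKETCH_CACHE_LINE;

    return bytes / elem_size;
}
```

Power-of-two strides tend to map successive columns onto the same cache sets, so nudging the stride by one cache line is a common way to reduce conflict misses in benchmarks.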
Field G. Van Zee
6a4aa986ff Fixed typo in Table of Contents. 2021-04-23 13:10:01 -05:00
Field G. Van Zee
f6424b5b82 Added dedicated Performance section to README.md.
Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
  Documentation section into a new Performance section dedicated to
  those two links. (The previous entries remain redundantly listed
  within Documentation section.) Thanks to Robert van de Geijn for
  suggesting this change.
2021-04-23 13:08:06 -05:00
Madan mohan Manokar
f6088ac1cf Enabling 3m_sqp and 3m1 methods
1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.

AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810
2021-04-22 02:47:31 -04:00
Meghana Vankadari
d68f427ced Merge "Added bench app for gemmt - input is a log file generated from AOCL DTL" into amd-staging-milan-3.1 2021-04-22 00:20:16 -04:00
Devin Matthews
40ce5fd241 Merge pull request #493 from cassiersg/patch-1
Fix typo in FAQ.md
2021-04-21 09:54:25 -05:00
Gaëtan Cassiers
1f3461a5a5 Fix typo in FAQ.md 2021-04-21 16:49:05 +02:00
Mangala V
9e147912ee Merge "ZGEMM SUP: Removed unused assembly intructions" into amd-staging-milan-3.1 2021-04-19 03:08:31 -04:00
managalv
6c3741cd3e ZGEMM SUP: Removed unused assembly intructions
Removed memory operations that were unused
Modified labels to be unique within a file
Row-stride update is done once to avoid multiple mul instructions

AMD Internal  : [CPUPL-1419]

Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103
2021-04-19 20:31:03 +05:30
Meghana Vankadari
713ca659b5 Added bench app for gemmt - input is a log file generated from AOCL DTL
Change-Id: Ia3390b529244f529d9741c86a6f8dc35a589f714
2021-04-19 09:40:24 +05:30
Manideep Kurumella
3c2c8157c9 Merge "SGEMV performance improvement." into amd-staging-milan-3.1 2021-04-12 01:22:48 -04:00
mkurumel
f8525a888e SGEMV performance improvement.
1. bli_sdotxf_zen_int_8:
               Used the hadd_ps intrinsic instead of dp_ps to
               add partial dot outputs.

AMD Internal  : [CPUPL-1512]

Change-Id: I6e8e71a9cf8c1f30a1710dd1c67f193a998beb03
2021-04-12 10:47:23 +05:30
Madan mohan Manokar
997133dc11 sup zgemm improvement
1. In zgemm, mkernel outperforms nkernel for both m > n, and n > m.
2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on analysis done.

Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]
2021-04-09 01:45:04 -04:00