2496 Commits

Author SHA1 Message Date
Nallani Bhaskar
75f72b7f6e Added aocl dynamic feature for dtrsm for small sizes
Details:
1. Added aocl-dynamic for dtrsm native path
   When (m,n)<512, better performance is observed with nthreads=4
2. Updated trsm_small threshold to (m+n)<320, where
   trsm_small performs better than native irrespective of the
   number of threads
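
The two thresholds above can be read as the selection logic sketched below. This is a minimal Python model, not the BLIS C code: the helper name choose_dtrsm_path, the returned labels, the reading of "(m,n)<512" as "m < 512 and n < 512", and the cap at the requested thread count are all assumptions made for illustration.

```python
def choose_dtrsm_path(m, n, nthreads):
    """Return (path, threads) for a dtrsm call, following the thresholds in this commit.

    Hypothetical helper: names and return values are illustrative only.
    """
    # Item 2: trsm_small beats native when m + n < 320, whatever the thread count.
    if m + n < 320:
        return ("trsm_small", nthreads)
    # Item 1: on the native path, small problems ran best with 4 threads.
    # Assumption: never use more threads than the application requested.
    if m < 512 and n < 512:
        return ("native", min(nthreads, 4))
    # Larger problems keep the requested thread count.
    return ("native", nthreads)
```

For example, choose_dtrsm_path(400, 400, 16) selects the native path with 4 threads, while choose_dtrsm_path(100, 100, 16) selects trsm_small.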

Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
2021-06-18 08:46:47 -04:00
Chandrashekara KR
d7377f967c Merge "AOCL-Windows: Update BLIS build system" into amd-staging-milan-3.1 2021-06-17 08:49:55 -04:00
Kiran Varaganti
d26089c665 Multi-threaded BLIS - OpenMP
Apart from BLIS_NUM_THREADS or OMP_NUM_THREADS, the number of threads can also be set by the application
by calling omp_set_num_threads(int). In the function bli_thread_init_rntm_from_env(), when these environment
variables are not set, the number of threads is inferred by calling the API omp_get_max_threads().
So by default, if neither OMP_NUM_THREADS nor BLIS_NUM_THREADS is set, BLIS runs with omp_get_max_threads() threads.
This feature is enabled only when BLIS is configured with OpenMP parallelization.
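
The fallback order described above can be modelled as follows. This is a hedged Python sketch rather than the actual C code in bli_thread_init_rntm_from_env(): the function name resolve_num_threads, the dictionary standing in for the environment, and the assumption that BLIS_NUM_THREADS is consulted before OMP_NUM_THREADS are illustrative. The omp_max_threads argument stands in for the value omp_get_max_threads() would return, which reflects any earlier omp_set_num_threads() call.

```python
def resolve_num_threads(env, omp_max_threads, openmp_enabled=True):
    """Pick the thread count from the environment, else from the OpenMP runtime.

    Hypothetical helper; env is a plain dict of environment variables.
    """
    # Assumption: BLIS_NUM_THREADS takes precedence over OMP_NUM_THREADS.
    for name in ("BLIS_NUM_THREADS", "OMP_NUM_THREADS"):
        value = env.get(name)
        if value is not None:
            return int(value)
    # Neither variable is set: fall back to the OpenMP runtime's maximum,
    # but only when the library was configured with OpenMP.
    if openmp_enabled:
        return omp_max_threads
    # -1 marks "not set", as the rntm num_threads field does.
    return -1
```

With an empty environment and an OpenMP build, resolve_num_threads({}, 16) returns 16.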

Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6
2021-06-17 05:17:37 -04:00
Meghana Vankadari
d5ff5e5f50 Added dynamic threading support for SYRK SUP code path
Details:
- when AOCL dynamic is enabled, the decision to choose ST Vs MT
  to solve SYRK is taken based on dimensions of matrices.
- Decisions to choose optimum number of threads will be updated in
  the subsequent commits.
- Only local copy of rntm is modified by AOCL Dynamic feature.
  global_rntm data structure remains unchanged in order to keep
  track of original number of threads set by application.
- Added an early-exit condition in bli_nthreads_optimum when nt = 1
  or nt = -1. This ensures that AOCL dynamic feature is not used when
  threading is set using BLIS_IC_NT or BLIS_JC_NT.
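
A rough Python model of the behaviour listed above, assuming the runtime settings are a dictionary with a num_threads field. The name syrk_dynamic_rntm and the is_small_problem flag are hypothetical; the real dimension-based decision lives in the C function bli_nthreads_optimum and is not reproduced here.

```python
def syrk_dynamic_rntm(global_rntm, is_small_problem):
    """Return a per-call copy of the runtime settings; the global one is never mutated."""
    local_rntm = dict(global_rntm)  # only this local copy may be changed
    nt = local_rntm["num_threads"]
    # Early exit: nt == 1 (nothing to reduce) or nt == -1 (thread count unset,
    # e.g. threading was requested through BLIS_IC_NT / BLIS_JC_NT instead).
    if nt == 1 or nt == -1:
        return local_rntm
    # Hypothetical ST-vs-MT decision: small problems run single-threaded.
    if is_small_problem:
        local_rntm["num_threads"] = 1
    return local_rntm
```

Because the function works on a copy, the thread count originally set by the application stays available in global_rntm for later calls.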

Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0
2021-06-16 02:08:11 -04:00
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching of the next micro-panel of A and B in the dgemm block,
   which reduces load latency and improves performance.

2. Removed unnecessary unrolls in gemm loops and moved the 8x6 and 6x8 core
   dgemm into macros to make it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to be modularized.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Chandrashekara K R
f94e3ad237 AOCL-Windows: Update BLIS build system
1. Added support in cmake scripts for linking libomp for blis multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby APIs.
4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS.
5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CMakeLists.txt file.

AMD Internal : [CPUPL-1630]

Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
2021-06-15 16:49:08 +05:30
Kiran Varaganti
c2abbcab96 Fix dgemm_ Multi-thread running as Single Thread
Details:
When parallelization is enabled in BLIS through the environment variables BLIS_?C_NT or
BLIS_?R_NT, dgemm_ runs single-threaded. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1
irrespective of the BLIS_IC_NT or BLIS_JC_NT values. As a result, the dgemm_ interface assumes a single thread and calls
small_gemm, which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c. It returns 1 if parallelization is enabled through either the BLIS_?C_NT values or
BLIS_NUM_THREADS, and 0 if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call the parallel dgemm_ or the sequential one.
Added the same fix for zgemm_.
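
The check described in the fix can be sketched in Python as below. This is not the body of bli_thread_is_parallel(); the dictionary layout, the field names jc_nt, pc_nt, ic_nt, jr_nt and ir_nt, and the "greater than 1 means enabled" rule are assumptions chosen to illustrate why looking at num_threads alone was not enough.

```python
def thread_is_parallel(rntm):
    """Return 1 if any threading setting asks for more than one thread, else 0.

    rntm is a dict; a missing key or -1 means "not set" (hypothetical layout).
    """
    # Global thread count (BLIS_NUM_THREADS / OMP_NUM_THREADS).
    if rntm.get("num_threads", -1) > 1:
        return 1
    # Per-loop ways of parallelism (BLIS_?C_NT / BLIS_?R_NT).
    loop_keys = ("jc_nt", "pc_nt", "ic_nt", "jr_nt", "ir_nt")
    if any(rntm.get(key, -1) > 1 for key in loop_keys):
        return 1
    return 0
```

In the failing scenario from this commit, num_threads is -1 while ic_nt is set, so thread_is_parallel({"num_threads": -1, "ic_nt": 4}) reports 1 and the caller would take the parallel path.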

Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
2021-06-15 12:14:11 +05:30
Nageshwar Singh
3002239f83 Added bench utility for swapv API
AMD-Internal: [CPUPL-1591]
Change-Id: I5619d402db49d1f325e4293f3be7a8bc0dde6f15
2021-06-09 17:05:00 +05:30
Nageshwar Singh
6ca50e1b72 Added bench utility for copyv API
AOCL-Internal: [CPUPL-1591]
Change-Id: I00ddad565cb87cd9371d7b1df2b57394fef437e0
2021-06-09 12:29:49 +05:30
satish kumar nuggu
8885136786 Added prefetch in gemm module for single threaded dtrsm small for right cases
Details:

1. Adding prefetch in the gemm module gave an average gain of 10% in dtrsm right cases.
2. For skinny sizes with m<=2000 and n<=1000, performance is equivalent to MKL.

Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605
2021-06-08 17:39:23 +05:30
Nageshwar Singh
6842c2a30e Bench trsv logging error
Details
  - Passing enum rather than char for uplo, transa, and diaga
  - Deleting log file, and other temp files, merged in the codebase from amax

AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
2021-06-08 11:54:55 +05:30
Dipal Madhukar Zambare
1638ff7605 Merge "DTL logs corrections" into amd-staging-milan-3.1 2021-06-06 23:20:22 -04:00
Nageshwar Singh
61b7584580 Bench addition for amaxv API
AOCL-Internal: [CPUPL-1591]
Change-Id: Ia9754dfed1a7302d5c267858f9005c8f64e28b46
2021-06-04 17:45:04 +05:30
Nageshwar Singh
ecfbdd16a8 Added bench utility for trsv API
AOCL-Internal: [CPUPL-1591]
Change-Id: I5953e13e9c75f620987ea92d92d1b1d7b5bfd043
2021-06-04 08:05:37 -04:00
Dipal M Zambare
2f344f5df1 DTL logs corrections
-- Fixed issues in printing the values of
   side, uploa and diaga parameters for
   hemm, hemv, her, her2, her2k, herk,
   symm, symv, syr, syr2, syr2k, syrk,
   trmm, trmv, trsm, trsv.
-- For the above APIs, logging was called with MKSTR()
   for side, uploa and diaga parameters. MKSTR is
   needed only for macro arguments, not
   for function arguments.
-- Added space between function name and data type
   where it was missing. Bench expects logs in
   this format.

AMD-Internal: [CPUPL-1585]
Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d
2021-06-04 15:24:13 +05:30
Dipal M Zambare
849e1cee0a Updated version number to 3.0.1.
Change-Id: I07d5c26bb96b590854e1f81d41ed49a5e320f60e
2021-06-03 15:48:05 +05:30
Nagarapu Phanikumar
7ea32e6d0b Merge " Unifying BLIS Windows and Linux codebase" into amd-staging-milan-3.1 2021-06-03 06:03:26 -04:00
nphaniku
2bdee3cd6c Unifying BLIS Windows and Linux codebase
1. Removed dependency on bli_config.h inclusion in blis.h
2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags.
3. CMAKE changes to incorporate new changes as per 3.1 code base.
4. Removed zen2 folder from Windows directory.

AMD Internal : [CPUPL-1532]

Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47
2021-06-03 15:28:10 +05:30
mkurumel
9afbb11b4f DTL Logging bug in GEMV
Details :
  - Fixed Incorrect Macro used in dgemv and cgemv Trace logging exit.

AMD-Internal: [CPUPL-1403]
Change-Id: Icac502d8d4adad112754d9c764a30d3db56a743f
2021-06-02 21:21:00 +05:30
mkurumel
99e3bce065 SGEMV : single Precision axpyf kernel optimization for SGEMV
Details :
  - Implemented saxpyf kernel with fuse factor=6 for sgemv.

AMD-Internal: [CPUPL-1403]
Change-Id: I72fd30c08a789603267cf58910138549d45d231a
2021-06-02 07:55:48 -04:00
Nageshwar Singh
2e1a5bc1dd Optimized double complex axpyf kernel for zgemv
Details:
  - Implemented zaxpyf kernel with fuse factor=4 for zgemv.
  - Modified BLAS interface call for zgemv to reduce framework overhead.
  - Directed gemv to dotv in the case where dimension of y vector is 1.
  - when alpha = 0, gemv becomes scalv of Y with beta. Added code to
    return early after scaling Y vector with beta.
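
The two shortcuts listed above (alpha = 0 and a y vector of dimension 1) can be illustrated with a small pure-Python gemv for the no-transpose case. This is a sketch of the control flow only, assuming A is given as a list of rows; the function name gemv_notrans is hypothetical and none of the vectorized axpyf kernel is modelled.

```python
def gemv_notrans(alpha, a, x, beta, y):
    """Compute y := beta*y + alpha*A*x, with A given as a list of rows."""
    # Shortcut 1: alpha == 0 reduces gemv to scaling y by beta (scalv).
    if alpha == 0:
        return [beta * yi for yi in y]
    # Shortcut 2: a single-element y is one dot product (dotv).
    if len(y) == 1:
        dot = sum(aij * xj for aij, xj in zip(a[0], x))
        return [beta * y[0] + alpha * dot]
    # General case: accumulate one column of A at a time, axpy-style.
    out = [beta * yi for yi in y]
    for j, xj in enumerate(x):
        for i in range(len(y)):
            out[i] += alpha * a[i][j] * xj
    return out
```

The same function accepts Python complex numbers, which matches the double complex setting of this commit.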

AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
2021-06-01 18:03:29 +05:30
Meghana Vankadari
3804e301c9 Fixed a bug in Level-3 bench files where ldc = 1
Details:
- To determine whether matrices are col-stored, we were checking
  ldc == 1. This is incorrect, as a matrix can be col-stored with ldc = 1
  if its dimension is 1.
- Modified the condition to check row_stride instead of column stride.
  If row_stride != 1, we can assume that matrices are not col-stored
  and ignore those inputs by printing an error message.
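
The difference between the two checks can be seen in a small Python comparison. Both helper names are hypothetical, and the "old" version is one reading of the check this commit replaces (treating a column stride of 1 as proof that the matrix is not column-stored), not a copy of the bench code.

```python
def old_is_col_stored(row_stride, col_stride):
    # Previous heuristic (as read from the commit message): ldc == 1 was
    # taken to mean the matrix is not column-stored.
    return col_stride != 1


def new_is_col_stored(row_stride, col_stride):
    # New check: a column-stored matrix always has unit row stride.
    return row_stride == 1


# A column-stored matrix whose leading dimension is 1 (dimension 1).
tiny = (1, 1)          # (row_stride, col_stride)
# A 4x3 row-stored matrix and a 4x3 column-stored matrix.
row_major = (3, 1)
col_major = (1, 4)
```

For the tiny case the old check rejects a valid column-stored input, while the new one accepts it; the two checks agree on the 4x3 examples.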

Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87
2021-06-01 10:57:18 +05:30
Kiran Varaganti
ff84d37930 Merge "SUP GEMM - Enable only block panel (var2m)" into amd-staging-milan-3.1 2021-05-31 06:46:04 -04:00
Meghana Vankadari
887ecb46e0 Added threshold logic for SYRK
Details:
- Added decision logic to choose between SUP and native implementations
  of SYRK for zen2 architectures.
- For architectures other than zen2 it will be redirected to gemm
  threshold function.

Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d
2021-05-31 05:34:38 -04:00
Kiran Varaganti
aa9f5b8b37 SUP GEMM - Enable only block panel (var2m)
Completely disabled supvar1n (panel-block) gemm to simplify things.
supvar1n performs better only when m is very large and n and k are small (<10). This
simplification will improve performance for m = n shaped dgemm.

Change-Id: I523fcb211e8ab92718ea7367f9707a38275e24b1
2021-05-30 21:22:44 +05:30
Madan mohan Manokar
6d6f746190 3m1 turning OFF
Since 3m1 is turned off in bla_gemm.c, set 3m1 to FALSE in bli_l3_ind_oper_st.

AMD-Internal: [CPUPL-1592]
Change-Id: I80dfe7c993f9edfbf752b7351cfdaa22a9e60035
2021-05-26 10:06:54 +05:30
Kiran Varaganti
ae6b6a7b7c Merge "Fix a bug in bench_gemm.c" into amd-staging-milan-3.1 2021-05-25 05:00:05 -04:00
Meghana Vankadari
4446395047 Redirecting dgemv to axpyf based implementation for smaller sizes.
AMD-Internal: [CPUPL-1403]
Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994
2021-05-25 01:39:12 -04:00
satish kumar nuggu
82087773a0 Optimized single threaded dtrsm small for right cases
Details:

1. Added optimized dtrsm kernels for all 8 right side cases
   Below are a few notable optimizations which improved performance

   a. Loading, transposing (for transa cases), packing and reusing
      of a01 block required for GEMM operation. The block size
      increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
      from one end of A to other end of triangular A
   b. Packing of 6 diagonal elements in one location helped to utilize
      cache line efficiently

AMD-Internal: [CPUPL-1563]

Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
2021-05-25 01:09:50 -04:00
Meghana Vankadari
8c9a7c21b4 Optimized axpyf kernel for scomplex datatype
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv becomes scalv of Y with beta. Added code to
  return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
2021-05-24 14:40:17 +05:30
Kiran Varaganti
492f54fb5e Fix a bug in bench_gemm.c
When op(A) or op(B) = transpose, the leading dimensions of these matrices were altered.
Commented out the statements "if(transa) lda = ..." and the similar ones for matrix B, correcting this
mistake for both column and row storage.
Added a provision to call BLIS interfaces when row-major inputs are used.

Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
2021-05-24 12:59:28 +05:30
Dipal M Zambare
5f53d14971 Added bench utility for dotv and scalv APIs.
- Added bench utility for dotv and scalv APIs
- Corrected logging for scalv to handle complex types
- Corrected logging to remove transpose field from dotv logs

AOCL-Internal: [CPUPL-1577]
Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07
2021-05-21 10:00:32 +05:30
Dipal Madhukar Zambare
dac15bdb3f Merge "Added bench utility for ger API." into amd-staging-milan-3.1 2021-05-19 08:17:09 -04:00
Dipal Madhukar Zambare
b2f7c7f019 Merge "Fixed crash issue in bench utility for gemv API" into amd-staging-milan-3.1 2021-05-19 08:15:40 -04:00
Dipal M Zambare
413814fe70 Fixed crash issue in bench utility for gemv API
- incx and incy were not considered while allocating
  memory for x and y vectors.
- Updated test data set
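
For context, a strided BLAS vector of n logical elements spans 1 + (n - 1) * |inc| storage slots, so allocating only n elements overruns the buffer whenever |inc| > 1. The Python helper below is a small illustration of that sizing rule; the name vector_storage_len is hypothetical and is not taken from the bench sources.

```python
def vector_storage_len(n, inc):
    """Number of array elements needed to hold n entries spaced abs(inc) apart."""
    if n <= 0:
        return 0
    # First element at offset 0, last element at offset (n - 1) * |inc|.
    return 1 + (n - 1) * abs(inc)
```

For example, a vector of 5 elements with incx = 2 needs 9 slots, not 5.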

AMD-Internal: [CPUPL-1578]

Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05
2021-05-19 14:21:09 +05:30
Dipal M Zambare
0e82783f1c Added bench utility for ger API.
AOCL-Internal: [CPUPL-1577]
Change-Id: Icc7a4590f605d7273077a7d2a42d4ecbafed2248
2021-05-19 14:05:01 +05:30
Nallani Bhaskar
b2e68b9812 Merge "Added optimized single threaded dtrsm small for left cases" into amd-staging-milan-3.1 2021-05-19 00:47:56 -04:00
Nallani Bhaskar
3a2e4c3db8 Added optimized single threaded dtrsm small for left cases
Details:

1. Added optimized dtrsm kernels for all 8 left side cases
   Below are a few notable optimizations which improved performance

   a. Loading, transposing (for transa cases), packing and reusing
      of a10 block required for GEMM operation. The block size
      increases from 0 to 8X(m-8) in steps of 8x8 while solving TRSM
      from one end of A to other end of triangular A
   b. Performing in-register transpose whenever required
   c. Packing of 8 diagonal elements in one location helped to utilize
      cache line efficiently

2. Enabled calling dtrsm small for smaller sizes at cblas level itself
   to avoid framework overhead, which is significant for very small
   sizes

3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun
   and manideep.kurumella@amd.com for implementing lut kernels
   using intrinsics.

4. Removed all older implementations of strsm which were not
   developed as per the guidelines; they can be referred to in
   older releases if required.

Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6
2021-05-18 16:16:00 +05:30
Dipal Madhukar Zambare
1605fea83e Merge "Re-merged the gemmt testsuite file." into amd-staging-milan-3.1 2021-05-18 04:20:11 -04:00
Dipal Madhukar Zambare
6d8e2a36b3 Merge "Fixed blastest failure for amd64 configuration" into amd-staging-milan-3.1 2021-05-18 03:57:18 -04:00
Dipal M Zambare
9e27065c2b Fixed blastest failure for amd64 configuration
- When building for amd64 configuration, small matrix support
  for dgemm is not enabled (yet). Functions supporting small matrix
  implementation were called even when small matrix support is
  disabled. Updated the code to prevent this.

AMD-Internal: [CPUPL-1575]
Change-Id: I3a1692e965679cfde44938b1d26951145c790aa0
2021-05-18 03:24:06 -04:00
Nallani Bhaskar
a59796ef16 Updated leading dimensions for transpose case in gemm bench
1. Updated lda, ldb based on trans flags
2. Updated deriving storage type using leading dimension
3. Cleanup and alignment
4. Included transpose and row major cases in inputgemm.txt
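
As background for the leading-dimension update, the standard BLAS rule is that a leading dimension must cover the stored (not the logical) shape of each matrix. The Python sketch below computes the minimum lda, ldb and ldc for C(m x n) = op(A) * op(B); the helper name min_leading_dims is hypothetical and the code is an illustration of the rule, not an excerpt from bench_gemm.c.

```python
def min_leading_dims(m, n, k, transa, transb, col_major=True):
    """Return the smallest valid (lda, ldb, ldc) for C(m x n) = op(A) * op(B)."""
    # Stored shapes: A is m x k unless transposed, B is k x n unless transposed.
    a_rows, a_cols = (k, m) if transa else (m, k)
    b_rows, b_cols = (n, k) if transb else (k, n)
    if col_major:
        # Column-major: the leading dimension spans the rows of the stored matrix.
        return max(1, a_rows), max(1, b_rows), max(1, m)
    # Row-major: the leading dimension spans the columns of the stored matrix.
    return max(1, a_cols), max(1, b_cols), max(1, n)
```

For m = 4, n = 5, k = 6 in column-major storage, the minimums are (4, 6, 4) without transposes and (6, 5, 4) when both inputs are transposed.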

Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6
2021-05-14 15:26:20 +05:30
Dipal M Zambare
21130ebece Added configure option for AOCL Dynamic feature.
- AOCL Dynamic feature is added in BLIS which determines optimal
  number of threads for the current problem size.
- This feature can be enabled/disabled by modifying the source
  code
- This change adds support to enable/disable this feature during
  configuration time by adding a new option in configure script

AOCL-Internal : [CPUPL-1565]

Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
2021-05-12 00:41:13 -04:00
Meghana Vankadari
a3600d395d Added bench app for syrk - input is a log file generated from AOCL_DTL
Change-Id: I25dd695dea267a89a5c666d66abc4b91a57956c8
2021-05-11 14:57:51 +05:30
Dipal Madhukar Zambare
2b80e8824a Merge "Added bench utility for gemv API." into amd-staging-milan-3.1 2021-05-11 01:09:22 -04:00
Dipal M Zambare
08424e8896 Added bench utility for gemv API.
AMD-Internal: [CPUPL-1558]
Change-Id: Iaba1aa164fa589fa7f5047f314b26a24c4c2c3a7
2021-05-10 15:01:47 +05:30
Nageshwar Singh
a88cb82cec Revert "Adding trans h support in bench_gemm.c"
This reverts commit 791903b31c.

Change-Id: I24403cced67ea9e851adb58a8bf01a3e17bb4e85
2021-05-07 04:11:30 -04:00
Meghana Vankadari
dc2d6ee763 Moved dynamic threading function from GEMMT to GEMM
Details:
- Current tuning for choosing optimal number of threads is done for
  GEMM.
- Dynamic thread calculation function was placed in gemmt code flow
  instead of gemm by mistake. Fixing it with this commit.

AMD-Internal: [CPUPL-1376]
Change-Id: Iccb42a7a617b9b4cdb4c4af9be21aa82aaaabbcc
2021-05-07 12:10:53 +05:30
Meghana Vankadari
33ddf2e448 Fixed blastest failure for haswell configuration
Details:
- Placed optimized version of BLAS DGEMM, ZGEMM definitions under
  BLIS_CONFIG_EPYC as they use gemm small which are defined only
  for zen family configurations.
- Added code to query and set cntx in gemv and trsv framework before
  cntx is referred for any function pointers to avoid querying
  from NULL pointer.

AMD-Internal: [CPUPL-1562]
Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9
2021-05-07 01:49:54 -04:00
Meghana Vankadari
eea347b02e Added dynamic threading support for GEMM SUP code path
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
  is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
  where "num_threads" is number of threads set by the application.
- num_threads is derived from either the environment variables OMP_NUM_THREADS
  or BLIS_NUM_THREADS, or the bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
  global_rntm data structure remains unchanged in order to keep track of
  original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
  number of threads recommended by the application.
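
The range rule and the SUP/native split above can be summarised in a few lines of Python. The function name dynamic_num_threads and the estimate argument are hypothetical: the commit does not spell out how the estimate is derived from (M,N,K), so it is passed in rather than computed.

```python
def dynamic_num_threads(num_threads, estimate, is_sup_path):
    """Thread count for one DGEMM call under the AOCL_DYNAMIC rules listed above."""
    # Native path handles larger problems: keep what the application asked for.
    if not is_sup_path:
        return num_threads
    # SUP path: clamp the dimension-based estimate into [1, num_threads].
    return max(1, min(num_threads, estimate))
```

With 16 threads requested, an estimate of 4 on the SUP path yields 4, while an estimate of 64 is capped at 16.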

AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
2021-05-07 09:52:51 +05:30