Commit Graph

1254 Commits

Author SHA1 Message Date
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the new edge kernels that uses masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions avoid overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons :

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit(with regards to BLAS mandate) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub instrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Shubham Sharma
9607f207da AOCL Dynamic tuning for DAXPYV
- Existing logic is not picking the ideal number
  of threads for some problem sizes.
- Problem size and their corresponding ideal number
  of threads are retuned for daxpy in aocl dynamic.

AMD-Internal: [CPUPL-3484]
Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7
2023-08-01 10:34:04 +05:30
Shubham Sharma
954c97f858 Added NT in DTL logs for GEMMT, TRSM and NRM2
- Number of threads and gflops are added
  in the DTL logs for GEMMT, TRSM and NRM2

AMD-Internal: [CPUPL-2144]
Change-Id: If68887a5150bd0feda351180f379996497a1e678
2023-07-27 05:15:08 -04:00
Meghana Vankadari
79e174ff0a Level-3 triangular routines now use different block sizes and kernels.
Details:
    - Eliminated the need for override function in SUP for GEMMT/SYRK.
    - New set of block sizes, kernels and kernel preferences
      are added to cntx data structure for level-3 triangular routines.
    - Added supporting functions to set and get the above parameters from cntx.
    - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
      In case they are not set, use the default block sizes/kernels of
      Level-3 SUP.

AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
2023-07-26 01:26:11 -04:00
Harihara Sudhan S
ffbb0e83e5 ZGEMM optimization for cases when m = 1 or n = 1
- When n = 1 and A matrix is transposed ZGEMV row major variant is
  invoked.
- When m = 1 and B matrix is not transposed ZGEMV row major variant
  is invoked.
- This redirection happens before parallel ZGEMM check. This is done to
  avoid the unneccesary condition check. Any parallelization check is
  expected to happen in the invoked ZGEMV interface.

AMD-Internal: [CPUPL-2773]
Change-Id: I6b7b31db712edc682c089475d12e98730a960138
2023-06-30 04:54:42 -04:00
Edward Smyth
94a4abe2e5 BLIS: Incorrect ifdef in cblas.h and cblas_f77.h
Remove unnecessary ifdef BLIS_ENABLE_CBLAS statement from cblas.h
and cblas_f77.h. These were erroneously added when fixing the
--disable-blas functionality but are not needed in the CBLAS
headers, as these files will not be generated when BLAS or CBLAS
is disabled.

This is a fix to commit 5bd2a777ba

AMD-Internal: [CPUPL-3541]
Change-Id: If38bd795d31098a7023d575672b0a913338c0d2d
2023-06-07 06:52:57 -04:00
Mangala V
5f5bc24989 Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API
Prevented calling avx2 based bli_zgemm_ref_k1_nn code on
non-supported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().

Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.

AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
2023-05-21 23:13:46 +05:30
Harihara Sudhan S
9ee95e171a Control flow issue reported during static code analysis
- Missing break statement will result in unexpected control flow.
  This function will not launch the threads for the API in question
  according to the AOCL dynamic logic without the break statement.

AMD-Internal: [CPUPL-3436]
Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698
2023-05-18 04:53:03 -04:00
vignbala
9164427e86 Code cleanup: Mismatch in assembly macros
- In the bli_x86_asm_macros.h file, the set of vinsertf?x? and
  vextractf?x? instructions are facing macro expansion errors due to
  ambiguous macro redirection. The lower-case macro definitions of
  these instructions are not properly redirected to their corresponding
  upper-case macro definitions.

- This error occurs due to ambiguity in the upper-case macro name.
  At the place of lower-case macro definition, the redirection is to
  macros of the form VINSERTF?x? and VEXTRACTF?x?, while at the place
  of upper-case macro definition, they are of the form VINSERTF?X? and
  VEXTRACTF?X?. This causes a mismatch of the upper-case macro due to
  different case sensitive 'x' being used.

- This patch corrects this issue, by changing the lower-case 'x' to
  upper-case, among the upper case macros at the place of redirection.
  This provides uniformity and facilitates the expected macro-expansion.

AMD-Internal: [CPUPL-3276]
Change-Id: Id1f45f8e4bb083cd4b87632b713ff6baba616ff2
2023-05-04 08:49:58 -04:00
Harihara Sudhan S
a6621f1241 Incorrect accumulation of results in DDOTV
- When the number of threads launched is not equal to the
  number of threads requested the garbage value in the created
  buffer will not be overwritten by valid values.
- To handle the above scenario, the created temporary buffer is
  initialized with zeroes.

AMD-Internal: [CPUPL-3268]
Change-Id: I439a1da18eb1b380491fea14f42b0ede05ccf5a9
2023-05-04 10:44:15 +05:30
Harihara Sudhan S
828ac8e2dd Partial completion of work in L1 APIs
- Partial completion of compute was happening since BLIS was unable
  to launch the required number of threads. This was because rntm
  was returning a thread count greater than the maximum number of
  threads that can be launched in the subsequent parallel region.
- Added 'omp_get_num_threads' inside the parallel regions to get the
  actual number of threads spawned. The work distribution happens
  based on the actual number of threads launched in that region.

AMD-Internal: [CPUPL-3268]
Change-Id: I086ad4b9b644f966b7bab439e43222396f0c2bf0
2023-04-27 15:17:26 +05:30
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Arnav Sharma
4aace5f524 Smart Threading for SGEMM SUP for Zen4 Architecture
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
  threads based on logic similar to Zen3.
- Zen4 Architecture specific Native-to-SUP check has been added to
  redirect few Native inputs to the SUP path based on the fact that in a
  multi-threaded environment some Native cases perfom better as SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
  been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
  subsequently.

AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
2023-04-21 12:54:03 +05:30
Harsh Dave
b85b856950 Added Doxygen support for extension APIs.
Details:
- Added Doxyfile, a configuration file in docs directory for generating Doxygen document from source files.
- Currently only CBLAS interface of (Batched gemm and gemmt)extension APIs are included.
- Support for BLAS interface is yet to be added.
- To generate Doxygen based document for extension API, use given command.
  $ doxygen docs/Doxyfile

AMD-Internal: [CPUPL-3188]

Change-Id: I76e70b08f0114a528e86514bcb01d666acc591e8
2023-04-21 00:54:19 -04:00
Edward Smyth
b531022bac BLIS cpuid: distinguish submodels within a microarchitecture
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
  architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
  architectures, but function bli_check_valid_model_id() should
  check the provided model_id against the suitable range within
  the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
  checking of a user specified value of BLIS_ARCH_TYPE against
  the enabled configurations is delayed to a separate function,
  bli_arch_check_id().
- Default selection based on hardware can be overridden using the
  BLIS_MODEL_TYPE environment variable. Valid values are:
    Genoa, Bergamo, Genoa-X, Milan, Milan-X
  Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
  but will result in the default option for that architecture being
  selected. This is different to specifying an incorrect value of
  BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
  the --rename-blis-model-type argument to configure (or cmake
  equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
  --rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
  BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
  currently only for AMD cpus. Functions are provided to query
  these from other parts of the code, namely:
    uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()

AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
2023-04-20 10:47:44 -04:00
Meghana Vankadari
f788618f27 Setting AVX-512 specific blocksizes as default for L3 SUP for zen4 config
Details:
- Overriding of blocksizes with avx-2 specific ones(6x8) is done
  for gemmt/syrk because near-to-square shaped kernel performs
  better than skewed/rectangular shaped kernel.
- Overriding is done for S,D and Z datatypes.

AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
2023-04-20 08:52:34 -04:00
Mangala V
5dc8e3fbca AOCL progress callback pointer update per thread
Thanks to Moore, Branden <Branden.Moore@amd.com> for identifying the
race condition and suggesting the changes to fix the same

Existing Design:
- AOCL progress callback pointer is a global pointer which is shared
  across all threads

Existing Design challenges:
 - The callback function cannot safely disable the progress mechanism,
   as another thread may have already checked to see if the function
   pointer is set, and then re-reads the pointer upon invocation of
   the callback. If one thread sets the callback to NULL in this time,
   then the resulting thread will attempt to call the null pointer as a
   function pointer, leading to a segfault.

New Design :
- Each thread maintains a local copy of progress pointer

AMD-Internal: [SWLCSG-1971]

Change-Id: I282989805a4a2a8a759a7373b645f3569bf42ed4
2023-04-20 05:33:12 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
Edward Smyth
5c58cc0546 BLIS: ACML LAPACK test failures: ZDSCAL
Correct argument alpha in call to ZDSCAL kernel function in serial
code path. This resolves numerous instances of incorrect results
in ACML LAPACK test programs when BLIS_ARCH_TYPE=generic.

AMD-Internal: [CPUPL-3227]
Change-Id: Ibf5ee79392e80c2d93a0d336a7b0e2568e149f94
2023-04-18 11:15:36 -04:00
Harihara Sudhan S
9272d3c778 Bug fix in work load distribution among the given threads
- In level-1 kernels, with multi-threading enabled, only the partial
  job was getting executed.
- The bug was in bli_thread_vector_partition and occurred only
  when minimum work for a thread >= 1 i.e., when the number of threads
  launched is less than number of elements and the number of elements
  is not a multiple of the number of threads launched.

AMD-Internal: [CPUPL-3231]
Change-Id: Ie20abb93468282cd6ac2372267714fb80c26d7cc
2023-04-18 10:16:09 -04:00
Harihara Sudhan S
1a1559380e Added AVX512 functions to the BLAS layer
- Added AVX512 function's to the BLAS layer of daxpy, dscal and
  ddot.
- Added BLAS exceptions for incx <= 0 to DSCALV
- Added BLIS_KERNELS_ZEN4 macro check to guard AVX512 kernels
  as they will not be available in other contexts.

AMD-Internal: [CPUPL-2766][CPUPL-2765][CPUPL-2793][CPUPL-2800]
Change-Id: I68860c2ff6b65624907cc1b590173f0e909bd271
2023-04-18 04:13:00 -04:00
Shubham Sharma
036da2e651 Fixed compilation errors for generic configuration
- In gemmt and normf, #ifdef BLIS_KERNELS_* is added
  to make sure only compiled kernels are used.
- In bal_copy and bla_swap, missing '\' is added.

AMD-Internal: [CPUPL-2870]
Change-Id: I83452dff761f60db6957f557321ce210ab72c037
2023-04-18 00:27:05 -04:00
Meghana Vankadari
42d05a5aa0 DGEMM: Added decision logic to choose between sup vs native for zen4 architecture
Details:
- Added a new function for choosing between SUP and
  native implementation for a given size.
- This function pointer is stored in cntx for zen4 config.
- Divided total combinations of sizes into 3 categories:
  - one dimension is small
  - Two dimensions are small
  - All dimensions are small
- Added different threshold conditions for each of the
  categories.

AMD-Internal: [CPUPL-2755]
Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf
2023-04-17 13:08:34 -04:00
Aayush Kumar
71272ab574 .Fixed Compiler warnings for GCC 12 and AOCC 4.0
- Set the variables to zero to avoid the compiler warning
  (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
  bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
  bli_trsm_small_AVX512.c

- Changed the datatype from dim_t to siz_t for i,k,j
  in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
  avoid the compiler warning (-Waggressive-loop-optimizations)

AMD-Internal: [CPUPL-2870]

Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
2023-04-14 13:29:17 +00:00
Harihara Sudhan S
32bbd96652 Moving AOCL Dynamic logic from BLIS impli layer
Threading related changes
--------------------------

- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
  logic for a L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
  rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
  DDOTV. This function contains the AOCL dynamic logic for the
  respective kernels.
- Added handling for cases when number of elements (n) is less than
  number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
  amount of work the calling thread is supposed to perform on a
  vector.

Interface changes
-----------------

- In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick
  kernel based on architecture ID and removed AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
  and only the real part of the alpha passed is used inside the kernel.
  The change was done to facilitate kernel dispatch based on arch ID.
- Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV.
  Without this multithreaded code might crash because of minimum work
  calculation.

Misc
-----

- Removed unused variables from ZSCAL2V and AXPYV kernels.

AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
2023-04-14 02:05:38 -04:00
Harihara Sudhan S
15bd0f9646 Added AVX512 based double and float AXPYV
- Added AVX512 based double and float AXPYV which will be used in
  Zen4 context.
- Added n <= 0 check and alpha == 0 check to the BLAS layer of
  SAXPY.
- Modified BLAS framework of float AXPYV to remove flag check and
  pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
  BLIS_KERNELS_ZEN4 macro.

AMD-Internal: [CPUPL-2793]
Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1
2023-04-14 01:06:57 -04:00
Harsh Dave
9aff4f23b3 Dynamic threading for dgemm sup.
- Following sequence is followed for getting number of threads
  for given input.

  - Divides total range of input into 3 category(m < n, m > n, m = n)
  - For each range it is further divided into 4 sub category.(K <= 32,
    K <= 64, K <=128, K > 128)
  - As per the input range, number of threads is being decided.

AMD-Internal: [CPUPL-2966]

Change-Id: I0b04e9de1615e87acb189b228544afac74664f02
2023-04-13 10:57:43 -04:00
Aayush Kumar
6ad387c2aa Added DTRSM Small Path AVX512 based LUNN/LLTN Variant Kernels
- 8x8 kernels are used for DTRSM SMALL
- Matrix A(a10) is packed for GEMM operations.
- Packed martix A will be re-used in all the col-block
  along N-dimension.
- Diagonal elements of A matrix are packed(a11) for
  TRSM operations.
- Implemented fringe cases with below block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: I5bb57501f6d3783eb654e375d63901467dd14734
2023-04-13 01:44:31 -04:00
Harihara Sudhan S
6b8f4744a4 Added AVX512 based double and float DOTV
- Added AVX512 based double and float DOTV which will be used in
  Zen4 context.
- Added n <= 0 check to the BLAS layer of SDOTV.
- Modified BLAS framework of float DOTV to remove flag check and
  pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
  BLIS_KERNELS_ZEN4 macro.

AMD-Internal: [CPUPL-2800]
Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702
2023-04-12 12:36:52 +05:30
Harihara Sudhan S
be7fb342c1 Added AVX512 based double and float SCALV
- Added AVX512 based double and float SCALV which will be used in
  Zen4 context.
- Added incx <= 0 check and alpha == 1 check to the BLAS layer of
  SSCAL.
- Modified BLAS framework of float SCAL to remove flag check and
  pick kernels based on architecture ID.
- AVX512 kernel is disabled for other Zen configurations using
  BLIS_KERNELS_ZEN4 macro.

AMD-Internal: [CPUPL-2766],[CPUPL-2765]
Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc
2023-04-11 14:41:56 +05:30
Arnav Sharma
c14ce55bcb Added SGEMM SUP blocksize override to zen4 context
- Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for
  SGEMM. This can be updated once GEMMT specific optimization are added
  for AVX-512.
- Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override
  blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for
  SGEMM operation.

AMD-Internal: [CPUPL-3060]
Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a
2023-04-10 08:16:45 -04:00
Aayush Kumar
8c537b0cd5 Added DTRSM Small Path AVX512 based LLNN/LUTN Variant Kernels
- 8x8 kernels are used for DTRSM SMALL
- Implemented fringe cases with below block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: I58d28912bddbaadb404052c0f3449ebbe3c97b68
2023-04-07 08:50:28 +00:00
Edward Smyth
16592f79ee BLIS cpuid: better support Intel and future AMD processors
Where a processor has not been targeted for optimization, the slow
generic code path may be selected by default. Another code path
could perform much better, even if not specifically optimized for
this processor.

Changes to enable this:
- Always call bli_cpuid_query_id() to get actual hardware info, even
  if the user is overriding the choice at build time or by setting
  BLIS_ARCH_TYPE.
- Use bli_cpuid_query_id() to gather availability of AVX2 and AVX512
  instructions.
- Include AVX-512 fallback test to select zen4 code path on future
  AMD processors (that don't yet have a specific codepath).
- Also use AVX-512 and AVX2 tests on Intel processors to select
  zen4 or zen3 code paths over the generic code path. If both zen
  and skx/haswell code paths are enabled, the zen code path will be
  given preference as zen paths have additional optimizations that
  should, in general, benefit Intel processors too.

AMD-Internal: [CPUPL-3031]
Change-Id: Ib7d0ebdb02fec872f9443a1d20070026f2020516
2023-04-06 11:29:33 -04:00
Edward Smyth
1885540c5a Code cleanup: compiler warning fixes
Modify code to correct some warning messages from GCC 12.2 or
AOCC 4.0:
- Increase size of nbuf in blastest/f2c/endfile.c
- Remove unused variables in kernels/zen/1/bli_scal2v_zen_int.c
  and kernels/zen/1/bli_axpyv_zen_int10.c
- Remove extraneous parentheses in frame/compat/bla_trsm_amd.c
  and kernels/zen4/3/bli_zgemm_zen4_asm_12x4.c
- Add __attribute__ ((unused)) to several variables in
  frame/1m/packm/bli_packm_struc_cxk.c and
  frame/1m/packm/bli_packm_struc_cxk_md.c

AMD-Internal: [CPUPL-2870]
Change-Id: I595e46f0a3d737beb393c3ab531717565220b10d
2023-04-06 06:56:09 -04:00
vignbala
775ce1f13c Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM
- Implemented 12x4m column preferential SUP kernels(main and fringe
  cases). The main kernel dimension is 12x4, and the associated fringe
  kernel dimensions are : 12x3m, 12x2m, 12x1m
                          8x4, 8x3, 8x2, 8x1
                          4x4, 4x3, 4x2, 4x1
                          2x4, 2x3, 2x2, 2x1.

- Included in-register transposition support for C, thus extending
  the storage scheme supports to CCC, CCR, RCC and RCR inside the
  milli-kernel.

- Integrated conditional packing of A onto the SUP front end for
  dcomplex datatype. This redirects RRC and CRC storage schemes
  onto the preceding set of SUP kernels through storage scheme
  transformation(RCC and CCC respectively).

- Updated the zen4 context file with the new set of SUP kernels, to
  get enabled appropriately. Furthermore, the context file was updated
  with the AVX-2 dotxv signatures for dcomplex datatype. This redirects
  the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines.

- Added C prefetching onto L2-cache, and an unroll factor of 4 for the
  k loop in all the kernels.

- Work in progress to include conjugate support and input spectrum
  extension for the AVX-512 SUP kernels. The current thresholds in zen4
  context is the same as that of the zen3 thresholds for ZGEMM SUP.

AMD-Internal: [CPUPL-3122]

Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e
2023-04-06 04:49:15 -04:00
Edward Smyth
75c5fd1b66 Bug fixes in ZSCALV
Fixes for a couple of issues:
- Modify alpha argument in kernel call in zscal_blis_impl to
  avoid compiler warning message about discarding 'const'
  qualifier from pointer target type.
- When BLIS_ARCH_TYPE=generic, call to bli_cntx_get_l1v_ker_dt
  in ref_kernels/1/bli_scalv_ref.c for ZSCAL was causing a
  segmentation fault. This was because the cntx was NULL on
  entry to this function. Correct by getting zscal_blis_impl
  to pass the cntx it has initialized for non-Zen codepaths.
  This changes has been copied to idamax_blis_impl for
  consistency.

AMD-Internal: [CPUPL-2773]
Change-Id: Ib02e4c1a2a7bf30c208732241d4959f7a2696179
2023-04-05 13:00:58 -04:00
Harihara Sudhan S
2e6724262e ZGEMV var 2 bug fix
- Fixed segmentation fault that was seen on non zen and non avx2
  machines.
- cntx object was not passed to the invoked kernel causing a seg
  fault.

AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
2023-04-05 01:31:24 -04:00
Edward Smyth
9b9142644f BLIS cpuid: Bugfix for BLAS1/BLAS2 when BLIS_ARCH_TYPE is set
BLAS1 and BLAS2 routines may not immediately call bli_init_auto, as
the full cntx and other global data structures may not be required
for all code paths. This may cause a problem if the user sets
BLIS_ARCH_TYPE, as it needs to check the requested value against the
available options configured in cntx. Solution: add a separate
pthread_once call to run bli_gks_init().

AMD-Internal: [CPUPL-3031]
Change-Id: Icd73a8dd161b34b23cc336623d675248f28ed23f
2023-04-04 07:54:31 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Meghana Vankadari
fb6d4b5b8b BLAS compliance for Level-3 routines
Details:
- Added checks in top-level BLAS layer for alpha = zero. Scale C with
  beta or set C with 0 appropriately if true and return. This ensures
  that input matrices are not referenced, and thus that any Inf or NaN
  values in them are not propagated to output matrix.
- Scalm internally handles the case where both alpha and beta are zero
  by calling setm to populate the output matrix with zeroes.

AMD-Internal: [CPUPL-3134]
Change-Id: I14f14419049e7fcc55bf7fd30ddbd3d5a12c4427
2023-03-29 04:36:00 -05:00
Harsh Dave
c1766e312a Added in row storage support for C matrix.
- Added in-register transpose support for c matrix to
support row stored C matrix for dgemm sup.
- Support is added for all edge case kernels.
- FMA are made independent of each other, for faster
computation while storing data back to C matrix.

AMD-Internal: [CPUPL-2966]
Change-Id: I1d13af99a17ee66adbf5f537a4664ade489a7cad
2023-03-24 02:46:21 -04:00
Meghana Vankadari
31a4203c32 Added AVX-512 based col-preferred kernels for DGEMM with optional pack framework
- Main kernel is of size 24x8 and the associated fringe kernels
  added are
  - 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m
  - 24x8,  24x7,  24x6,  24x5,  24x4,  24x3,  24x2,  24x1
  - 16x8,  16x7,  16x6,  16x5,  16x4,  16x3,  16x2,  16x1
  -  8x8,   8x7,   8x6,   8x5,   8x4,   8x3,   8x2,   8x1

- For fringe kernels, 24x? kernel handles 16 < m_remainder <  24
                      16x? kernel handles 8  < m_remainder <= 16
                       8x? kernel handles 0  < m_remainder <= 8
- Added a function 'bli_zen4_override_gemm_blkszs' to override
  blocksizes and kernels to be used for SUP for supported storage
  schemes.
- Updated the zen4 config to enable these kernels in zen4 path.
- Thresholds are yet to be derived.
- Updated CMakeLists.txt with DGEMM SUP kernels for windows build.

Kernel-specific details:
- K-loop is unrolled by 8 times to facilitate prefetch of B.
- For every load of one column of A, the corresponding column in
  next panel of A is prefetched with T1 hint.
- One column of C is prefetched with T0 hint per iteration of LOOP2.
- TAIL_NITER is derived to be 3.
- For every unroll of k-loop, one row of B is prefetched with T0 hint.
- C-prefetching for row-storage is yet to be added.
- B-prefetching for col-storage is yet to be added.
- Support for C transpose is yet to added.

AMD-Internal: [CPUPL-2755], [CPUPL-2409]
Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48
2023-03-24 06:40:36 +00:00
Meghana Vankadari
253ceffb0f Redirecting scalm to setm when alpha is zero
Added a check in scalm framework for alpha=0.
Set the output matrix to zero when alpha=0.
This ensures that any Inf or NaN values in the matrix are not
propagated to the output matrix.

AMD-Internal: [CPUPL-3053]
Change-Id: I62b9b5405be220eb4df97aadda14701abcccb475
2023-03-24 00:59:43 -04:00
Eleni Vlachopoulou
ad7a812db2 Remove quick return for zero increments.
Details:
- To be BLAS compliant, if increment is zero then iterate through the first element n times.
- For n<=0, the correct result (0) is returned so we remove this extra check. This is checked on BLIS-typed interface level.

AMD-Internal: [SWLCSG-1900]
Change-Id: I098bb9560a790050018bc8d8c63b06bfbcc1aebd
2023-03-23 23:35:03 -04:00
Shubham
dfc95d29fc Enable DTRSM small multithreading path for BLAS interface
- Enabled DTRSM small mt for sizes where performance is better
  than small or native.
- Threshold Tuning for small path is updated.
- Function signature for bli_trsm_small_mt has been made similar
  to bli_trsm_small so that one function pointer can be used for
  all functions.
- Early return condition in DTRSM small for sizes > 1000 has been
  removed so that the sizes for which small path to take can be
  decided on bla layer instead of inside kernel.

AMD-Internal: [CPUPL-2735]
Change-Id: Ieea31343dc660517acc18c92713381a8b84d3a2f
2023-03-23 12:07:22 -04:00
Edward Smyth
873b4f93fd GEMM: Early return when alpha = zero
Add test in top-level GEMM BLAS layer for alpha = zero. Scale C
appropriately if true and return. This ensures that A and B are not
referenced, and thus that any Inf or NaN values in A or B are not
propagated to C. This was tested in some lower level code paths,
but not consistently.

Solution throughout GEMM codebase assumes scalm handles scale value
(beta from GEMM) equal to 0 without propagating inf/NaNs in C matrix.

Also call AOCL_DTL exits and bli_finalize_auto() in a more consistent
way.

AMD-Internal: [CPUPL-3053]
Change-Id: I4009f311951eb1ce9416cf846e9fa93b7c9219cc
2023-03-23 09:24:41 -04:00
Aayush Kumar
5bd2a777ba Fixed Compilation Fails when configured with --disable-blas
- Moved *_blis_impl function declaration outside the BLIS_ENABLE_BLAS
  guard.
- Changed Makefile to continue to compile bla_ files to get
  *_blis_impl interfaces.
- Modify CBLAS headers, bli_macro_defs.h and bli_util_api_wrap.{c,h}
  to add BLIS_ENABLE_CBLAS guards.
- Comment out BLIS_ENABLE_BLAS guards in various headers and utility
  functions.
- Define BLIS Fortran-style functions lsame_blis_impl and
  xerbla_blis_impl. New macros PASTE_LSAME and PASTE_XERBLA are
  used in bla_*_check headers and some other places to select
  whether to call lsame and xerbla, or the _blis_impl versions.
- Defined various other missing _blis_impl functions.
- In bli_util_api_wrap.c, only define any functions if
  BLIS_ENABLE_BLAS is defined, and only define the subroutine
  versions of functions like dot, nrm2, etc if BLIS_ENABLE_CBLAS
  is defined.
- BLAS layer is needed if CBLAS layer is enabled. Changed header
  files build/bli_config.h.in and bli_blas.h, and configure
  program to help ensure consistency in generated blis.h header
  and configure output.

Undefining BLIS_ENABLE_BLAS_DEFS appears to be broken in UTA BLIS
too, thus BLIS_ENABLE_BLAS_DEFS is currently permanently defined.

AMD-Internal: [CPUPL-3015]

Change-Id: I7c0fe07db85781db46f2c690e174451860b37635
2023-03-23 06:11:52 -04:00