2557 Commits

Author SHA1 Message Date
Dipal M Zambare
3364c0e4eb Binary and dynamic dispatch configuration name change
-- Reverted changes made to include lp/ilp info in binary name
     This reverts commit c5e6f885f0.

  -- Included BLAS int size in 'make showconfig'

  -- Renamed amdepyc configuration to amdzen

Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
2021-11-12 08:58:56 +05:30
Dipal M Zambare
ae844f475c Fixed build issue in DTL when only traces are enabled.
AMD-Internal: [CPUPL-1691]
Change-Id: Idc273666054529db5a2fb96a7d7ebbf7a3f5b008
2021-11-12 08:58:56 +05:30
Harsh Dave
53e1d0539f Fixed conjugate transpose kernel issue
Details:
AMD Internal Id: CPUPL-1702
- For the cases of A being of 1x1 dimension and of
left and right hand side, A's only element is conjugate
transposed by negating its imaginary component.

Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
2021-11-12 08:58:56 +05:30
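The fix above relies on the fact that the conjugate transpose of a 1x1 complex matrix is simply the complex conjugate of its only element. A minimal Python sketch of that idea follows; it is not the BLIS kernel code, and `conj_transpose_1x1` is a hypothetical name used only for illustration.

```python
def conj_transpose_1x1(a: complex) -> complex:
    """Conjugate transpose of a 1x1 matrix: negate the imaginary part."""
    return complex(a.real, -a.imag)


a = complex(3.0, 4.0)
result = conj_transpose_1x1(a)
```

Transposition has no effect on a single element, so only the sign of the imaginary component changes.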
Harsh Dave
2b6faf21a1 Fixed conjugate transpose issue for zscalv and cscalv
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needed to be negated as per the conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
2021-11-12 08:58:55 +05:30
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30
Madan mohan Manokar
d6fcfe7345 gemmt SUP: limit thread count for small sizes
1. Added a max thread cap for small dimensions based on the product n*k.

AMD-Internal: [CPUPL-1388]

Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057
2021-11-12 08:58:55 +05:30
Harsh Dave
590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
   when (m,n)<1000 and multithread (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
2021-11-12 08:58:55 +05:30
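The size thresholds quoted in the commit above can be read as a small dispatch rule. The Python sketch below is only an illustration: `use_trsm_small` is a hypothetical name, and it assumes that "(m,n)<1000" means both dimensions are below the limit.

```python
def use_trsm_small(m, n, num_threads, st_limit=1000, mt_limit=320):
    """Return True when the small-matrix path would be chosen (illustrative).

    Assumption: in the single-thread case both m and n must be below
    st_limit; in the multithreaded case the sum m + n must be below mt_limit.
    """
    if num_threads == 1:
        return m < st_limit and n < st_limit
    return (m + n) < mt_limit


single = use_trsm_small(999, 999, num_threads=1)
multi = use_trsm_small(200, 200, num_threads=8)
```

Keeping the limits as parameters makes it easy to plug in the thresholds of another data type.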
satish kumar nuggu
23278627f4 STRSM small kernel implementation
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- modularized all 16 strsm combinations into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details
- The axpyf-based implementation incurs function (axpyf) call overhead.
- The new implementation reduces this function call overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
Dipal M Zambare
8f310c3384 AOCL DTL - Added thread and execution time details in logs
-- Added number of threads used in DTL logs
    -- Added support for timestamps in DTL traces
    -- Added time taken by API at BLAS layer in the DTL logs
    -- Added GFLOPS achieved in DTL logs
    -- Added support to enable/disable execution time and
       gflops printing for individual API's. We may not want
       it for all API's. Also it will help us migrate API's
       to execution time and gflops logs in stages.
    -- Updated GEMM bench to match new logs
    -- Refactored aocldtl_blis.c to remove code duplication.
    -- Clean up logs generation and reading to use spaces
       consistently to separate various fields.
    -- Updated AOCL_gettid() to return correct thread id
       when using pthreads.

AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
2021-11-12 08:58:54 +05:30
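The commit above does not state how the logged GFLOPS value is computed. As a rough sketch, assuming the conventional count of 2*m*n*k floating-point operations for a real GEMM, the figure could be derived from the measured time as follows (`gemm_gflops` is a hypothetical helper, not a DTL function):

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for a real GEMM, assuming the usual 2*m*n*k operation count."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


rate = gemm_gflops(1000, 1000, 1000, seconds=1.0)
```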
Harsh Dave
a7f600b3a4 Implemented ztrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single thread
   when (m,n)<500 and multithread (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
2021-11-12 08:58:54 +05:30
Nallani Bhaskar
40e1fd1860 Enabled processing of zero inputs of x in tbsv to provide NaN
Details:
The basic idea is that the inverse of a singular matrix doesn't exist,
therefore we should be returning NaN. The BLAS standard and BLIS
optimize by not doing any compute when x[j] == 0. As a
result, BLIS generates finite values for the inverse
calculation of singular matrices, which is not
the right answer. This commit provides a fix to generate
NaN/Inf values in case this API is called to compute inverses
of singular matrices. According to the standard, however, this API
shouldn't be called in the first place; the check for singularity
or near singularity should be done by the calling application.

Change-Id: Iccdbc07744de3892626f4066ee4a63eb30bc06cd
2021-11-12 08:58:53 +05:30
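The effect described in the commit above can be reproduced with a small triangular solve. The Python sketch below uses a dense upper-triangular matrix instead of the banded storage that tbsv works on, and `ieee_div` and `solve_upper` are hypothetical helpers written for this illustration only.

```python
import math


def ieee_div(a, b):
    """Divide like IEEE 754 hardware instead of raising ZeroDivisionError."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def solve_upper(A, b, skip_zero_x):
    """Column-oriented back substitution for a dense upper-triangular A."""
    x = list(b)
    for j in range(len(x) - 1, -1, -1):
        if skip_zero_x and x[j] == 0.0:
            continue  # shortcut: no work is done for a zero entry of x
        x[j] = ieee_div(x[j], A[j][j])
        for i in range(j):
            x[i] -= x[j] * A[i][j]
    return x


A = [[2.0, 1.0], [0.0, 0.0]]  # singular: zero on the diagonal
b = [4.0, 0.0]
with_shortcut = solve_upper(A, b, skip_zero_x=True)
without_shortcut = solve_upper(A, b, skip_zero_x=False)
```

With the shortcut, the zero diagonal entry is never touched and the result looks like a valid finite solution; without it, the 0/0 division produces NaN, which then propagates and exposes the singularity.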
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
satish kumar nuggu
3bafdf3923 DGEMM kernel implementation for case k = 1.
Details :
      - DGEMM kernel implementation for case k = 1, vectorized with 8x6 block implementation (Rank-1 update in DGEMM Optimization).

Change-Id: I7d06378adeb8bcc5b965e2a94314d731629d0b4c
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1.  Added new kernel bli_dnorm2fv_unb_var1 to compute
	norm with dot operation.
    2.  Added vectorization to compute squares of a 32-double-element
	block from vector X.
    3.  Defined a new macro BLIS_ENABLE_DNRM2_FAST under config header
	to compute nrm2 using the new kernel.
    4.  The dot kernel definition and implementation may have
	accuracy issues. We can switch to the traditional implementation by
	disabling the macro BLIS_ENABLE_DNRM2_FAST to compute the L2-norm
	of vector X.

    AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
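One well-known risk of computing a norm as the square root of a dot product is that the squares can overflow or underflow even when the norm itself is representable. Whether this is the exact accuracy issue the commit above has in mind is an assumption. The Python sketch below contrasts the two approaches; `nrm2_dot` and `nrm2_scaled` are hypothetical names, not BLIS kernels.

```python
import math


def nrm2_dot(x):
    """L2 norm as sqrt(x . x): fast, but the squares can overflow."""
    return math.sqrt(sum(v * v for v in x))


def nrm2_scaled(x):
    """L2 norm with scaling by the largest magnitude to avoid overflow."""
    scale = max((abs(v) for v in x), default=0.0)
    if scale == 0.0:
        return 0.0
    return scale * math.sqrt(sum((v / scale) ** 2 for v in x))


big = [1e200, 1e200]
fast = nrm2_dot(big)       # the squares exceed the double range
safe = nrm2_scaled(big)    # stays finite
```

For well-scaled inputs both functions agree, which is why a compile-time switch between a fast and a traditional path is a reasonable design.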
mkurumel
9f1ce594a5 BLIS : Compiler warning fixes
Details :
  - Fixed warnings with AOCC and GCC compilers.

AMD-Internal: [CPUPL-1662]

Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce
2021-11-12 08:58:52 +05:30
Meghana Vankadari
9a5b15da68 Disabled dzgemm_ function call in test_gemm.c
Change-Id: I31da12cbaf50cf7fe44baf97de3c39896c4ccfb1
2021-11-12 08:58:52 +05:30
Meghana Vankadari
10ca8710f0 Optimized SUP code for GEMMT
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
  separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- modified kc selection logic to choose optimal KC in SUP
- Updated thresholds to choose between SUP and native.

Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
2021-11-12 08:58:52 +05:30
Nageshwar Singh
faeb79f2b9 Trsm bench utility mismatch between DTL logs and bench
AOCL-Internal: [CPUPL-1585]
Change-Id: I2896d695e6bb40ec39a4f840240499927de16962
2021-11-12 08:58:52 +05:30
Dipal M Zambare
5d287fdba0 Include LP64/ILP64 in BLIS binary name
Binary name will be chosen based on multi-threading and BLAS
  integer size configuration as given below.

  libblis-[mt]-lp64 - when configured to use 32 bit integers
  libblis-[mt]-ilp64 - when configured to use 64 bit integers

AMD-Internal: [CPUPL-1879]
Change-Id: I865023c63235a0a72bdfce7057b2cfb8158b1d87
2021-11-12 08:58:51 +05:30
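The naming scheme in the commit above can be written as a small function. This is a sketch under one assumption: the bracketed `[mt]` segment is read as optional and present only for multithreaded builds. `blis_library_name` is a hypothetical helper, not part of the BLIS build system.

```python
def blis_library_name(multithreaded, blas_int_bits):
    """Compose the binary name from threading mode and BLAS integer size."""
    suffix = {32: "lp64", 64: "ilp64"}[blas_int_bits]
    parts = ["libblis"]
    if multithreaded:
        parts.append("mt")  # assumed to appear only in multithreaded builds
    parts.append(suffix)
    return "-".join(parts)


name = blis_library_name(multithreaded=True, blas_int_bits=64)
```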
Dipal M Zambare
a8e47a82c0 Fixed build issue with clang in 'make checkcpp'
The makefile in vendor/testcpp folder has hardcoded g++
as cpp compiler and linker. Updated the makefile to take
the compiler which is used to build the library.

AMD-Internal: [CPUPL-1873]
Change-Id: Ib0bcbb8fccd0ff6f90b49b3b1e4a272cf3bad361
2021-11-12 08:58:51 +05:30
Kiran Varaganti
7196b86f05 Removed packm kernels of zen
Intrinsic-optimized packm kernels written for zen are no longer used,
so they are removed. Currently packm kernels from the haswell configuration
are being used for zen2 and zen3 configs.
2021-11-12 08:58:51 +05:30
Kiran Varaganti
5ecb4fd0db Removed dead code
This sanity check (checking top_index != 0), which had been disabled
earlier in bli_pool_reinit() by commenting out bli_abort(), was unnecessarily
computing top_index. The whole statement is now commented out.

Change-Id: If296754ca8cba3a69d023d4a7ec891f1cbce1d6a
2021-11-12 08:58:51 +05:30
Kiran Varaganti
09bfe3e372 Improve DGEMM sup for RCR case
When m and n are very large, a larger KC is preferred
to reduce L3 cache misses, so the KC value is increased to KC0. This
improved performance of DGEMM considerably on EPYC processors.
Note that PACKA and PACKB are disabled.

Change-Id: I7b3f5b53a01fca0e55bc4479eeb644c7001d3463
2021-11-12 08:58:51 +05:30
Kiran Varaganti
628f4a41b8 Fixed bug in packing API
bli_pack_set_pack_b() API wrongly inits pack_a element of rntm object
instead of pack_b. Fixed this bug.

Change-Id: I267493ab3ff0bade478d1157799a6fa5b51c7970
2021-11-12 08:58:50 +05:30
Madan mohan Manokar
3dda4ebf22 Induced method turned off, fix for beta=0 & C = NAN
1. Induced method turned off until the path is fully tested for different alpha, beta conditions.
2. Fixed the case beta = 0 and C = NaN.

Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
2021-11-12 08:58:50 +05:30
Meghana Vankadari
f1291fc957 Disabled dzgemm blas interfaces
AMD-Internal: [CPUPL-1825]
Change-Id: I786a2ad83be418bd4c55d24e30ae6c3a86fd8844
2021-11-12 08:58:50 +05:30
Nallani Bhaskar
15b7fff159 Fixed reading C when beta=0 in few sgemm asm kernels
Description:

1. When beta is zero we should not perform any arithmetic operation
   on C data and should not assume anything about the values of the C matrix.

2. This is already taken care of in all sgemmsup kernels except
   bli_sgemmsup_rv_zen_asm_2x8 and bli_sgemmsup_rv_zen_asm_3x8,
   when beta is zero and C is column-stored. Fixed
   this issue by removing the read of the C matrix in these kernels.

3. When C has NaN or Inf and we multiply NaN or Inf by zero
   (beta), the result becomes NaN.

Change-Id: I3fb8c0cd37cf1d52a7909f6b402aa9c40c7c3846
2021-11-12 08:58:50 +05:30
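The third point of the commit above follows from IEEE 754 arithmetic: NaN * 0 and Inf * 0 are both NaN. A minimal Python sketch on a single element of C illustrates why the old value must not be read when beta is zero; `update_c_naive` and `update_c_beta_aware` are hypothetical names, and `alpha_ab` stands for the already computed alpha * A * B contribution.

```python
import math


def update_c_naive(alpha_ab, beta, c):
    """Always reads C: a NaN or Inf in C survives even when beta is zero."""
    return alpha_ab + beta * c


def update_c_beta_aware(alpha_ab, beta, c):
    """Treats beta == 0 as 'overwrite C' and never uses the old value."""
    if beta == 0.0:
        return alpha_ab
    return alpha_ab + beta * c


polluted = update_c_naive(1.5, 0.0, math.nan)
clean = update_c_beta_aware(1.5, 0.0, math.nan)
```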
Saitharun
ffcb338531 Adding ENABLE_WRAPPER CMAKE OPTION
Details: Wrapper code is enabled when the CMake option
ENABLE_WRAPPER is selected. This commit also fixes the ScaLAPACK build
error on Windows.

AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
2021-11-12 08:58:49 +05:30
Meghana Vankadari
5c770cafee Removed syrk_small code
- The current implementation of syrk_small computes the entire C matrix
  rather than computing triangular part. This implementation is not
  efficient.

AMD-Internal: [CPUPL-1571]
Change-Id: I9a153207471a55e52634429062d18ba1a225fed9
2021-11-12 08:58:49 +05:30
Meghana Vankadari
7bbb7ee7f2 Added weighted thread distribution for SUP GEMMT/SYRK
Change-Id: Ia080b8a76e788d923bb3545b3f8f97e39f85cebf
2021-11-12 08:58:49 +05:30
Dipal M Zambare
d1fb770f1c Removed duplicate cpp and testcpp folders.
The cpp and testcpp folders exist in the root directory as well as the vendor
directory. Only the folders in the vendor directory are needed.

Removed duplicate directories and updated makefiles to pick the
sources from vendor folder.

AMD-Internal: [CPUPL-1834]
Change-Id: I178043a09fd746660938b89ecce73c53d6c53409
2021-11-12 08:58:49 +05:30
Dipal M Zambare
41a86c2463 Enabled znver3 flag for gcc version above 11
-- Added -march=znver3 flag if the library is built for zen3
   configuration with gcc compiler version 11 or above.

-- Replaced hardcoded compiler names 'gcc' and 'clang' with
   variable $CC so that options are chosen as per the compiler
   specified at configure time (instead of compiler in path).

AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
2021-11-12 08:58:48 +05:30
Dipal M Zambare
f698afc567 Updated version number to 3.1.0
AMD-Internal: [CPUPL-1811]
Change-Id: I6b485e7622e526791094ae621d9f84d2526e6569
2021-11-12 08:58:48 +05:30
Saitharun
5d6012c50d Added additional symbols for BLIS APIs using wrapper functions
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase.
With this commit, we also support uppercase symbol names with and
without trailing underscore, and lowercase symbol names without
trailing underscore.

Change-Id: Ibb06121821ab937b25d492409625916f542b2135
2021-11-12 08:58:48 +05:30
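As an illustration of the extra symbol names listed in the commit above, the Python sketch below derives them from an existing name. It assumes the existing BLAS symbol is lowercase with a trailing underscore (for example `dgemm_`); `symbol_variants` is a hypothetical helper and says nothing about how the wrappers are actually generated.

```python
def symbol_variants(blas_name):
    """List the additional symbol names for a lowercase name ending in '_'."""
    base = blas_name.rstrip("_")
    return [
        base.upper() + "_",  # uppercase with trailing underscore
        base.upper(),        # uppercase without trailing underscore
        base.lower(),        # lowercase without trailing underscore
    ]


variants = symbol_variants("dgemm_")
```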
Abhiram S
7787bc79b1 Level1 samaxv: AVX512 implementation
Details:
 1. Unrolled by a factor 5. This gave around 1GFLOPS gain
 2. Changed CMP to subs and remove nan. CMP uses a lot of
    compare, which is higher in latency and more number of
    instructions. Replacing with subs and remove nan
    reduced it to 3 instructions and lighter ones.
 3. Added remove nan function.
 4. Added AVX512 definition in skx context.
 5. Disabled code in AMAXV kernel depending on whether the AVX512
    flag exists

Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
2021-09-27 16:10:08 +05:30
Harihara Sudhan S
bdb5e32176 Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Usage of COMPARE256/COMPARE128 avoided in AVX512
  implementation for better performance
- Unrolled the loop by order of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2021-09-02 09:47:08 +05:30
Dipal M Zambare
bcd9591b3f Added support for amdepyc fat binary
-- Created new configuration amdepyc to include fat binary which
     includes zen, zen2, zen3 and generic architecture for fallback.

  -- Updated amdepyc family makefiles to include macros needed
     in amdepyc family binary. This file must include all macros,
     compiler options to be used for non architecture specific code.

  -- Added 'workaround' to exclude ZEN family specific code in some of
     the framework files. There are still a lot of places where ZEN family
     specific code is added in framework files. They will be addressed
     with proper design later.
       - Moved definition of BLIS_CONFIG_EPYC from header files to
         makefile so that it is enabled only for framework and kernels

  -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
     wherever it was needed.

  -- Removed un-used, obsolete macros, some of them may be needed for
     debugging which can be added in the individual workspaces.
       - BLIS_DEFAULT_MR_THREAD_MAX
       - BLIS_DEFAULT_NR_THREAD_MAX
       - BLIS_ENABLE_ZEN_BLOCK_SIZES
       - BLIS_SMALL_MATRIX_THRES_TRSM
       - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
       - BLIS_ENABLE_SUP_MR_EXT
       - BLIS_ENABLE_SUP_NR_EXT

  -- Corrected implementation of existing amd64_legacy configuration.

AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
2021-08-26 15:13:49 +05:30
Meghana Vankadari
5cf260f1ec Fixed a bug in trsm blocksize determining function.
- The function "bli_determine_blocksize_b_sub" uses a modulo operation
  using b_alg. In cases where trsm blocksizes are not set, using this
  function with trsm blocksizes being zero will lead to FPE.
- Added a check to avoid calling the above function with zeroes as
  parameters.

Change-Id: I770aaa13125b55320a68ff9fc3da782111e0978a
2021-08-26 15:10:41 +05:30
satish kumar nuggu
5adb7bf1a4 Combined variants to reduce redundancy in dtrsm small
1. Left Lower non-trans,Left Upper trans
2. Left Upper non-trans,Left Lower trans
3. Right Lower non-trans,Right Upper trans
4. Right Upper non-trans,Right Lower trans

Change-Id: I0b0155d7c3a55ec74d53c8f1f49f1bceb63b15f5
2021-08-26 15:04:14 +05:30
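The pairings listed in the commit above follow from the fact that transposing a triangular matrix swaps its lower and upper triangles. A short Python sketch of the resulting mapping is shown below; `dtrsm_kernel_variant` is a hypothetical name and the tuple it returns is only a label, not a BLIS kernel identifier.

```python
def dtrsm_kernel_variant(side, uplo, trans):
    """Map (side, uplo, trans) to one of four combined variants.

    A transposed lower-triangular operand behaves like an upper-triangular
    one (and vice versa), so each transposed case reuses the opposite
    triangle's non-transposed variant.
    """
    if trans:
        uplo = "U" if uplo == "L" else "L"
    return (side, uplo)


all_variants = {
    dtrsm_kernel_variant(side, uplo, trans)
    for side in ("L", "R")
    for uplo in ("L", "U")
    for trans in (False, True)
}
```

The eight (side, uplo, trans) cases collapse to four labels, matching the four numbered pairs in the commit message.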
Meghana Vankadari
107eaea237 Removed dependency on AOCL_BLIS_ZEN for TRSM blocksizes.
Details:
- The field "trsm_blkszs" in cntx enables us to set and use blocksizes
  that are specific to trsm only instead of using common L3 blocksizes.
- While querying blocksizes for TRSM, added a check to see if
  trsm-specific blocksizes are set. If they are not set, we continue
  trsm execution by using common Level-3 blocksizes.
- TRSM-specific blocksizes (if required) need to be set in the
  cntx_init function of the respective configuration.

Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97
2021-08-16 00:12:33 -04:00
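The fallback described in the commit above can be sketched as a lookup with a default. In this Python illustration, plain dictionaries stand in for the cntx fields, a value of zero is assumed to mean "not set", and `query_trsm_blocksize` is a hypothetical name rather than a BLIS function.

```python
def query_trsm_blocksize(trsm_blkszs, l3_blkszs, bszid):
    """Return the trsm-specific blocksize if set, else the common L3 one."""
    trsm_value = trsm_blkszs.get(bszid, 0)
    if trsm_value != 0:  # assumption: zero marks an unset trsm blocksize
        return trsm_value
    return l3_blkszs[bszid]


l3 = {"MC": 72, "KC": 256}
trsm = {"MC": 144, "KC": 0}  # only MC has a trsm-specific value
mc = query_trsm_blocksize(trsm, l3, "MC")
kc = query_trsm_blocksize(trsm, l3, "KC")
```

Checking for zero before using the value also keeps an unset blocksize out of any later modulo or division.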
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
Dipal Madhukar Zambare
0cb552c8f8 Merge "Updated memory pool implementation to contain buffers of different sizes." into amd-staging-milan-3.1 2021-08-12 09:01:02 -04:00
Dipal M Zambare
2eede504b5 Updated memory pool implementation to contain buffers of different sizes.
-- In the existing design, the memory pool supports buffers of only one
     size. This size is determined at compile time to support the buffer
     needed for biggest block size and data type. However, the size
     calculation is not generic as it considers sizes only for GEMM.
     Also it assumes that the buffer will not be used for any other
     purpose than packing operations.

  -- If a new buffer is requested whose size is bigger than the existing
     size, the pool is re-initialized to contain buffers of the new size;
     however, this is done only if the pool is empty. If the pool is not
     empty the execution is aborted. This is undesirable, as in
     multithreaded scenarios it is possible that different threads need
     buffers of different sizes at the same time (i.e. while another thread
     is still in the middle of the operation).

  -- This commit removes the restriction of a single buffer size. When a
     new buffer is requested whose size is bigger than the existing one, no
     re-init is done. A new buffer of the required size is allocated, added
     to the pool and returned to the user.

Change-Id: I20acdb60eb06ab2e53366d51713aa83c4b2df0da
2021-08-12 10:46:27 +05:30
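The behaviour described in the commit above can be sketched as a pool that keeps free buffers of mixed sizes and allocates a new one when none is large enough. The Python class below is a toy model, not the BLIS pool API: `BufferPool`, `checkout` and `checkin` are hypothetical names, and locking for multithreaded use is left out.

```python
class BufferPool:
    """Toy pool holding free buffers of different sizes."""

    def __init__(self):
        self.free = []

    def checkout(self, size):
        # Pick the smallest free buffer that is large enough, by index,
        # so that exactly that buffer object leaves the free list.
        best = None
        for i, buf in enumerate(self.free):
            if len(buf) >= size and (best is None or len(buf) < len(self.free[best])):
                best = i
        if best is not None:
            return self.free.pop(best)
        # No suitable buffer: allocate a new one instead of re-initializing.
        return bytearray(size)

    def checkin(self, buf):
        self.free.append(buf)


pool = BufferPool()
small = pool.checkout(64)
pool.checkin(small)
large = pool.checkout(128)   # bigger than anything in the pool
reused = pool.checkout(32)   # served by the existing 64-byte buffer
```

In this sketch a newly allocated buffer joins the pool when it is checked back in, so the pool naturally ends up holding buffers of several sizes.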
Kiran Varaganti
adfd569591 DGEMM Optimizations
Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable
calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10).
Improved smart threading logic for dgemm.
Additional conditions at the blas interface added to invoke bli_dgemm_small.
Removed  N > 3  condition from bli_dgemm_small.

Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
2021-08-10 12:34:43 -04:00
Meghana Vankadari
170719e647 Fixed few bugs in GEMMT for non-zen configs
Details:
- Added a check condition in GEMMT native path to choose
  update_triang routines based on whether the kernels are
  row-preferred or column preferred.
- Moved the zen-specific  SUP thresholds under BLIS_CONFIG_EPYC
  macro. For non-zen architectures, it falls back to L3 SUP
  thresholds.
- Modified SUP code path to always choose 2m variant. 1n variant
  is not implemented for GEMMT.

Change-Id: Ifdd55815c588f645e337de80b5b9d1864f6b5dd3
2021-08-10 02:38:17 -04:00
Nallani Bhaskar
29e7f08eb2 Updated aocldtl logging with fflush
Description:
1. Added fflush after fprintf in aocldtl to avoid missing any log
   info when the application crashes
2. Removed one redundant ifdef statement

Change-Id: I79a645060fa36bd8c7e5eaca4d9183dc944329ea
2021-08-09 19:14:50 +05:30
Nallani Bhaskar
40a5a614b2 Updated vector loads/stores in remainder cases to avoid access beyond the matrix boundary
Description:
1. While processing remainder cases in the bli_trsm_small algorithm,
   there were a few loads and stores which were accessing memory
   beyond the given matrix buffer because of vectorized instructions.

2. Modified 256-bit vector loads at edges into 128-bit or 64-bit loads/stores
   such that no read/write happens beyond the matrix boundary.

AMD-Internal: [CPUPL-1759] [SWLCSG-819]

Change-Id: Iba51d0ed9bb28d1b0948a219755b8dbcc86a7fa9
2021-08-09 10:51:26 +05:30
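The idea in the commit above is to cover the leftover elements at a matrix edge with progressively narrower accesses instead of one full-width vector access. The Python sketch below plans such accesses, assuming double-precision elements (so 256, 128 and 64 bits hold 4, 2 and 1 elements); `plan_edge_loads` is a hypothetical helper and does not mirror the actual kernel code.

```python
def plan_edge_loads(remaining, widths=(4, 2, 1)):
    """Split `remaining` elements into loads that never overrun the buffer.

    Each entry in the returned list is the number of elements moved by one
    load; with doubles, 4/2/1 correspond to 256/128/64-bit accesses.
    """
    plan = []
    for width in widths:
        while remaining >= width:
            plan.append(width)
            remaining -= width
    return plan


edge_plan = plan_edge_loads(3)
```

Because the widths always sum to exactly the number of remaining elements, no load or store touches memory past the end of the matrix.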
Madan mohan Manokar
4b90ae3112 single instance zgemm tuning
1. single instance case sup is enabled.
2. Env BLIS_SINGLE_INSTANCE should be set to 1 to enable single instance tuning.

AMD-Internal: [CPUPL-1743]
Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec
2021-07-29 13:36:14 +05:30
Manideep Kurumella
33fd2f7398 Merge " BLIS : DGEMV performance improvement for incy/incx greater than 1" into amd-staging-milan-3.1 2021-07-21 10:23:47 -04:00