Commit Graph

205 Commits

Author SHA1 Message Date
Mangala V
6c1acc74c8 ZGEMM optimizations
-- Conditionally packing of B matrix is enabled in zgemmsup path
   which is performing better when B matrix is large

-- Incorporated decision logic to choose between zgemm_small vs
   zgemm sup based on matrix dimensions "m, n and k".

-- Calling of ZGEMV when matrix dimension m or n = 1.
   Very good performance improvement is observed.

Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d
2022-07-28 14:55:24 +05:30
Vignesh Balasubramanian
808d79a610 Implemented efficient ZGEMM algorithm when k=1
Problem statement :
To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation.
In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines:

- Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api.
- Register blocking was further improved, resulting in the kernel size to increase from 4x5 to 4x6.
- Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance.
- Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output.

The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads.

AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
2022-07-28 02:09:45 -04:00
Vignesh Balasubramanian
2ad25a7180 ZGEMM kernel performance improvement for k=1 sizes:
The current implementation for handling zgemm exploits SIMD parallelism
along the k dimension. This would give great performance in cases of k
being large. But for input sizes with k=1, it is better to exploit SIMD
parallelism along the m and n dimensions, thereby giving better
performance. This commit does the same through loop reordering, by
loading column vectors from A.

AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
2022-06-30 07:19:52 -04:00
satish kumar nuggu
aaf840d86e Disabled zgemm SUP path
- Need to identify new Thresholds for zgemm SUP path to avoid performance regression.

    AMD-Internal: [CPUPL-2148]

Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43
2022-06-27 08:19:12 +00:00
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAVX
- Completed zen4 configuration support on windows
 - Enabled AVX512 kernels for AMAXV
 - Added zen4 configuration in amdzen for windows
 - Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Dipal M Zambare
7247e6a150 Fixed crash issue in TRSM on non-avx platform.
- Ensured that FMA, AVX2 based kernels are called only on platforms
   supporting these instructions, otherwise standard ‘C’ kernels will
   be called.
 - Code cleanup for optimization and consistency

AMD-Internal: [CPUPL-2126]
Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050
2022-05-17 18:10:39 +05:30
Dave
963a6aa099 Enabled zgemm_sup path and removed sqp path
- Previously zgemm computation failures were due to
  status variable did not have pre-defined initial
  value which resulted in zgemm computation to return
  without being computed by any kernel. Reflected
  same change in dgemm_ function as well. 

- Enabled sup zgemm as the issue is fixed with
  status variable with bli_zgemm_small call.

-Removed calling sqp method as it is disabled





Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d
2022-05-17 18:10:39 +05:30
satish kumar nuggu
1a3428ddfc Parallelization of dtrsm_small routine
1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.

    AMD-Internal: [CPUPL-2103]

Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
2022-05-17 18:10:39 +05:30
Harsh Dave
52e4fd0f11 Performance Improvement for ctrsm small sizes
Details:
- Enable ctrsm small implementation
- Handled Overflow and Underflow Vulnerabilites in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, ctrsm small implementation is
used for all variants.

Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
2022-05-17 18:10:39 +05:30
Sireesha Sanga
cc3069fb5e Performance Improvement for ztrsm small sizes
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled Overflow and Underflow Vulnerabilites in
  ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
  sizes 64<= m,n <= 256, by keeping the number of
  threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
  used for all variants.

AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
2022-05-17 18:10:39 +05:30
satish kumar nuggu
fe7f0a9085 Changes to enable zgemm small from BLAS Layer
1. Removed small gemm call from native path to avoid Single threaded
calls as a part of MultiThreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.

Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
2022-05-17 18:10:38 +05:30
Sireesha Sanga
9621ef3067 Performance Improvement for ztrsm small sizes
Details:
- Enable ztrsm small implementation
- For small sizes, Right Variants and Left Unit Diag
  Variants are using ztrsm_small implementations.
- Optimization of Left Non-Unit Diagonal Variants,
  Work In Progress

AMD-Internal: [SWLCSG-1194]
Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4
2022-05-17 18:09:22 +05:30
Harsh Dave
0976ed9ce5 Implement zgemm_small kernel
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support conjugate and hermitian transpose
- Main loop operates in multiple of 4x3 tile.
- Edge cases are handles separately.

AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
2022-05-17 18:09:22 +05:30
Nallani Bhaskar
eb0ff01871 Fine-tuning dynamic threading logic of DGEMM for small dimensions
Description:
    1. For small dimensions single threads dgemm_small performing
       better than dgemmsup and native paths.
    2. Irrespecive of given number of threads we are redirecting
       into single thread dgemm_small

       AMD-Internal:[CPUPL-2053]

Change-Id: If591152d18282c2544249f70bd2f0a8cd816b94e
2022-05-17 18:09:22 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30
Meghana Vankadari
c11fd5a8f6 Added functionality support for dzgemm
AMD-Internal: [SWLCSG-1012]
Change-Id: I2eac3131d2dcd534f84491289cbd3fe7fb7de3da
2022-05-17 18:01:55 +05:30
Dipal M Zambare
f63f78d783 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-01-18 11:51:08 +05:30
Dipal M Zambare
fd8a3aace9 Added support for zen4 architecture
- Added configuration option for zen4 architecture
  - Added auto-detection of zen4 architecture
  - Added zen4 configuration for all checks related
    to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2021-11-23 10:29:15 +05:30
mkurumel
235071690a Fixed dynamic dispatch crash issue on non-zen architecture for gemv and axpy routines.
Summary:
    1. This commit fixed issue for gemv and axpy API’s.
    2. The BLIS binary with dynamic dispatch feature was
    crashing on non-zen CPUs (specifically CPUs without
    AVX2 support).
    3. The crash was caused by un-supported instructions
    in zen optimized kernels.The issue is fixed by calling
    only reference kernels if the architecture detected at
    runtime is not zen, zen2 or zen3.

Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
2021-11-12 08:58:59 +05:30
Harihara Sudhan S
4de6f2ca6d Fixed dynamic dispatch crash issue on non-zen architecture.
Direct calls to zen kernels replaced by architecture
dependent calls for dotv and amaxv kernels. For non-zen
architecture, generic function is called using the BLIS
interface. For zen architecture, direct calls to zen
optimized kernels are made.

Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
2021-11-12 08:58:58 +05:30
Harsh Dave
8f297f6267 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.

The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
2021-11-12 08:58:58 +05:30
Dipal M Zambare
61b5b9c4d0 Fixed dynamic dispatch crash issue on non-zen architecture.
Removed direct calling of zen kernels in cblas source itself.
Similar optimizations are done by the function directly invoked from
Cblas layer.

The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
2021-11-12 08:58:58 +05:30
Dipal M Zambare
ddbdfd0ba4 Fixed dynamic dispatch crash issue on non-zen architecture.
This commit fixed issue for gemm and copy API’s.

The BLIS binary with dynamic dispatch feature was crashing on non-zen
CPUs (specifically CPUs without AVX2 support).
The crash was caused by un-supported instructions in zen optimized kernels.
The issue is fixed by calling only reference kernels if the architecture detected at
runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]

Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
2021-11-12 08:58:57 +05:30
lcpu
30038af896 Reverted: To fix accuracy issues for complex datatypes
Details:
-- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues
observed by libflame and scalapack application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
2021-11-12 08:58:57 +05:30
Harsh Dave
53e1d0539f Fixed conjugate transpose kernel issue
Details:
AMD Internal Id: CPUPL-1702
- For the cases of A being of 1x1 dimension and of
left and right hand side, A's only element is conjugate
transposed by negating its imaginary component.

Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
2021-11-12 08:58:56 +05:30
Harsh Dave
2b6faf21a1 Fixed conjugate transpose issue for zscalv and cscalv
Details:
AMD Internal Id: CPUPL-1702
- While performing trsm function A's imaginary
part needed to be complimented as per conjugate
transpose.
-So in the case of conjugate transpose A's imaginary
part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
2021-11-12 08:58:55 +05:30
Harsh Dave
590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small for in ctrsm_ BLAS path for single thread
   when (m,n)<1000 and multithread (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
2021-11-12 08:58:55 +05:30
satish kumar nuggu
23278627f4 STRSM small kernel implementation
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- modularized strsm 16 combinations of trsm into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
Dipal M Zambare
8f310c3384 AOCL DTL - Added thread and execution time details in logs
-- Added number of threads used in DTL logs
    -- Added support for timestamps in DTL traces
    -- Added time taken by API at BLAS layer in the DTL logs
    -- Added GFLOPS achieved in DTL logs
    -- Added support to enable/disable execution time and
       gflops printing for individual API's. We may not want
       it for all API's. Also it will help us migrate API's
       to execution time and gflops logs in stages.
    -- Updated GEMM bench to match new logs
    -- Refactored aocldtl_blis.c to remove code duplication.
    -- Clean up logs generation and reading to use spaces
       consistently to separate various fields.
    -- Updated AOCL_gettid() to return correct thread id
       when using pthreads.

AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
2021-11-12 08:58:54 +05:30
Harsh Dave
a7f600b3a4 Implemented ztrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small for in ztrsm_ BLAS path for single thread
   when (m,n)<500 and multithread (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
2021-11-12 08:58:54 +05:30
Nallani Bhaskar
40e1fd1860 Enabled processing of zero inputs of x in tbsv to provide NaN
Details:
The basic idea is inverse of singular matrix doesn't exist,
therefore we should be returning NAN. BLAS standard and BLIS
is optimizing by not doing any compute when x[j]== 0. As a
result BLIS is generating finite values for inverse
calculation of singular matrices which in reality is not
the right answer. Fix is provided in this commit to generate
NAN/INF values incase this API is called to compute inverses
of singular matrices. But according to the standard, this API
shouldn't be called in the first place, the check for singularity
or near singularity should be done by the calling application

Change-Id: Iccdbc07744de3892626f4066ee4a63eb30bc06cd
2021-11-12 08:58:53 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
Madan mohan Manokar
3dda4ebf22 Induced method turned off, fix for beta=0 & C = NAN
1. Induced Method turned off, till the path fully tested for different alpha,beta conditions.
2. Fix for Beta =0, and C = NAN done.

Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
2021-11-12 08:58:50 +05:30
Meghana Vankadari
f1291fc957 Disabled dzgemm blas interfaces
AMD-Internal: [CPUPL-1825]
Change-Id: I786a2ad83be418bd4c55d24e30ae6c3a86fd8844
2021-11-12 08:58:50 +05:30
Saitharun
ffcb338531 Adding ENABLE_WRAPPER CMAKE OPTION
details: Wrapper code will be enabled when selecting the cmake option
ENABLE_WRAPPER and also this commit will fixing the ScaLAPACK build
error on windows.

AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
2021-11-12 08:58:49 +05:30
Saitharun
5d6012c50d Added additional symbols for BLIS APIs using wrapper functions
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase.
With this commit - we also supports uppercase with and without
trailing underscore, lowercase without trailing underscore symbol
names.

Change-Id: Ibb06121821ab937b25d492409625916f542b2135
2021-11-12 08:58:48 +05:30
Dipal M Zambare
bcd9591b3f Added support for amdepyc fat binary
-- Created new configuration amdepyc to include fat binary which
     includes zen, zen2, zen3 and generic architecture for fallback.

  -- Updated amdepyc family makefiles to include macros needed
     in amdepyc family binary. This file must include all macros,
     compiler options to be used for non architecture specific code.

  -- Added 'workaround' to exclude ZEN family specific code in some of
     the framework files. There are still lot of places were ZEN family
     specific code is added in framework files. They will be addressed
     with proper design later.
       - Moved definition of BLIS_CONFIG_EPYC from header files to
         makefile so that it is enabled only for framework and kernels

  -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
     wherever it was needed.

  -- Removed un-used, obsolete macros, some of them may be needed for
     debugging which can be added in the individual workspaces.
       - BLIS_DEFAULT_MR_THREAD_MAX
       - BLIS_DEFAULT_NR_THREAD_MAX
       - BLIS_ENABLE_ZEN_BLOCK_SIZES
       - BLIS_SMALL_MATRIX_THRES_TRSM
       - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
       - BLIS_ENABLE_SUP_MR_EXT
       - BLIS_ENABLE_SUP_NR_EXT

  -- Corrected implementation of exiting amd64_legacy configuration.

AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
2021-08-26 15:13:49 +05:30
Kiran Varaganti
adfd569591 DGEMM Optimizations
Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable
calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10).
Improved smart threading logic for dgemm,
Additional conditions at the blas interface added to invoke bli_dgemm_small.
Removed  N > 3  condition from bli_dgemm_small.

Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
2021-08-10 12:34:43 -04:00
Madan mohan Manokar
4b90ae3112 single instance zgemm tuning
1. single instance case sup is enabled.
2. Env BLIS_SINGLE_INSTANCE should be set to 1 to enable single instance tuning.

AMD-Internal: [CPUPL-1743]
Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec
2021-07-29 13:36:14 +05:30
Madan mohan Manokar
d3542ff0e0 3m_sqp conjugate support added
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.

AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
2021-07-05 18:44:55 +05:30
Madan mohan Manokar
70e9d327a2 squarePacked(sqp) framework and multi-instance handling
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to seperate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.

AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
2021-06-28 15:40:11 +05:30
Meghana Vankadari
cb3a40ab9d Added blas interface for dzgemm
- Added blas interface for dzgemm. This function will call
  native implementation of gemm.
- Mixed datatype support is already present in BLIS. But this
  implementation requires alpha_imag value to be 0.
- Modified test_gemm.c to support testing of dzgemm.

Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334
2021-06-27 09:34:18 -04:00
Nallani Bhaskar
75f72b7f6e Added aocl dynamic feature for dtrsm for small sizes
Details:
1. Added aocl-dynamic for dtrsm native path
   When (m,n)<512 better performance observed for nthreads=4
2. Updated trsm_small threshold such that when (m+n)<320
   trsm_small is doing better than native irrespective of
   number of threads

Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
2021-06-18 08:46:47 -04:00
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching next micro-panel of A and B in dgemm block,
   which are helping in reducing load latency and improved performance.

2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
   dgemm into macros and made it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to modularize.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Kiran Varaganti
c2abbcab96 Fix dgemm_ Multi-thread running as Single Thread
Details:
When parallelization is enabled in BLIS through enviroment varaibles BLIS_?C_NT or
BLIS_?R_NT - dgemm_ is running as Single thread. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set num_threads paramenter in rntm is -1
irrespective of BLIS_IC_NT or BLIS_JC_NT values, as a result in dgemm_ interface it assumes single thread and calls
small_gemm which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c it returns 1 if parallelization is enabled either through BLIS_?C_NT values or
BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or sequential one.
Add fix for zgemm_ also.

Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
2021-06-15 12:14:11 +05:30
Dipal M Zambare
2f344f5df1 DTL logs corrections
-- Fixed issues in printing the values of
     side, uploa and diaga parameters for
     hemm, hemv, her, her2, her2k, herk,
     symm, symv, syr, syr2, syr2k, syrk,
     trmm, trmv, trsm, trsv.
  -- For above API's logging was called with MKSTR()
     for side, uploa and diaga parameters. MKSTR is
     needed only for macro arguments but not
     for function's arguments.
  -- Added space between function name and data type
     where it was missing. Bench expects logs in
     this format.

AMD-Internal: [CPUPL-1585]
Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d
2021-06-04 15:24:13 +05:30
Nageshwar Singh
2e1a5bc1dd Optimized double complex axpyf kernel for zgemv
Details:
  - Implemented zaxpyf kernel with fuse factor=4 for zgemv.
  - Modified BLAS interface call for zgemv to reduce framework overhead.
  - Directed gemv to dotv in the case where dimension of y vector is 1.
  - when alpha = 0, gemv becomes scalv of Y with beta. Added code to
    return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
2021-06-01 18:03:29 +05:30
Meghana Vankadari
4446395047 Redirecting dgemv to axpyf based implementation for smaller sizes.
AMD-Internal: [CPUPL-1403]
Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994
2021-05-25 01:39:12 -04:00
satish kumar nuggu
82087773a0 Optimized single threaded dtrsm small for right cases
Details:

1. Added optimized dtrsm kernels for all 8 right side cases
   Below are few notable optimizations which improved performance

   a. Loading, transposing (for transa cases), packing and reusing
      of a01 block required for GEMM operation. The block size
      increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
      from one end of A to other end of triangular A
   b. Packing of 6 diagonal elements in one location helped to utilize
      cache line efficiently

      AMD-Internal: [CPUPL-1563]

Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
2021-05-25 01:09:50 -04:00