Commit Graph

2256 Commits

Author SHA1 Message Date
Mangala V
9e147912ee Merge "ZGEMM SUP: Removed unused assembly instructions" into amd-staging-milan-3.1 2021-04-19 03:08:31 -04:00
managalv
6c3741cd3e ZGEMM SUP: Removed unused assembly instructions
Removed unused memory operations.
Modified labels to be unique within a file.
Rowstride update is done once to avoid multiple mul instructions.

AMD Internal  : [CPUPL-1419]

Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103
2021-04-19 20:31:03 +05:30
Manideep Kurumella
3c2c8157c9 Merge "SGEMV performance improvement." into amd-staging-milan-3.1 2021-04-12 01:22:48 -04:00
mkurumel
f8525a888e SGEMV performance improvement.
1. bli_sdotxf_zen_int_8:
               used the hadd_ps intrinsic instead of dp_ps to
               add the partial dot outputs.

AMD Internal  : [CPUPL-1512]

Change-Id: I6e8e71a9cf8c1f30a1710dd1c67f193a998beb03
2021-04-12 10:47:23 +05:30
Madan mohan Manokar
997133dc11 sup zgemm improvement
1. In zgemm, mkernel outperforms nkernel for both m > n, and n > m.
2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on analysis done.

Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]
2021-04-09 01:45:04 -04:00
Kiran Varaganti
a7b7fcae59 Merge "Fix test_dotv.c when complex-return=intel" into amd-staging-milan-3.1 2021-04-06 10:01:45 -04:00
Kiran Varaganti
657f58b82b Fix test_dotv.c when complex-return=intel
When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototypes change.
The "return parameter" becomes the first argument of these functions, and the functions return void.
This fix accounts for the changed function declarations when --complex-return=intel is enabled.
Missing Trace and Log statements for this configuration are now added.
[CPUPL-1376]

Change-Id: Ib420989da71839211c16088bf431a2ad775a3978
2021-04-06 15:38:12 +05:30
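
The prototype change described in the commit above can be sketched in plain C. The names `dotu_by_value` and `dotu_by_hidden_arg` are hypothetical stand-ins, not BLIS symbols, and the argument lists are simplified (no increments). The sketch only contrasts returning a complex value with writing it through a leading pointer argument.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical stand-in for the default convention:
   the complex dot product is the function's return value. */
double complex dotu_by_value(int n, const double complex *x,
                             const double complex *y)
{
    assert(n >= 0);
    double complex acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += x[i] * y[i];   /* unconjugated product, as in dotu */
    return acc;
}

/* Hypothetical stand-in for the intel-style convention:
   the result is written through the first argument and the
   function itself returns void. */
void dotu_by_hidden_arg(double complex *result, int n,
                        const double complex *x,
                        const double complex *y)
{
    assert(result != NULL);
    *result = dotu_by_value(n, x, y);
}
```

A caller such as a test driver has to pick the matching call form at compile time, since the two prototypes are not interchangeable.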
Madan Mohan Manokar
2aad3fbe55 Merge "disabled zgemm induced and gemm sqp temporarily." into amd-staging-milan-3.1 2021-04-05 05:51:12 -04:00
Madan mohan Manokar
7112b73d0d disabled zgemm induced and gemm sqp temporarily.
1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.

AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2
2021-04-04 20:43:03 +05:30
Dipal M Zambare
5562a27823 Added check for zero dimensions and early return in ?gemv and ?scal API.
If one of the passed dimensions is zero, these APIs return early,
which avoids crashes when any other pointer inputs are null.

AMD-Internal: [SWLCSG-602]
Change-Id: Ibe8902beef286410707a2a88e94b933b49975c85
2021-03-26 21:08:22 +05:30
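
A minimal sketch of the early-return pattern described in the commit above, assuming a simplified column-major update y := y + alpha*A*x. The function `gemv_sketch` and its return codes are hypothetical and are not the BLIS implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified gemv: y := y + alpha * A * x, with A stored
   column-major (m x n, leading dimension lda).
   Returns 1 if it returned early, 0 if it did the computation. */
int gemv_sketch(int m, int n, double alpha,
                const double *a, int lda,
                const double *x, double *y)
{
    /* With a zero dimension there is nothing to compute, so return
       before any pointer is dereferenced; the pointers may be NULL. */
    if (m == 0 || n == 0)
        return 1;

    /* Past this point the buffers are actually needed. */
    assert(a != NULL && x != NULL && y != NULL);

    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += alpha * a[i + (size_t)j * (size_t)lda] * x[j];
    return 0;
}
```

Placing the dimension check first means a call such as `gemv_sketch(0, 3, 1.0, NULL, 1, NULL, NULL)` is harmless.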
Madan Mohan Manokar
4f19ef8339 Merge "3m_sqp vectorization" into amd-staging-milan-3.1 2021-03-10 02:03:23 -05:00
Madan mohan Manokar
a424e8b426 3m_sqp vectorization
1. bli_malloc replaced with normal malloc, with address alignment done within 3m_sqp.
2. Function added to pack A real, imag and sum.
3. Function added to pack B real, imag and sum.
4. Function added to pack C real, imag and handle beta.
5. sum and sub vectorized.

AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
2021-03-10 11:54:50 +05:30
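
The "pack real, imag and sum" steps in the commit above match the textbook 3m method, which forms a complex product from three real products. The sketch below is a scalar, unvectorized illustration for square row-major matrices. All function names are hypothetical, and alpha/beta handling and the packing of C are omitted.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Split complex values into real, imaginary, and (real + imag) buffers. */
static void pack_real_imag_sum(size_t len, const double complex *src,
                               double *re, double *im, double *sum)
{
    for (size_t i = 0; i < len; ++i) {
        re[i]  = creal(src[i]);
        im[i]  = cimag(src[i]);
        sum[i] = re[i] + im[i];
    }
}

/* Plain real product c := a * b for n x n row-major matrices. */
static void real_gemm(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < n; ++p)
                acc += a[i * n + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* c := a * b for n x n complex matrices using three real products.
   Returns 0 on success, -1 if allocation fails. */
int zgemm_3m_sketch(int n, const double complex *a,
                    const double complex *b, double complex *c)
{
    assert(n >= 0);
    if (n == 0)
        return 0;

    size_t len = (size_t)n * (size_t)n;
    double *buf = malloc(9 * len * sizeof *buf);
    if (buf == NULL)
        return -1;

    double *a_re = buf,           *a_im = buf + len,     *a_sum = buf + 2 * len;
    double *b_re = buf + 3 * len, *b_im = buf + 4 * len, *b_sum = buf + 5 * len;
    double *p1   = buf + 6 * len, *p2   = buf + 7 * len, *p3    = buf + 8 * len;

    pack_real_imag_sum(len, a, a_re, a_im, a_sum);
    pack_real_imag_sum(len, b, b_re, b_im, b_sum);

    real_gemm(n, a_re,  b_re,  p1);   /* P1 = Ar * Br               */
    real_gemm(n, a_im,  b_im,  p2);   /* P2 = Ai * Bi               */
    real_gemm(n, a_sum, b_sum, p3);   /* P3 = (Ar + Ai) * (Br + Bi) */

    /* Cr = P1 - P2, Ci = P3 - P1 - P2 */
    for (size_t i = 0; i < len; ++i)
        c[i] = (p1[i] - p2[i]) + (p3[i] - p1[i] - p2[i]) * I;

    free(buf);
    return 0;
}
```

The design trade is one fewer real matrix product than the conventional four, at the cost of the extra sum buffers and the final subtractions.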
nphaniku
e3cc577ec1 AOCL Windows: 3.1 BLIS changes
1. Incorporated code review comments.
2. Updated Copyright to 2021.

AMD Internal : [CPUPL-1422]

Change-Id: I722b0f71daae029a3dcc2cbd029524ea39ca78e6
2021-03-09 17:35:57 +05:30
nphaniku
d78defa0fc AOCL Windows: 3.1 BLIS changes
1. CMake script changes for adding new files to the build.
2. Added upper-case support for a couple of APIs.
3. bool is not supported in clang, so defined it.

AMD Internal : [CPUPL-1422]

Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
2021-03-08 22:32:13 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT.
3. CMake script changes for testcpp and blastest.
4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Kiran Varaganti
12d13629f9 Fix Debug Trace Log in dgemm_ and zgemm_
Replaced "*MKSTR(ch)" in the DTL call "AOCL_DTL_LOG_GEMM_INPUTS(AOCL_DTL_LEVEL_TRACE_1, *MKSTR(ch)...)" with "D" and "Z" for dgemm_ and zgemm_ respectively to prevent printing wrong data-type.

[CPUPL-1449]

Change-Id: Ic91537189352bdb164411799e127de990a5c9a08
2021-03-02 15:16:21 +05:30
Nageshwar Singh
791903b31c Adding trans h support in bench_gemm.c
Change-Id: If340d515c38a593df26d5075e29685ef044601a5
2021-03-02 02:33:06 +05:30
Meghana Vankadari
22d4689360 Implemented 16x3 based gemm kernel for the case where A has transpose
Details:
- This implementation does a transpose operation while packing 16xk of A
  buffer and passes it to 16x3-nn kernel.
- The same implementation works for the case where B has transpose.

AMD-Internal: [CPUPL-1376]
Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17
2021-02-18 16:47:14 +05:30
Kiran Varaganti
851ab8b39f Merge "Code fixes for single-thread and multi-thread builds." into amd-staging-milan-3.1 2021-02-16 23:29:41 -05:00
Kiran Varaganti
e1a5e96c7f Code fixes for single-thread and multi-thread builds.
Made changes to the dgemm_ and zgemm_ interfaces to support multi-threaded GEMM implementations. When the number of threads is greater than one, we call the multi-threaded gemm (sup or native); for a single thread, we call one of several single-threaded gemm implementations, chosen based on the matrix dimensions.
[CPUPL-1376]

Change-Id: I2e37145ec9a07d6b7e7be1719bd49239e813aa8a
2021-02-16 12:44:31 +05:30
Meghana Vankadari
cf7d9c7314 Disabled calling of bli_dgemm_small from gemm_front
Details:
- Decision logic to choose small_gemm has been moved to blas interface.
- Redirecting all the calls to small_gemm from gemm_front to native
  implementation.

AMD-Internal: [CPUPL-1376]
Change-Id: I6490f67113e9f7c272269f441c86f2a0b3c89a53
2021-02-16 11:30:20 +05:30
Madan mohan Manokar
95e0fb3a05 sqp commenting
1. Added comments.

AMD-Internal: [CPUPL-1429]
Change-Id: Ie37e24e58cd8bf836038a2258ebd09c3912fab9e
2021-02-15 04:22:23 -05:00
Meghana Vankadari
42a0a6bc6f Added a basic dgemm implementation for smaller matrices.
Details:
- This kernel works best for cases where k = 1.
- This implementation is called directly from blas interface when A, B
  matrices have no-transpose and k = 1.

AMD-Internal: [CPUPL-1376]
Change-Id: I3b31673a28290c81d4a4cb64c8605d56e50b5d3d
2021-02-15 09:43:47 +05:30
Meghana Vankadari
943b1362c7 Enabled vectorized pack kernels for zen2 configuration.
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
  implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.

AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
2021-02-12 19:16:57 +05:30
Madan mohan Manokar
4c8b823972 gemm_sqp(gemm_squarePacked): 3m_sqp and dgemm_sqp
1. SquarePacked algorithm focuses on efficient zgemm/dgemm implementation for square matrix sizes (m=k=n)
2. Variation of 3m algorithm (3m_sqp) is implemented to allow single load and store of C matrix in kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication,
    since real alpha and beta scaling is handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.

Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
2021-02-12 15:57:59 +05:30
Kiran Varaganti
29ddec241a Merge "DGEMM Optimizations for smaller dimensions" into amd-staging-milan-3.1 2021-02-11 08:22:36 -05:00
Kiran Varaganti
a7d43cf720 DGEMM Optimizations for smaller dimensions
Modified dgemm_ to be able to call the small_gemm 16x3 kernel.
small_gemm is called if ((m + n - k) < 2000 && (m + k - n) < 2000 && (n + k - m) < 2000) && n > 2.
small_gemm kernel: if m, n, or k is 0, the kernel returns and the case is handled by the sup or native kernel.

[CPUPL-1376]

Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b
2021-02-11 11:05:42 +05:30
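
The dispatch condition quoted in the commit above can be written as a small predicate. The function name `use_small_gemm` and the macro `SMALL_GEMM_LIMIT` are hypothetical. Folding the zero-dimension case into the predicate is an assumption made for the sketch, since the commit describes that check as living inside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold taken from the commit message; the macro name is hypothetical. */
#define SMALL_GEMM_LIMIT 2000

/* Returns true when the small_gemm path would be chosen for an
   m x n x k problem, following the condition in the commit message. */
bool use_small_gemm(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    /* Assumption: zero-sized problems are left to the sup/native path. */
    if (m == 0 || n == 0 || k == 0)
        return false;

    return (m + n - k) < SMALL_GEMM_LIMIT &&
           (m + k - n) < SMALL_GEMM_LIMIT &&
           (n + k - m) < SMALL_GEMM_LIMIT &&
           n > 2;
}
```

Each of the three sums bounds one pair of dimensions relative to the third, so a problem with two large dimensions is rejected even when the third is small.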
Mangala V
503e912fc5 Merge "Modified blas interface of TRSM to call TRSV whenever m=1 or n=1." into amd-staging-milan-3.1 2021-02-11 00:21:45 -05:00
managalv
8face536fd Modified blas interface of TRSM to call TRSV whenever m=1 or n=1.
TRSM API: AX = B, where X=B
  Case1: Call TRSV when matrix B is vector & A is matrix,
         When n = 1 for left side and when m = 1 for right side
  Case2: Divide B/A when matrix B is vector & A is scalar(Diagonal element),
         When m = 1 for left side and when n = 1 for right side
  For right side, Transpose complete operation, Change upper to lower and
                  vice versa when A is being transposed

Change-Id: Ib020f2a568f04a6e8d8f75bfc38adbfd7c5d175a
2021-02-11 18:47:37 +05:30
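
The two cases in the commit above amount to a choice based on the side and the dimensions of B (m x n). A minimal sketch of that choice follows. The enum and function names are hypothetical. Giving the scalar-divide case priority when both m and n are 1 is an assumption, as the commit message does not say which case wins.

```c
#include <assert.h>

typedef enum { SIDE_LEFT, SIDE_RIGHT } side_sketch_t;
typedef enum { PATH_TRSM, PATH_TRSV, PATH_SCALAR_DIVIDE } trsm_path_t;

/* Pick a solve path for TRSM with B of size m x n.
   A is m x m on the left side and n x n on the right side. */
trsm_path_t choose_trsm_path(side_sketch_t side, int m, int n)
{
    assert(m >= 0 && n >= 0);

    int a_dim   = (side == SIDE_LEFT) ? m : n;  /* order of triangular A   */
    int b_other = (side == SIDE_LEFT) ? n : m;  /* number of right-hand sides */

    /* Case 2: A is a single diagonal element, so B is divided by it. */
    if (a_dim == 1)
        return PATH_SCALAR_DIVIDE;

    /* Case 1: B is a single vector and A is a matrix, so TRSV applies. */
    if (b_other == 1)
        return PATH_TRSV;

    return PATH_TRSM;
}
```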
Madan mohan Manokar
3ab9104dae Handling zgemm real(+/-1) alpha and beta
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Change done in bli_zgemmsup_rv_zen_asm_3x4n.
3. Change done in bli_zgemmsup_rv_zen_asm_3x4m.
4. Change done in bli_zgemm_haswell_asm_3x4.

Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
2021-02-10 02:58:58 -05:00
managalv
1ff4981203 Modified blas interface of TRSM to call TRSV whenever m=1 or n=1.
Case1: Call TRSV when matrix C & B are vector & A is matrix,
         When n = 1 for left side and when m = 1 for right side
  Case2: Divide B/A when matrix C & B are vector & A is scalar(Diagonal element),
         When m = 1 for left side and when n = 1 for right side
  For right side, Transpose complete operation, Change upper to lower and
                  vice versa when A is being transposed

Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
2021-02-08 19:02:25 +05:30
Madan mohan Manokar
f1ea1f1d34 Adaptive zgemm
1. 3m1 chosen for (m<=128) & (68>n<=128) & (k<=128)
2. Default blis3.1 path for rest of the sizes.

Change-Id: I1e50dece013e72a67f1162faef5cbeb9bfbbc23a
AMD-Internal: [CPUPL-1352]
2021-02-03 12:43:57 +05:30
Meghana Vankadari
2e7cf8d82f Added 16x4 AXPYF kernel for zen2 config
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified blas interface of GEMM to call GEMV whenever m=1 or n=1.

Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
2021-02-02 21:22:44 +05:30
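
As an illustration of what a fuse factor of 4 means for AXPYF, the scalar sketch below applies four column updates to y in a single pass. The function `axpyf4_sketch` is hypothetical, uses no SIMD and no loop unrolling, and is not the zen2 kernel added in the commit above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fused axpy: y := y + alpha * A * x, where A is m x 4,
   stored column-major with leading dimension lda, and x has 4 elements. */
void axpyf4_sketch(int m, double alpha,
                   const double *a, int lda,
                   const double *x, double *y)
{
    assert(m >= 0 && lda >= m);

    /* Pre-scale the four x entries once. */
    const double s0 = alpha * x[0], s1 = alpha * x[1];
    const double s2 = alpha * x[2], s3 = alpha * x[3];

    /* Pointers to the four columns of A. */
    const double *a0 = a;
    const double *a1 = a + (size_t)lda;
    const double *a2 = a + 2 * (size_t)lda;
    const double *a3 = a + 3 * (size_t)lda;

    /* One read and one write of y[i] covers four columns. */
    for (int i = 0; i < m; ++i)
        y[i] += s0 * a0[i] + s1 * a1[i] + s2 * a2[i] + s3 * a3[i];
}
```

Compared with four separate axpy calls, the fused form loads and stores each element of y once instead of four times.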
dzambare
48f2366b6f Updated BLIS version string to "AOCL BLIS X.x" format
AMD-Internal : [CPUPL-1394]

Change-Id: Ifebcb14d9eb064d231b831f5a1e151853ad5a009
2021-01-07 12:38:32 +05:30
Nagendra Prasad M
566f586547 Merge "Blis: DOTC Additional argument for Complex types when using FLANG" into amd-staging-milan-3.1 2020-12-21 06:03:11 -05:00
nprasadm
10ac4e2aba Blis: DOTC Additional argument for Complex types when using FLANG
Merged the changes done in UT Austin BLIS repo for DOTC Additional
argument.
Other modifications related to test application included.

Verified the above code changes through scalapack test applications 'xztrd' , 'xctrd'

Change-Id: I7e16f3953db71890f9e8fbb0f7b363eaad899f62
Signed-off-by: Nagendra <Nagendra.PrasadM@amd.com>
AMD-Internal: [CPUPL-1323]
2020-12-16 14:03:10 +05:30
Kiran Varaganti
fc80892bb2 Improve sup GEMM performance (CCC - row prefer kernel)
For the column-storage (CCC) case where m is large and n and k are relatively small, row-preferred kernels
call the var1n sup kernels, but block-panel var2m actually works better here.
After induced transposition, n becomes m (which is large) and m becomes n (which is smaller).
The micropanels of induced B are larger than the micropanels of induced A, so var2m is a better option than var1n.
[CPUPL-1376]

Change-Id: I9214140d340ea4ac3edfefc31c465c926ba93326
2020-12-10 19:16:44 +05:30
Dipal M Zambare
66fd5e547a Update AMD copyright notice for current year.
Change-Id: I2ffd3d3306499922be15638d37c4d1e806acd36c
AMD-Internal: [CPUPL-1367]
2020-12-10 13:44:29 +05:30
Dipal M Zambare
38a8008cd8 Enabled znver3 flag for zen3 architecture
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3

Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
2020-12-04 12:28:22 +05:30
Meghana Vankadari
e083caf01d Merge "Correcting zdotc definition error for configs other than zen family" into amd-staging-milan-3.0 2020-12-01 06:20:10 -05:00
Dipal M Zambare
c2f63fcc54 Update amd64 bundle configuration
The configuration is updated to

   - Enable EPYC architecture optimizations
   - Macros to override block sizes.

AMD-Internal : [CPUPL-1350]

Change-Id: Id712f9abe6e81c9ece2baaab9d965b405e72977a
2020-12-01 14:37:13 +05:30
Meghana Vankadari
11b4cd8fc5 Correcting zdotc definition error for configs other than zen family
Details:
- when BLIS_CONFIG_EPYC is not defined, zdotc is defined twice.
  - One definition is part of macro based code.
  - Other definition is implemented as part of framework optimizations.
- Modified the bla_dot.c file to choose macro based code for configs
  other than zen family.

AMD-Internal: [CPUPL-1348]
Change-Id: I9ef6a590a6199e173d38248c3fb72feddfb20922
2020-12-01 13:33:59 +05:30
bhaskarn
91909c1562 Fix for segmentation crash in dgemmsup kernels
Description:

[AMD Internal]: CPUPL-1336

Removed extra/unnecessary loads in dgemmsup kernels that were
accessing memory beyond the boundaries and causing a segmentation
fault.

Kernels:
bli_dgemmsup_rd_haswell_asm_1x4
bli_dgemmsup_rv_haswell_asm_1x6

Change-Id: Idaeed36ebd9f13550943394a37e372b8d015b2d3
2020-11-24 10:15:57 -05:00
Kumar, Phani
477fc41fff Cmake script changes and blis.h changes for amd-staging-milan-3.0
AMD Internal : [CPUPL-1083]

Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580
2020-11-24 06:12:25 -05:00
Dipal M Zambare
0a3d94c9a2 Updated test drivers for dotv, scalv and swapv.
Added traces in cblas layer for these API's.
These test drivers didn't have calls for complex data
types; the drivers are updated to support them.

AMD-Internal : [CPUPL-1315]

Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
2020-11-24 10:26:32 +05:30
Meghana Vankadari
97753d8e6b Modified log routines for gemm, gemmt and trsm
Details:
- Modified log routines to accept inputs from blas layer instead of
  oapi level.

AMD-Internal: [CPUPL-1332]
Change-Id: If33c3585af92e617910ae8f7d442d1275119bbfc
2020-11-23 04:53:15 -05:00
Madan mohan Manokar
1d8fab0996 Test driver fix for her and her2
Fixed test driver code for her, her2
Support added to handle complex and double complex data type in test driver.

Change-Id: If65939e99d8cf77e0fb70561166d84bf67d0321d
AMD-Internal: [CPUPL-1326]
2020-11-23 04:10:43 -05:00
Dipal Madhukar Zambare
22270aa9e4 Merge "Added debug log and trace for gemv and dotv for blis and cblas interface" into amd-staging-milan-3.0 2020-11-23 03:51:48 -05:00
managalv
fdc0e70cd8 Added debug log and trace for gemv and dotv for blis and cblas interface
AMD Internal: [CPUPL-1314]

Change-Id: I2708fd9c73419c968c8e02ff11545645dc639052
2020-11-23 19:55:21 +05:30
Kiran Varaganti
80a516382e Fixed wrong dimensions check in bench/bench_gemm.c application
Removed the validity checks on m, n, k, lda, ldb and ldc,
since the bench app is run on logs collected from AOCL traces.
A correct check should consider the transpose parameter and storage order.

Change-Id: If0fbf733c2650c6f328661293eb99d062685d638
2020-11-20 20:39:20 +05:30
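
For reference, a check that does take the transpose parameter and storage order into account could look like the sketch below. It follows the standard BLAS rule that a leading dimension must be at least the stored extent along the contiguous direction, and at least 1. The type and function names are hypothetical and are not taken from bench_gemm.c.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ORDER_COL_MAJOR, ORDER_ROW_MAJOR } order_sketch_t;

/* Minimum legal leading dimension for an operand whose op(X) is
   rows x cols, given the storage order and whether X is transposed. */
int min_leading_dim(order_sketch_t order, bool transposed, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);

    /* The stored shape is swapped when the operand is transposed. */
    int stored_rows = transposed ? cols : rows;
    int stored_cols = transposed ? rows : cols;

    /* Column-major strides over rows; row-major strides over columns. */
    int span = (order == ORDER_COL_MAJOR) ? stored_rows : stored_cols;
    return span > 1 ? span : 1;
}

/* op(A) is m x k, op(B) is k x n, C is m x n. */
bool gemm_dims_valid(order_sketch_t order, bool transa, bool transb,
                     int m, int n, int k, int lda, int ldb, int ldc)
{
    if (m < 0 || n < 0 || k < 0)
        return false;

    return lda >= min_leading_dim(order, transa, m, k) &&
           ldb >= min_leading_dim(order, transb, k, n) &&
           ldc >= min_leading_dim(order, false,  m, n);
}
```

For example, with column-major storage and m = 4, n = 3, k = 2, an untransposed A needs lda >= 4, while a transposed A only needs lda >= 2.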