1. Replaced bli_malloc with standard malloc plus manual address alignment within 3m_sqp.
2. Added a function to pack the real part, imaginary part, and their sum for A.
3. Added a function to pack the real part, imaginary part, and their sum for B.
4. Added a function to pack the real and imaginary parts of C and handle beta.
5. Vectorized the sum and sub operations.
AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
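Item 1 above swaps bli_malloc for plain malloc with manual alignment. A minimal sketch of that general technique, assuming a power-of-two alignment; the names aligned_buf_t, aligned_buf_alloc and aligned_buf_free are hypothetical and not taken from the BLIS source:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical holder: keeps the raw malloc pointer so it can be freed later. */
typedef struct {
    void *raw;      /* pointer returned by malloc; this is what free() needs */
    void *aligned;  /* address rounded up to the requested alignment         */
} aligned_buf_t;

/* Over-allocate by (align - 1) bytes, then round the address up to the next
   multiple of align. align must be a non-zero power of two. */
aligned_buf_t aligned_buf_alloc(size_t size, size_t align)
{
    aligned_buf_t buf = { NULL, NULL };
    assert(align != 0 && (align & (align - 1)) == 0);

    buf.raw = malloc(size + align - 1);
    if (buf.raw != NULL) {
        uintptr_t addr = (uintptr_t)buf.raw;
        addr = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
        buf.aligned = (void *)addr;
    }
    return buf;
}

/* Release the buffer through the raw pointer, never the aligned one. */
void aligned_buf_free(aligned_buf_t *buf)
{
    free(buf->raw);
    buf->raw = NULL;
    buf->aligned = NULL;
}
```

The raw pointer has to be kept alongside the aligned one, because only the address returned by malloc may be passed to free.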
Replaced "*MKSTR(ch)" in the DTL call "AOCL_DTL_LOG_GEMM_INPUTS(AOCL_DTL_LEVEL_TRACE_1, *MKSTR(ch)...)" with "D" and "Z" for dgemm_ and zgemm_ respectively, to prevent printing the wrong data type.
[CPUPL-1449]
Change-Id: Ic91537189352bdb164411799e127de990a5c9a08
Details:
- This implementation transposes A while packing a 16xk panel of the A
buffer and passes the packed panel to the 16x3-nn kernel.
- The same implementation works for the case where B is transposed.
AMD-Internal: [CPUPL-1376]
Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17
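A rough sketch of transposing while packing, as described in the details above. It assumes column-major storage, a stored A of size k x m whose transpose is the m x k operand, and a packed panel laid out as 16 x k column-major. The function name and layout are illustrative assumptions, not the actual BLIS packing routine:

```c
#include <assert.h>
#include <stddef.h>

#define PANEL_ROWS 16  /* rows per packed panel, matching the 16x3 kernel */

/* a      : stored matrix, k x m column-major with leading dimension lda;
            the operand op(A) = A^T is m x k.
   packed : output, PANEL_ROWS x k column-major, contiguous.
   Copies rows [0, PANEL_ROWS) of op(A) so that a no-transpose kernel can
   read the panel with unit stride down each column. */
void pack_a_transposed_16xk(const double *a, size_t lda, size_t k,
                            double *packed)
{
    assert(lda >= k);
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < PANEL_ROWS; ++i)
            packed[i + p * PANEL_ROWS] = a[p + i * lda];  /* op(A)(i,p) */
}
```

After packing, the kernel sees an ordinary untransposed 16 x k panel, which is why a single nn kernel can serve the transposed case.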
Changed the dgemm_ and zgemm_ interfaces to support multi-threaded GEMM implementations. When the number of threads is greater than one, the multi-threaded gemm (sup or native) is called; for a single thread, one of several single-threaded gemm implementations is chosen based on the matrix dimensions.
[CPUPL-1376]
Change-Id: I2e37145ec9a07d6b7e7be1719bd49239e813aa8a
Details:
- Decision logic to choose small_gemm has been moved to the BLAS interface.
- All calls to small_gemm from gemm_front are now redirected to the native
implementation.
AMD-Internal: [CPUPL-1376]
Change-Id: I6490f67113e9f7c272269f441c86f2a0b3c89a53
Details:
- This kernel works best for cases where k = 1.
- This implementation is called directly from the BLAS interface when neither
A nor B is transposed and k = 1.
AMD-Internal: [CPUPL-1376]
Change-Id: I3b31673a28290c81d4a4cb64c8605d56e50b5d3d
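With k = 1 and no transposes, GEMM reduces to a rank-1 update: C := beta*C + alpha * a * b, where a is a column of length m and b is a row of length n. A plain reference sketch of that reduction, assuming column-major storage (dgemm_k1_ref is a hypothetical name, and this is not the optimized kernel):

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) := beta * C + alpha * a * b, with a = A(:,0) and b = B(0,:).
   Column-major; B is 1 x n, so element (0,j) lives at b[j * ldb].
   Unlike a production BLAS, beta == 0 is not special-cased here. */
void dgemm_k1_ref(size_t m, size_t n, double alpha,
                  const double *a,
                  const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    assert(ldb >= 1 && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        const double ab = alpha * b[j * ldb];  /* hoisted out of the inner loop */
        for (size_t i = 0; i < m; ++i)
            c[i + j * ldc] = beta * c[i + j * ldc] + ab * a[i];
    }
}
```

Each element of C is touched once and there is no reduction over k, which is why a dedicated path can avoid the packing overhead of the general algorithm.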
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.
AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
1. The SquarePacked algorithm targets an efficient zgemm/dgemm implementation for square matrix sizes (m=k=n).
2. A variation of the 3m algorithm (3m_sqp) is implemented to allow a single load and store of the C matrix in the kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication,
since real alpha and beta scaling is handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
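For context, the textbook 3m method forms a complex product from three real products: P1 = Ar*Br, P2 = Ai*Bi, P3 = (Ar+Ai)*(Br+Bi), giving Cr = P1 - P2 and Ci = P3 - P1 - P2. The sketch below shows only that arithmetic on split real/imaginary planes. It does not reproduce the 3m_sqp variation (single load/store of C, m a multiple of 8), and all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* c (m x n) += a (m x k) * b (k x n); all dense, column-major. */
void real_gemm_acc(size_t m, size_t n, size_t k,
                   const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p)
            for (size_t i = 0; i < m; ++i)
                c[i + j * m] += a[i + p * m] * b[p + j * k];
}

/* C += A * B for complex matrices stored as separate real/imag planes.
   Returns 0 on success, -1 if a temporary buffer could not be allocated. */
int zgemm_3m_ref(size_t m, size_t n, size_t k,
                 const double *ar, const double *ai,
                 const double *br, const double *bi,
                 double *cr, double *ci)
{
    assert(m > 0 && n > 0 && k > 0);

    double *a_sum = malloc(m * k * sizeof *a_sum);  /* Ar + Ai */
    double *b_sum = malloc(k * n * sizeof *b_sum);  /* Br + Bi */
    double *p1 = calloc(m * n, sizeof *p1);         /* Ar * Br */
    double *p2 = calloc(m * n, sizeof *p2);         /* Ai * Bi */
    double *p3 = calloc(m * n, sizeof *p3);         /* (Ar+Ai) * (Br+Bi) */
    int status = -1;

    if (a_sum && b_sum && p1 && p2 && p3) {
        for (size_t i = 0; i < m * k; ++i) a_sum[i] = ar[i] + ai[i];
        for (size_t i = 0; i < k * n; ++i) b_sum[i] = br[i] + bi[i];

        /* Three real multiplications instead of four. */
        real_gemm_acc(m, n, k, ar, br, p1);
        real_gemm_acc(m, n, k, ai, bi, p2);
        real_gemm_acc(m, n, k, a_sum, b_sum, p3);

        for (size_t i = 0; i < m * n; ++i) {
            cr[i] += p1[i] - p2[i];
            ci[i] += p3[i] - p1[i] - p2[i];
        }
        status = 0;
    }
    free(a_sum); free(b_sum); free(p1); free(p2); free(p3);
    return status;
}
```

The three real products are ordinary dgemm-shaped operations, which is how a real kernel such as dgemm_sqp can serve as the inner building block.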
Modified dgemm_ to be able to call the small_gemm 16x3 kernel.
small_gemm is called if ((m + n - k) < 2000 && (m + k - n) < 2000 && (n + k - m) < 2000) && n > 2.
small_gemm kernel: if m, n, or k is 0 it returns, and the case is handled by the sup or native kernel.
[CPUPL-1376]
Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b
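The selection condition above can be written as a small predicate. This is a sketch: the function name use_small_gemm is hypothetical, and folding the zero-dimension case into the predicate is an assumption (the commit says small_gemm itself returns in that case and leaves it to sup or native):

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the dimensions satisfy the small_gemm condition quoted
   in the commit message. The 2000 threshold and n > 2 come from that text. */
bool use_small_gemm(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    /* Assumption: zero-sized problems are left to the sup/native path. */
    if (m == 0 || n == 0 || k == 0)
        return false;

    return (m + n - k) < 2000 &&
           (m + k - n) < 2000 &&
           (n + k - m) < 2000 &&
           n > 2;
}
```

Note that the n > 2 check reflects the 3-column width of the 16x3 kernel.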
TRSM API: AX = B, where X=B
Case 1: Call TRSV when B is a vector and A is a matrix,
i.e. when n = 1 for the left side and when m = 1 for the right side.
Case 2: Divide B by A when B is a vector and A is a scalar (diagonal element),
i.e. when m = 1 for the left side and when n = 1 for the right side.
For the right side, the complete operation is transposed; because A is
then transposed, upper is changed to lower and vice versa.
Change-Id: Ib020f2a568f04a6e8d8f75bfc38adbfd7c5d175a
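The two special cases can be expressed as a dispatch on the side and the dimensions of B (m x n). A minimal sketch with hypothetical enum and function names; checking the scalar case first when both m and n are 1 is an assumption, since the commit message does not state the order:

```c
#include <assert.h>

typedef enum { TRSM_SIDE_LEFT, TRSM_SIDE_RIGHT } trsm_side_t;
typedef enum {
    TRSM_PATH_GENERAL,     /* regular TRSM                          */
    TRSM_PATH_TRSV,        /* Case 1: B is a vector, A is a matrix  */
    TRSM_PATH_SCALAR_DIV   /* Case 2: A is 1x1, divide B by it      */
} trsm_path_t;

/* B is m x n. A is m x m for the left side and n x n for the right side. */
trsm_path_t trsm_pick_path(trsm_side_t side, long m, long n)
{
    assert(m >= 0 && n >= 0);

    const long a_order = (side == TRSM_SIDE_LEFT) ? m : n;  /* order of A    */
    const long b_other = (side == TRSM_SIDE_LEFT) ? n : m;  /* remaining dim */

    if (a_order == 1)
        return TRSM_PATH_SCALAR_DIV;  /* m = 1 (left) or n = 1 (right) */
    if (b_other == 1)
        return TRSM_PATH_TRSV;        /* n = 1 (left) or m = 1 (right) */
    return TRSM_PATH_GENERAL;
}
```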
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Changed bli_zgemmsup_rv_zen_asm_3x4n.
3. Changed bli_zgemmsup_rv_zen_asm_3x4m.
4. Changed bli_zgemm_haswell_asm_3x4.
Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
Case 1: Call TRSV when C and B are vectors and A is a matrix,
i.e. when n = 1 for the left side and when m = 1 for the right side.
Case 2: Divide B by A when C and B are vectors and A is a scalar (diagonal element),
i.e. when m = 1 for the left side and when n = 1 for the right side.
For the right side, the complete operation is transposed; because A is
then transposed, upper is changed to lower and vice versa.
Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified blas interface of GEMM to call GEMV whenever m=1 or n=1.
Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
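Why GEMM can hand off to GEMV when m = 1 or n = 1: with n = 1, C is a column and C := alpha*A*b + beta*c is a plain GEMV; with m = 1, C is a row and the same result is a transposed GEMV on B with strided vectors. A reference sketch for the no-transpose, column-major case (function names are hypothetical and this is not the BLIS interface code):

```c
#include <assert.h>
#include <stddef.h>

/* y := beta*y + alpha*op(A)*x, A is rows x cols column-major.
   trans == 0: op(A) = A; trans != 0: op(A) = A^T. */
void dgemv_ref(int trans, size_t rows, size_t cols, double alpha,
               const double *a, size_t lda,
               const double *x, size_t incx,
               double beta, double *y, size_t incy)
{
    const size_t leny = trans ? cols : rows;
    const size_t lenx = trans ? rows : cols;
    for (size_t i = 0; i < leny; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < lenx; ++j) {
            const double aij = trans ? a[j + i * lda] : a[i + j * lda];
            acc += aij * x[j * incx];
        }
        y[i * incy] = beta * y[i * incy] + alpha * acc;
    }
}

/* C (m x n) := beta*C + alpha*A*B, no transposes, column-major. */
void dgemm_nn_ref(size_t m, size_t n, size_t k, double alpha,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);

    if (n == 1) {  /* C and B are columns: y = A * x */
        dgemv_ref(0, m, k, alpha, a, lda, b, 1, beta, c, 1);
        return;
    }
    if (m == 1) {  /* C and A are rows: y^T = B^T * x^T, strided by ld */
        dgemv_ref(1, k, n, alpha, b, ldb, a, lda, beta, c, ldc);
        return;
    }
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = beta * c[i + j * ldc] + alpha * acc;
        }
}
```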
Merged the changes made in the UT Austin BLIS repo for the additional DOTC
argument.
Included other modifications related to the test application.
Verified the above code changes with the ScaLAPACK test applications 'xztrd' and 'xctrd'.
Change-Id: I7e16f3953db71890f9e8fbb0f7b363eaad899f62
Signed-off-by: Nagendra <Nagendra.PrasadM@amd.com>
AMD-Internal: [CPUPL-1323]
In the column-storage (CCC) case where m is large and n and k are relatively small, row-preferred kernels are used
and the var1n sup kernels are called. However, the block-panel var2m works better here.
After induced transposition, n takes the value of m, which is large, and m takes the value of n, which is smaller.
The micropanels of the induced B are larger than the micropanels of the induced A, so var2m is a better option than var1n.
[CPUPL-1376]
Change-Id: I9214140d340ea4ac3edfefc31c465c926ba93326
The znver3 flag is enabled if the compiler is AOCC Clang version 3.0
and the configuration is zen3.
Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
Details:
- When BLIS_CONFIG_EPYC is not defined, zdotc is defined twice.
- One definition is part of macro based code.
- Other definition is implemented as part of framework optimizations.
- Modified the bla_dot.c file to choose macro based code for configs
other than zen family.
AMD-Internal: [CPUPL-1348]
Change-Id: I9ef6a590a6199e173d38248c3fb72feddfb20922
Description:
[AMD Internal]: CPUPL-1336
Removed extra/unnecessary loads in dgemmsup kernels that were
accessing memory beyond the boundaries and causing segmentation
faults.
Kernels:
bli_dgemmsup_rd_haswell_asm_1x4
bli_dgemmsup_rv_haswell_asm_1x6
Change-Id: Idaeed36ebd9f13550943394a37e372b8d015b2d3
Added traces in the cblas layer for these APIs.
The test drivers didn't have calls for complex data
types; the drivers are updated to support them.
AMD-Internal: [CPUPL-1315]
Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
Fixed test driver code for her and her2.
Added support for complex and double complex data types in the test driver.
Change-Id: If65939e99d8cf77e0fb70561166d84bf67d0321d
AMD-Internal: [CPUPL-1326]
Removed the validation of the m, n, k, lda, ldb and ldc values,
since the bench app is run on logs collected from AOCL traces.
A correct check would need to consider the transpose parameter and storage order.
Change-Id: If0fbf733c2650c6f328661293eb99d062685d638
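A sketch of what a transpose- and storage-aware check could look like, based on the standard BLAS/CBLAS leading-dimension rules (the leading dimension must cover the stored rows for column-major and the stored columns for row-major). The enum and function names are hypothetical and this is not the bench app's code:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ORDER_ROW_MAJOR, ORDER_COL_MAJOR } order_t;

/* Minimum leading dimension for a stored matrix whose op() is
   op_rows x op_cols. Transposition swaps the stored shape. */
long min_leading_dim(order_t order, bool transposed,
                     long op_rows, long op_cols)
{
    assert(op_rows >= 0 && op_cols >= 0);
    const long stored_rows = transposed ? op_cols : op_rows;
    const long stored_cols = transposed ? op_rows : op_cols;
    const long ld = (order == ORDER_COL_MAJOR) ? stored_rows : stored_cols;
    return ld > 1 ? ld : 1;  /* BLAS requires ld >= max(1, dim) */
}

/* GEMM: op(A) is m x k, op(B) is k x n, C is m x n (never transposed). */
bool gemm_dims_valid(order_t order, bool transa, bool transb,
                     long m, long n, long k,
                     long lda, long ldb, long ldc)
{
    if (m < 0 || n < 0 || k < 0)
        return false;
    return lda >= min_leading_dim(order, transa, m, k) &&
           ldb >= min_leading_dim(order, transb, k, n) &&
           ldc >= min_leading_dim(order, false, m, n);
}
```

For example, with column-major storage and m=4, n=3, k=2, lda = 2 is invalid for untransposed A but valid when A is transposed.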
Fixed test driver code for the her, her2, herk and her2k functions.
These functions support only complex and double complex data types; the test code is updated accordingly.
Change-Id: Iee7b79abda4a2959a265c420d23879bf47f2c38d
AMD-Internal: [CPUPL-1313]
Block sizes (MC, KC, NC) for DGEMM are determined at runtime
based on the following parameters:
- Single or multithreaded build
- Processor architecture (currently only zen3 is supported)
- Number of threads requested while running the library
Change-Id: Ia793484b77adb87486e630d0d3b4c7856ae52094
AMD-Internal: [CPUPL-660, CPUPL-661]
Added blis.h in aoclos.c in order to check if BLIS was
built with OpenMP support.
AOCL-Internal: [CPUPL-1238]
Change-Id: I366da030266b9d7f2ad09dc722847a7d86b85933
Details:
The native method is enabled for complex gemm.
Performance needs to be measured on large datasets before enabling the induced method.
AMD-Internal: [CPUPL-1300]
Change-Id: I5444dd31e8b8e73da73f789da8b64276e8e40de8
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a