Details:
- Implemented zaxpyf kernel with fuse factor=4 for zgemv.
- Modified BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv reduces to a scalv of Y by beta. Added code to
return early after scaling the Y vector by beta.
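The two shortcuts above (early return for alpha = 0, and a single dotv when y has one element) can be sketched roughly as follows. This is a minimal illustration in real double precision, not the zgemv code from this commit; the names gemv_sketch, scalv_sketch and dotv_sketch are hypothetical, and column-major storage with unit vector strides is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y := beta * y. A zero beta overwrites y outright,
   so stale NaN/Inf values in y are not propagated. */
static void scalv_sketch(size_t m, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
}

/* Hypothetical helper: dot product of a strided row of A with x. */
static double dotv_sketch(size_t n, const double *a, size_t inca, const double *x)
{
    double acc = 0.0;
    for (size_t j = 0; j < n; ++j)
        acc += a[j * inca] * x[j];
    return acc;
}

/* y := beta * y + alpha * A * x, with A m x n, column-major, no transpose. */
void gemv_sketch(size_t m, size_t n, double alpha, const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    assert(lda >= m);
    scalv_sketch(m, beta, y);
    if (alpha == 0.0)
        return;                         /* early return: gemv is only a scalv */
    if (m == 1) {                       /* y has one element: a single dotv */
        y[0] += alpha * dotv_sketch(n, a, lda, x);
        return;
    }
    for (size_t j = 0; j < n; ++j) {    /* general case: one axpy per column */
        const double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += ax * a[i + j * lda];
    }
}
```

The general branch processes one column per axpy; a fused axpyf kernel such as the one in this commit would instead handle several columns per pass over y.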
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
Details:
1. Added optimized dtrsm kernels for all 8 right side cases
A few notable optimizations that improved performance:
a. Loading, transposing (for transa cases), packing and reusing
the a01 block required for the GEMM operation. The block size
increases from 0 to 6x(n-6) in steps of 6x6 while solving TRSM
from one end of the triangular matrix A to the other.
b. Packing the 6 diagonal elements in one location, which helps to use
the cache line efficiently.
AMD-Internal: [CPUPL-1563]
Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv reduces to a scalv of Y by beta. Added code to
return early after scaling the Y vector by beta.
AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
Details:
1. Added optimized dtrsm kernels for all 8 left side cases
A few notable optimizations that improved performance:
a. Loading, transposing (for transa cases), packing and reusing
the a10 block required for the GEMM operation. The block size
increases from 0 to 8x(m-8) in steps of 8x8 while solving TRSM
from one end of the triangular matrix A to the other.
b. Performing an in-register transpose whenever required.
c. Packing the 8 diagonal elements in one location, which helps to use
the cache line efficiently.
2. Enabled calling dtrsm small for smaller sizes at the cblas level itself
to avoid framework overhead, which is significant for very small
sizes.
3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun
and manideep.kurumella@amd.com for implementing lut kernels
using intrinsics.
4. Removed all older implementations of strsm that were not
developed as per the guidelines; they can be referred to in
older releases if required.
Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6
Details:
Address increment was missing in bli_sgemmsup_rv_zen_asm_1x16 kernel
while storing output in column major order in beta zero case
JIRA: CPUPL-1548
Change-Id: I36269cd28de6fbef2256451e399f90f0437b0ce1
Fixed merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
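A rough sketch of the prime-count check described above. The threshold mirrors the stated default of BLIS_NT_MAX_PRIME (11); the function names are hypothetical, and reducing the count by exactly one is an assumption made for illustration, since the text only says the number of threads is reduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the default of BLIS_NT_MAX_PRIME given in the text. */
#define NT_MAX_PRIME_SKETCH 11

/* Hypothetical helper: trial-division primality test. */
static bool is_prime_sketch(long n)
{
    if (n < 2)
        return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

/* Returns the thread count to factorize. ASSUMPTION: a prime count above
   the threshold is reduced by one so that it has non-trivial factors. */
long adjust_num_threads_sketch(long nt)
{
    assert(nt >= 1);
    if (nt > NT_MAX_PRIME_SKETCH && is_prime_sketch(nt))
        return nt - 1;
    return nt;
}
```

Under these assumptions a request for 13 threads would be factorized as 12, while requests for 11 or 16 threads are left unchanged.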
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Removed unused memory operations.
Made labels unique within each file.
The row stride update is now done once to avoid multiple mul instructions.
AMD Internal : [CPUPL-1419]
Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103
1. Replaced bli_malloc with normal malloc plus address alignment within 3m_sqp.
2. Added a function to pack the real, imaginary, and sum parts of A.
3. Added a function to pack the real, imaginary, and sum parts of B.
4. Added a function to pack the real and imaginary parts of C and handle beta.
5. Vectorized sum and sub.
AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
1. CMake script changes for adding new files to the build.
2. Added upper-case support for a couple of APIs.
3. Defined bool, since it is not supported in clang.
AMD Internal : [CPUPL-1422]
Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
Details:
- This implementation does a transpose operation while packing 16xk of A
buffer and passes it to 16x3-nn kernel.
- The same implementation works for the case where B has transpose.
AMD-Internal: [CPUPL-1376]
Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17
Details:
- Decision logic to choose small_gemm has been moved to blas interface.
- Redirecting all the calls to small_gemm from gemm_front to native
implementation.
AMD-Internal: [CPUPL-1376]
Change-Id: I6490f67113e9f7c272269f441c86f2a0b3c89a53
Details:
- This kernel works best for cases where k = 1.
- This implementation is called directly from blas interface when A, B
matrices have no-transpose and k = 1.
AMD-Internal: [CPUPL-1376]
Change-Id: I3b31673a28290c81d4a4cb64c8605d56e50b5d3d
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.
AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
1. SquarePacked algorithm focuses on efficient zgemm/dgemm implementation for square matrix sizes (m=k=n)
2. Variation of 3m algorithm (3m_sqp) is implemented to allow single load and store of C matrix in kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication,
since real alpha and beta scaling is handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
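For background on item 2, the generic 3m identity computes a complex product with three real multiplications instead of four. The scalar sketch below shows only that identity; in a matrix setting each product becomes a real gemm. The names zsketch_t and zmul_3m_sketch are hypothetical, and the 3m_sqp variant in this commit may arrange the terms differently.

```c
/* Hypothetical complex type used only for this illustration. */
typedef struct { double re, im; } zsketch_t;

/* Complex multiply via the 3m identity:
     re = a.re*b.re - a.im*b.im
     im = (a.re + a.im)*(b.re + b.im) - a.re*b.re - a.im*b.im  */
zsketch_t zmul_3m_sketch(zsketch_t a, zsketch_t b)
{
    const double p1 = a.re * b.re;                  /* real * real        */
    const double p2 = a.im * b.im;                  /* imag * imag        */
    const double p3 = (a.re + a.im) * (b.re + b.im); /* sum * sum         */
    zsketch_t c = { p1 - p2, p3 - p1 - p2 };
    return c;
}
```

The sums (a.re + a.im) and (b.re + b.im) are why a 3m implementation typically packs a real part, an imaginary part, and a sum for each operand.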
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
Modified dgemm_ to be able to call the small_gemm 16x3 kernel.
small_gemm is called if ((m + n - k) < 2000 && (m + k - n) < 2000 && (n + k - m) < 2000) && n > 2.
small_gemm kernel: if m, n, or k is 0, the kernel returns and the case is handled by the sup or native kernel.
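The selection condition above can be written as a small predicate. This is a sketch only: the function name is hypothetical, and folding the zero-dimension case into the predicate (rather than having the kernel return early) is a simplification made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the small_gemm path would be chosen under the
   condition stated in the commit message. */
bool use_small_gemm_sketch(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    if (m == 0 || n == 0 || k == 0)
        return false;   /* left to the sup or native path */
    return (m + n - k) < 2000 &&
           (m + k - n) < 2000 &&
           (n + k - m) < 2000 &&
           n > 2;
}
```

Note that the three sums bound the dimensions pairwise rather than individually, so a square problem with m = n = k = 1500 still qualifies.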
[CPUPL - 1376]
Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b
TRSM API: solves AX = B, where X overwrites B
Case 1: Call TRSV when matrix B is a vector and A is a matrix,
i.e. when n = 1 for the left side and when m = 1 for the right side.
Case 2: Divide B by A when matrix B is a vector and A is a scalar (diagonal element),
i.e. when m = 1 for the left side and when n = 1 for the right side.
For the right side, transpose the complete operation, and change upper to lower and
vice versa when A is being transposed.
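The two special cases can be summarized as a dispatch on the side and the dimensions of B (m x n). The sketch below only selects a path; the enum and function names are hypothetical, and checking the scalar case before the TRSV case when m = n = 1 is an assumption, since the text does not state a precedence.

```c
#include <assert.h>

typedef enum { SIDE_LEFT_SKETCH, SIDE_RIGHT_SKETCH } side_sketch_t;
typedef enum { PATH_TRSM, PATH_TRSV, PATH_SCALAR_DIVIDE } path_sketch_t;

/* B is m x n. A is m x m for the left side and n x n for the right side. */
path_sketch_t trsm_dispatch_sketch(side_sketch_t side, long m, long n)
{
    assert(m >= 1 && n >= 1);
    /* Order of A, and the extent of B along the other dimension. */
    const long a_dim   = (side == SIDE_LEFT_SKETCH) ? m : n;
    const long b_other = (side == SIDE_LEFT_SKETCH) ? n : m;

    if (a_dim == 1)
        return PATH_SCALAR_DIVIDE;  /* Case 2: A is a single diagonal element */
    if (b_other == 1)
        return PATH_TRSV;           /* Case 1: B is a vector, A is a matrix */
    return PATH_TRSM;               /* general case */
}
```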
Change-Id: Ib020f2a568f04a6e8d8f75bfc38adbfd7c5d175a
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Changes made in bli_zgemmsup_rv_zen_asm_3x4n.
3. Changes made in bli_zgemmsup_rv_zen_asm_3x4m.
4. Changes made in bli_zgemm_haswell_asm_3x4.
Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
Case 1: Call TRSV when matrices C and B are vectors and A is a matrix,
i.e. when n = 1 for the left side and when m = 1 for the right side.
Case 2: Divide B by A when matrices C and B are vectors and A is a scalar (diagonal element),
i.e. when m = 1 for the left side and when n = 1 for the right side.
For the right side, transpose the complete operation, and change upper to lower and
vice versa when A is being transposed.
Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified blas interface of GEMM to call GEMV whenever m=1 or n=1.
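One way to picture the GEMM-to-GEMV redirection: when n = 1, B and C are column vectors, and when m = 1, A and C are row vectors, so either case is a matrix-vector product. The sketch below is a plain reference in real double precision with hypothetical names, assuming column-major storage and no transposes; it is not the BLAS interface code from this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strided gemv: y := beta * y + alpha * M * x, where M is
   m x n with row stride rs and column stride cs. Swapping rs and cs
   applies M transposed without copying. */
static void gemv_strided_sketch(size_t m, size_t n, double alpha,
                                const double *a, size_t rs, size_t cs,
                                const double *x, size_t incx,
                                double beta, double *y, size_t incy)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i * rs + j * cs] * x[j * incx];
        double *yi = &y[i * incy];
        *yi = (beta == 0.0) ? alpha * acc : alpha * acc + beta * *yi;
    }
}

/* C := beta * C + alpha * A * B, column-major, no transpose.
   A is m x k, B is k x n, C is m x n. */
void gemm_sketch(size_t m, size_t n, size_t k, double alpha,
                 const double *a, size_t lda, const double *b, size_t ldb,
                 double beta, double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    if (n == 1) {   /* B and C are column vectors: c := beta*c + alpha*A*b */
        gemv_strided_sketch(m, k, alpha, a, 1, lda, b, 1, beta, c, 1);
        return;
    }
    if (m == 1) {   /* A and C are row vectors: c^T := beta*c^T + alpha*B^T*a^T */
        gemv_strided_sketch(n, k, alpha, b, ldb, 1, a, lda, beta, c, ldc);
        return;
    }
    for (size_t j = 0; j < n; ++j)      /* general case: plain triple loop */
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            double *cij = &c[i + j * ldc];
            *cij = (beta == 0.0) ? alpha * acc : alpha * acc + beta * *cij;
        }
}
```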
Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
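To make the gemmt semantics and the beta = 0 issue described above concrete, here is a minimal reference sketch. It updates only the stored triangle of a square C and treats beta = 0 as an overwrite, so uninitialized or NaN entries in C are never multiplied. The names are hypothetical, and column-major storage without transposes is assumed; this is not the BLIS implementation, which reuses the herk variant logic.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { UPLO_LOWER_SKETCH, UPLO_UPPER_SKETCH } uplo_sketch_t;

/* C := beta * C + alpha * A * B on one triangle of the n x n matrix C.
   A is n x k and B is k x n, both column-major. */
void gemmt_sketch(uplo_sketch_t uplo, size_t n, size_t k, double alpha,
                  const double *a, size_t lda, const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    assert(lda >= n && ldb >= k && ldc >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            const int stored = (uplo == UPLO_LOWER_SKETCH) ? (i >= j) : (i <= j);
            if (!stored)
                continue;               /* the other triangle is never touched */
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            double *cij = &c[i + j * ldc];
            /* Explicit beta == 0 branch: C is overwritten, not scaled,
               so NaN or uninitialized values cannot leak into the result. */
            *cij = (beta == 0.0) ? alpha * acc : alpha * acc + beta * *cij;
        }
}
```

Without the explicit beta = 0 branch, a NaN in C would survive the multiplication by zero, which is the kind of false test failure the armv7a fix addresses.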
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
Details:
- Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
- Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.
AMD-Internal: [CPUPL-1231]
Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
Details:
- The kernel is called directly from the API call to avoid framework overhead for complex float and complex double precisions.
- Added SIMD code for complex float and complex double, and unrolled the for loop 5 times to improve performance.
AMD-Internal: [CPUPL-1057]
Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
accessible to all zen family architectures.
Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
- Bug fix in the sgemmsup 1x16 kernel for the beta zero case with C in column storage:
the rcx register increment was missing, because of which 4 values
in the output were overwritten.
Change-Id: Ia3028040dce3e615f1db5a331498d86faadcf916
Ported this library to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler.
AMD internal:[CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Multiple trace levels allow the user to set the nested call level
up to which traces are emitted. This also reduces file size
requirements.
Also optimized auto trace output to reduce file size by removing
thread IDs from individual lines.
AMD Internal: [CPUPL-806]
Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Added traces from the blas/cblas APIs down to the kernels for dgemm and sgemm.
By default the traces are disabled; users need to enable them
in their local workspace. Please check the aocl_dtl/aocldtlcf.h file.
AMD Internal : CPUPL-806
Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
A failure was seen in the libflame function (FLASH_UDdate_UT_inc)
due to typecasting a double complex pointer as a double pointer.
Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
Details:
The kernel used ymm registers, which store 8 float values, where only 4 float values should be stored.
Changed registers from ymm to xmm in the required places. The issue is observable
only when the leading dimension is greater than the actual dimension.
Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800
Details:
Added support for the N SUP kernel for complex float and complex double.
Removed prefetching in the M SUP kernels for complex float and complex double.
Removed all warnings.
Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05