Details:
- Added packing of Y for incy > 1 cases in dgemv_unf_var2.
- Added packing of X for incx > 1 cases in dgemv_unf_var1.
AMD-Internal: [SWLCSG-735]
Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c
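The vector packing described in the dgemv commit above can be illustrated with a minimal sketch. It assumes a positive element stride and a caller-provided contiguous buffer; the names pack_vector and unpack_vector are hypothetical and are not BLIS symbols.

```c
#include <stddef.h>

/* Copy n elements of a strided vector x (stride incx, in elements) into the
 * contiguous buffer x_packed. Assumes incx > 0 and that x_packed holds at
 * least n doubles. Hypothetical helper, not part of BLIS. */
void pack_vector(size_t n, const double *x, size_t incx, double *x_packed)
{
    for (size_t i = 0; i < n; ++i)
        x_packed[i] = x[i * incx];
}

/* Scatter a contiguous result back into a strided vector y (stride incy). */
void unpack_vector(size_t n, const double *y_packed, double *y, size_t incy)
{
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = y_packed[i];
}
```

Once the operand is contiguous, a unit-stride kernel can be used on the packed copy, and the result is scattered back only when the output vector is the strided one.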
1. Added compiler flags for the clang-cl compiler to build BLIS with multithreading using the OpenMP library.
2. Updated format of presenting version string.
AMD-Internal: [CPUPL-1630]
Change-Id: I979de541fa57415c08c20b0d5b684ae6bd242d19
1. Added 3m_sqp support for matrix A with conjugate_no_transpose and conjugate_transpose.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
Removed unused function rm-dupls() from common.mk
Removed code from patch-ld-so.py which is not needed for AMD codebase.
AMD-Internal: [CPUPL-1539]
Change-Id: If1812d5aa87c1e3a9d0c4706d571223d56f2fc20
-- Ignore the AOCL dynamic configuration if multithreading is disabled.
AOCL Dynamic will also be disabled in this case.
-- Added the following configuration settings to the showconfig output:
1. Complex return scheme
2. TRSM preinversion status
3. AOCL dynamic active status
AOCL-Internal: [CPUPL-1565]
Change-Id: Id5a31b233fc08dcd871de4a693aab0b2a5d9f1c4
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to separate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. Users can set the environment variable "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
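For readers unfamiliar with the "3m" in 3m_sqp: assuming it refers to the usual three-multiplication formulation of a complex product, the scalar identity behind it can be sketched as follows. In a gemm setting the same identity would be applied to the real and imaginary parts of whole matrices; cmul_3m is a hypothetical name and this is not the sqp kernel itself.

```c
/* Complex product (ar + i*ai) * (br + i*bi) using three real
 * multiplications instead of four. Illustrative only. */
void cmul_3m(double ar, double ai, double br, double bi,
             double *cr, double *ci)
{
    const double p1 = ar * br;
    const double p2 = ai * bi;
    const double p3 = (ar + ai) * (br + bi);

    *cr = p1 - p2;        /* real part:      ar*br - ai*bi */
    *ci = p3 - p1 - p2;   /* imaginary part: ar*bi + ai*br */
}
```

The saving is one multiplication at the cost of extra additions, which matters when each "multiplication" is a full real matrix product.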
- Added BLAS interface for dzgemm. This function calls the
native implementation of gemm.
- Mixed-datatype support is already present in BLIS, but this
implementation requires the alpha_imag value to be 0.
- Modified test_gemm.c to support testing of dzgemm.
Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334
Details:
- Enabled packing of B, which improves sgemm performance when
all m, n, k dimensions are above 240, irrespective of the lda alignment.
- We may extend this optional enablement further to other skinny types
and to multithreaded scenarios.
Change-Id: Icb2a21e458cdcb0f8fdce373d8d0860c51be8d21
Details:
- BLIS has reserved rs = cs = 1 case only for 1x1 scalars.
- For vectors, even though rs = cs = 1 is a valid input, BLIS
adjusts the strides to satisfy the error checking.
- For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m
to indicate that this is a column vector stored in column major
order. Similarly BLIS sets rs = n in case of m = 1 and n > 1.
- So determining the storage scheme from the row stride could lead to
errors if one of the matrices is a vector.
- Modified bench files to determine storage scheme based on
stor_scheme character instead of checking row-strides.
Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527
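The stride adjustment described in the storage-scheme commit above can be modelled in a few lines to show why a row-stride test misclassifies vectors. This is a simplified model of the behavior stated in the message, not BLIS's actual code; the type and function names are hypothetical, and the storage characters 'c'/'C' are an assumption about the bench input format.

```c
typedef struct { long rs; long cs; } strides_t;

/* Simplified model: rs = cs = 1 is reserved for 1x1 scalars, so vector
 * strides are adjusted as the commit message describes. */
strides_t adjust_vector_strides(long m, long n, long rs, long cs)
{
    strides_t s = { rs, cs };
    if (rs == 1 && cs == 1) {
        if (m > 1 && n == 1)
            s.cs = m;   /* column vector stored in column-major order */
        else if (m == 1 && n > 1)
            s.rs = n;   /* row vector */
    }
    return s;
}

/* Fragile check: infers column storage from the row stride. */
int is_col_stored_by_stride(strides_t s)
{
    return s.rs == 1;
}

/* Robust check: trusts the storage character supplied with the input. */
int is_col_stored_by_char(char stor_scheme)
{
    return stor_scheme == 'c' || stor_scheme == 'C';
}
```

For a 1 x 5 input given with rs = cs = 1, the adjusted row stride becomes 5, so the stride-based check reports "not column-stored" even when the caller asked for column-major storage.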
Details:
1. Added aocl-dynamic for the dtrsm native path.
When (m,n)<512, better performance is observed for nthreads=4.
2. Updated the trsm_small threshold such that when (m+n)<320,
trsm_small performs better than native irrespective of the
number of threads.
Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
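The two dtrsm thresholds in the commit above can be read as a small decision sketch. The numbers 320, 512 and 4 come from the message; reading "(m,n)<512" as "both m and n below 512" and capping (rather than forcing) the thread count at 4 are assumptions, and the names are hypothetical.

```c
typedef enum { TRSM_SMALL, TRSM_NATIVE } trsm_path_t;

/* Path selection: trsm_small when m + n < 320, native otherwise. */
trsm_path_t choose_dtrsm_path(long m, long n)
{
    return (m + n) < 320 ? TRSM_SMALL : TRSM_NATIVE;
}

/* Thread selection on the native path: for small problems, use at most
 * 4 threads (assumed interpretation of the commit message). */
int dtrsm_native_threads(long m, long n, int requested_nt)
{
    if (requested_nt <= 1)
        return requested_nt;            /* nothing to tune */
    if (m < 512 && n < 512 && requested_nt > 4)
        return 4;
    return requested_nt;
}
```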
Apart from BLIS_NUM_THREADS or OMP_NUM_THREADS, the number of threads can also be set by the application
by calling omp_set_num_threads(int). In the function bli_thread_init_rntm_from_env(), when these environment variables
are not set, the number of threads is now inferred by calling the API omp_get_max_threads().
As a result, if neither OMP_NUM_THREADS nor BLIS_NUM_THREADS is set, BLIS runs with omp_get_max_threads() threads by default.
This feature is only enabled when BLIS is configured with OpenMP parallelization.
Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6
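The fallback order in the thread-count commit above can be sketched as a pure function so that it does not depend on OpenMP being available. The helper names are hypothetical, and checking BLIS_NUM_THREADS before OMP_NUM_THREADS is an assumption about precedence; the OpenMP maximum is passed in by the caller.

```c
#include <stdlib.h>

/* Parse a positive thread count from an environment string.
 * Returns -1 if the string is unset, empty, or not a valid count. */
int parse_nt(const char *s)
{
    if (s == NULL || *s == '\0')
        return -1;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v < 1 || v > (1L << 20))
        return -1;
    return (int)v;
}

/* Resolve the thread count: BLIS_NUM_THREADS, then OMP_NUM_THREADS,
 * then the value the OpenMP runtime would report. */
int resolve_num_threads(const char *blis_nt, const char *omp_nt,
                        int omp_max_threads)
{
    int nt = parse_nt(blis_nt);
    if (nt == -1)
        nt = parse_nt(omp_nt);
    if (nt == -1)
        nt = omp_max_threads;
    return nt;
}
```

In an OpenMP build this would typically be called as resolve_num_threads(getenv("BLIS_NUM_THREADS"), getenv("OMP_NUM_THREADS"), omp_get_max_threads()), so a prior omp_set_num_threads() call by the application is honored when neither variable is set.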
Details:
- When AOCL dynamic is enabled, the choice between single-threaded (ST) and
multithreaded (MT) execution of SYRK is based on the dimensions of the matrices.
- Decisions to choose optimum number of threads will be updated in
the subsequent commits.
- Only local copy of rntm is modified by AOCL Dynamic feature.
global_rntm data structure remains unchanged in order to keep
track of original number of threads set by application.
- Added an early-exit condition in bli_nthreads_optimum when nt = 1
or nt = -1. This ensures that the AOCL dynamic feature is not used when
threading is set using BLIS_IC_NT or BLIS_JC_NT.
Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0
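The AOCL dynamic behavior for SYRK described above (early exit for nt = 1 or nt = -1, and tuning only a local copy of the runtime object) can be sketched as follows. The struct, the function names and the dimension limit are all hypothetical; the real thresholds are not given in the commit message.

```c
/* Hypothetical stand-in for the runtime object; not the BLIS rntm_t. */
typedef struct { int num_threads; } rntm_sketch_t;

/* Hypothetical cutoff below which single-threaded SYRK is chosen. */
#define HYPOTHETICAL_SYRK_ST_LIMIT 10

void nthreads_optimum_sketch(rntm_sketch_t *local, long n, long k)
{
    /* Early exit: already single-threaded, or thread count unset
     * (for example when threading is configured via BLIS_IC_NT/BLIS_JC_NT). */
    if (local->num_threads == 1 || local->num_threads == -1)
        return;

    if (n < HYPOTHETICAL_SYRK_ST_LIMIT && k < HYPOTHETICAL_SYRK_ST_LIMIT)
        local->num_threads = 1;
}

/* Tune a local copy so the application's original setting is preserved. */
int syrk_threads_for_call(const rntm_sketch_t *global, long n, long k)
{
    rntm_sketch_t local = *global;
    nthreads_optimum_sketch(&local, n, k);
    return local.num_threads;
}
```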
Details:
1. Added prefetching of the next micro-panel of A and B in the dgemm block,
which reduces load latency and improves performance.
2. Removed unnecessary unrolls in gemm loops and moved the 8x6 and 6x8 core
dgemm into macros to make it more modular.
3. Packing and diagonal packing in the main dgemm loops are modularized.
Fringe cases are yet to be modularized.
4. Updated dtrsm small thresholds for single- and multi-thread cases.
5. Updated div/scale based on whether trsm pre-inversion is disabled or enabled.
6. Code cleanup.
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
1. Added support in cmake scripts for linking libomp for blis multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for the gemm_batch and axpby APIs.
4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c to avoid line-ending issues on different OSes.
5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in the main CMakeLists.txt file.
AMD-Internal: [CPUPL-1630]
Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
Details:
When parallelization is enabled in BLIS through the environment variables BLIS_?C_NT or
BLIS_?R_NT, dgemm_ was running single-threaded. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1
irrespective of the BLIS_IC_NT or BLIS_JC_NT values. As a result, the dgemm_ interface assumed a single thread and called
small_gemm, which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c. It returns 1 if parallelization is enabled either through BLIS_?C_NT values or
BLIS_NUM_THREADS, and zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call the parallel dgemm_ or the sequential one.
Added the same fix for zgemm_.
Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
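The check described in the dgemm_ threading fix above can be sketched as below. This is an illustration of the stated logic, not the body of bli_thread_is_parallel(); the struct is a hypothetical view of the relevant rntm fields, and treating "unset" per-loop values as -1 or 1 is an assumption.

```c
/* Hypothetical view of the threading fields consulted by the check. */
typedef struct {
    int num_threads;                 /* -1 when no global count is set */
    int jc_nt, ic_nt, jr_nt, ir_nt;  /* per-loop ways of parallelism   */
} rntm_view_t;

/* Returns 1 if any source of parallelism is configured, 0 otherwise. */
int thread_is_parallel_sketch(const rntm_view_t *r)
{
    if (r->num_threads > 1)
        return 1;
    if (r->jc_nt > 1 || r->ic_nt > 1 || r->jr_nt > 1 || r->ir_nt > 1)
        return 1;
    return 0;
}
```

With num_threads = -1 and ic_nt = 4, a test on num_threads alone would pick the sequential path, whereas this check reports parallel execution.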
Details:
1. Adding prefetch in the gemm module gave an average gain of 10% in dtrsm right-side cases.
2. For skinny sizes with m<=2000 and n<=1000, performance is equivalent to MKL.
Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605
Details:
- Passing enum rather than char for uplo, transa, and diaga.
- Deleted the log file and other temp files that were merged into the codebase from amax.
AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
-- Fixed issues in printing the values of
side, uploa and diaga parameters for
hemm, hemv, her, her2, her2k, herk,
symm, symv, syr, syr2, syr2k, syrk,
trmm, trmv, trsm, trsv.
-- For the above APIs, logging was called with MKSTR()
for the side, uploa and diaga parameters. MKSTR is
needed only for macro arguments, not
for function arguments.
-- Added space between function name and data type
where it was missing. Bench expects logs in
this format.
AMD-Internal: [CPUPL-1585]
Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d
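The MKSTR issue in the logging commit above follows from how the preprocessor stringizing operator works: it turns the argument's spelling into a string literal and never sees the runtime value. A minimal sketch, assuming MKSTR is defined with the # operator as shown (the logger functions are hypothetical):

```c
#include <stdio.h>
#include <string.h>

#define MKSTR(s) #s   /* assumed definition: stringizes the token as written */

/* Buggy pattern: logs the parameter's name, not its value. */
void log_side_stringified(char side, char *buf, size_t len)
{
    (void)side;  /* the value is never used, which is the bug */
    snprintf(buf, len, "side=%s", MKSTR(side));
}

/* Fixed pattern: formats the function argument directly. */
void log_side_value(char side, char *buf, size_t len)
{
    snprintf(buf, len, "side=%c", side);
}
```

Calling both with 'L' gives "side=side" from the first function and "side=L" from the second.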
Details:
- Implemented zaxpyf kernel with fuse factor=4 for zgemv.
- Modified BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv reduces to scalv of Y with beta. Added code to
return early after scaling the Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
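The two gemv shortcuts in the zgemv commit above (alpha = 0 reduces to scaling y, and a length-1 y reduces to a dot product) can be shown with a real-valued reference sketch. It assumes column-major A, no transpose, and unit-stride x and y; the function names are hypothetical and this is not the optimized kernel.

```c
#include <stddef.h>

/* Dot product of n strided elements of a (stride inca) with contiguous x. */
double dot_strided(size_t n, const double *a, size_t inca, const double *x)
{
    double acc = 0.0;
    for (size_t j = 0; j < n; ++j)
        acc += a[j * inca] * x[j];
    return acc;
}

/* y := beta*y + alpha*A*x for an m x n column-major A (leading dim lda). */
void gemv_sketch(size_t m, size_t n, double alpha, const double *a,
                 size_t lda, const double *x, double beta, double *y)
{
    /* Shortcut 1: alpha == 0 leaves only the scaling of y by beta. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < m; ++i)
            y[i] *= beta;
        return;
    }

    /* Shortcut 2: a single output element is just a dot product of
     * row 0 of A (stride lda in column-major storage) with x. */
    if (m == 1) {
        y[0] = beta * y[0] + alpha * dot_strided(n, a, lda, x);
        return;
    }

    /* General case: scale y, then accumulate column by column. */
    for (size_t i = 0; i < m; ++i)
        y[i] *= beta;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[i] += alpha * a[i + j * lda] * x[j];
}
```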
Details:
- To determine whether matrices are col-stored, we were checking
ldc == 1. This is incorrect, as a matrix can be col-stored with ldc = 1
if the dimension is 1.
- Modified the condition to check the row stride instead of the column stride.
If row-stride != 1, we can assume that the matrices are not col-stored
and ignore those inputs by printing an error message.
Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87
Details:
- Added decision logic to choose between SUP and native implementations
of SYRK for zen2 architectures.
- For architectures other than zen2 it will be redirected to gemm
threshold function.
Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d
Completely disabled supvar1n (panel-block) gemm to simplify things.
supvar1n performs better only when m is very large and n and k are small (<10). This
simplification improves performance for dgemm with m = n shapes.
Change-Id: I523fcb211e8ab92718ea7367f9707a38275e24b1
Since 3m1 is turned off in bla_gemm.c, set 3m1 to FALSE in bli_l3_ind_oper_st.
AMD-Internal: [CPUPL-1592]
Change-Id: I80dfe7c993f9edfbf752b7351cfdaa22a9e60035
Details:
1. Added optimized dtrsm kernels for all 8 right-side cases.
Below are a few notable optimizations that improved performance:
a. Loading, transposing (for transa cases), packing and reusing
the a01 block required for the GEMM operation. The block size
increases from 0 to 6x(n-6) in steps of 6x6 while solving TRSM
from one end of the triangular matrix A to the other.
b. Packing the 6 diagonal elements in one location helped to utilize
the cache line efficiently.
AMD-Internal: [CPUPL-1563]
Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- When alpha = 0, gemv reduces to scalv of Y with beta. Added code to
return early after scaling the Y vector with beta.
AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
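As an illustration of an axpyf operation with fuse factor 4 for the scomplex case mentioned above, the following reference sketch updates y with four columns of A in a single pass. It assumes column-major A, unit strides and no conjugation; caxpyf4_sketch is a hypothetical name and the code is a plain C reference, not the vectorized kernel.

```c
#include <complex.h>
#include <stddef.h>

/* y := y + alpha * A * x for an m x 4 column-major block of A (lda >= m).
 * All four columns are applied in one sweep over y (fuse factor 4). */
void caxpyf4_sketch(size_t m, float complex alpha,
                    const float complex *a, size_t lda,
                    const float complex *x, float complex *y)
{
    /* Pre-scale the four x entries once, outside the loop over rows. */
    const float complex ax0 = alpha * x[0];
    const float complex ax1 = alpha * x[1];
    const float complex ax2 = alpha * x[2];
    const float complex ax3 = alpha * x[3];

    for (size_t i = 0; i < m; ++i) {
        y[i] += a[i] * ax0
              + a[i + lda] * ax1
              + a[i + 2 * lda] * ax2
              + a[i + 3 * lda] * ax3;
    }
}
```

Fusing the columns means y is loaded and stored once per four columns instead of once per column, which is the usual motivation for an axpyf-style kernel inside gemv.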
When op(A) or op(B) is a transpose, the leading dimensions of these matrices were being altered.
Commented out the statements "if(transa) lda = ..." (and similarly for matrix B), correcting this
mistake for both column and row storage.
Added a provision to call BLIS interfaces when row-major inputs are used.
Change-Id: Id2041af219a64567471c14190f283274d1df2f7f
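The point of the leading-dimension fix above is that lda describes how A is laid out in memory and does not change when the operation uses the transpose. A minimal sketch for column-major storage (op_elem is a hypothetical helper):

```c
#include <stddef.h>

/* Element (i, j) of op(A), where A is stored column-major with leading
 * dimension lda. A transpose only swaps the roles of i and j; lda is
 * the same in both branches. */
double op_elem(const double *a, size_t lda, int trans, size_t i, size_t j)
{
    return trans ? a[j + i * lda] : a[i + j * lda];
}
```

For a 2 x 3 matrix stored with lda = 2, element (2, 1) of the transpose is read from the same address as element (1, 2) of the original, with lda still equal to 2.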
- Added bench utility for dotv and scalv APIs
- Corrected logging for scalv to handle complex types
- Corrected logging to remove transpose field from dotv logs
AOCL-Internal: [CPUPL-1577]
Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07
- incx and incy were not considered while allocating
memory for x and y vectors.
- Updated test data set
AMD-Internal: [CPUPL-1578]
Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05
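The allocation issue fixed above comes down to the storage a strided vector actually spans: n logical elements with increment inc touch 1 + (n - 1) * |inc| array slots. A minimal sketch with hypothetical helper names:

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of array elements spanned by an n-element vector with
 * increment inc (BLAS-style, may be negative). */
size_t strided_vector_elems(size_t n, int inc)
{
    if (n == 0)
        return 0;
    long long v = inc;
    size_t stride = (size_t)(v < 0 ? -v : v);
    return 1 + (n - 1) * stride;
}

/* Allocate zero-initialized storage large enough for the strided vector.
 * Returns NULL for an empty vector or on allocation failure. */
double *alloc_strided_vector(size_t n, int inc)
{
    size_t elems = strided_vector_elems(n, inc);
    return elems ? calloc(elems, sizeof(double)) : NULL;
}
```

Allocating only n elements when inc is 2 would let the last logical element fall outside the buffer, which is the kind of error the commit addresses.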
Details:
1. Added optimized dtrsm kernels for all 8 left-side cases.
Below are a few notable optimizations that improved performance:
a. Loading, transposing (for transa cases), packing and reusing
the a10 block required for the GEMM operation. The block size
increases from 0 to 8x(m-8) in steps of 8x8 while solving TRSM
from one end of the triangular matrix A to the other.
b. Performing in-register transposes whenever required.
c. Packing the 8 diagonal elements in one location helped to utilize
the cache line efficiently.
2. Enabled calling dtrsm small for smaller sizes at the cblas level itself
to avoid framework overhead, which is significant for very small
sizes.
3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun
and manideep.kurumella@amd.com for implementing lut kernels
using intrinsics.
4. Removed all older implementations of strsm that were not
developed as per the guidelines; they can be referred to in
older releases if required.
Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6