- The current implementation of syrk_small computes the entire C matrix
rather than only its triangular part, which makes this implementation
inefficient.
AMD-Internal: [CPUPL-1571]
Change-Id: I9a153207471a55e52634429062d18ba1a225fed9
The cpp and testcpp folders exist in the root directory as well as in the
vendor directory. Only the folders in the vendor directory are needed.
Removed the duplicate directories and updated the makefiles to pick the
sources from the vendor folder.
AMD-Internal: [CPUPL-1834]
Change-Id: I178043a09fd746660938b89ecce73c53d6c53409
-- Added -march=znver3 flag if the library is built for zen3
configuration with gcc compiler version 11 or above.
-- Replaced hardcoded compiler names 'gcc' and 'clang' with
variable $CC so that options are chosen as per the compiler
specified at configure time (instead of compiler in path).
AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
Details:
BLIS currently supports the BLAS and CBLAS interfaces with lowercase
symbol names. With this commit, we also support uppercase names with
and without a trailing underscore, and lowercase names without a
trailing underscore.
Change-Id: Ibb06121821ab937b25d492409625916f542b2135
Details:
1. Unrolled by a factor of 5. This gave a gain of around 1 GFLOPS.
2. Replaced CMP with subs and remove nan. CMP uses many compare
instructions, which have higher latency and increase the
instruction count. Replacing it with subs and remove nan
reduced this to 3 lighter instructions.
3. Added remove nan function.
4. Added AVX512 definition in skx context.
5. Disabled code in the AMAXV kernel depending on whether the
AVX512 flag is defined.
Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
Details:
- Developed damaxv for the AVX512 extension.
- Implemented removeNAN function that converts NaN values
to negative values based on the location.
- Avoided the use of COMPARE256/COMPARE128 in the AVX512
implementation for better performance.
- Unrolled the loop by a factor of 4.
Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still a lot of places where ZEN family
specific code is added in framework files. They will be addressed
with a proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed unused, obsolete macros. Some of them may be needed for
debugging and can be added back in individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of existing amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
- The function "bli_determine_blocksize_b_sub" uses a modulo operation
using b_alg. In cases where trsm blocksizes are not set, using this
function with trsm blocksizes being zero will lead to FPE.
- Added a check to avoid calling the above function with zeroes as
parameters.
Change-Id: I770aaa13125b55320a68ff9fc3da782111e0978a
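The failure mode and the guard described in the commit above can be sketched in isolation. This is a minimal illustration, not the BLIS code: `toy_blocksize_remainder` and `toy_safe_blocksize_remainder` are hypothetical names, and the modulo only stands in for the one that `bli_determine_blocksize_b_sub` performs with `b_alg`.

```c
#include <assert.h>

/* Hypothetical stand-in for the modulo step. Calling it with b_alg == 0
   would be an integer division by zero, which typically raises a
   floating point exception (SIGFPE) on x86. */
static long toy_blocksize_remainder(long dim, long b_alg)
{
    assert(b_alg > 0);
    return dim % b_alg;
}

/* Guarded caller: when the blocksize is unset (zero), skip the modulo
   and return -1 so the caller can fall back to another blocksize. */
static long toy_safe_blocksize_remainder(long dim, long b_alg)
{
    if (b_alg == 0)
        return -1;
    return toy_blocksize_remainder(dim, b_alg);
}
```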
1. Left Lower non-trans, Left Upper trans
2. Left Upper non-trans, Left Lower trans
3. Right Lower non-trans, Right Upper trans
4. Right Upper non-trans, Right Lower trans
Change-Id: I0b0155d7c3a55ec74d53c8f1f49f1bceb63b15f5
Details:
- The field "trsm_blkszs" in cntx enables us to set and use blocksizes
that are specific to trsm only instead of using common L3 blocksizes.
- While querying blocksizes for TRSM, added a check to see if
trsm-specific blocksizes are set. If they are not set, we continue
trsm execution by using common Level-3 blocksizes.
- TRSM-specific blocksizes (if required) need to be set in the
cntx_init function of the respective configuration.
Change-Id: I06afe2b1d29004e428d2fb3bf2be2958cbbd1a97
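The query-with-fallback behavior described in the commit above can be sketched as follows. The struct and function here are hypothetical simplifications (a single blocksize per kind, with zero meaning "not set"); the real cntx stores far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified context. */
typedef struct {
    long l3_mc;    /* common Level-3 blocksize */
    long trsm_mc;  /* TRSM-specific blocksize; 0 means "not set" */
} toy_cntx_t;

/* Return the TRSM-specific blocksize when set, else the common one. */
static long toy_query_trsm_mc(const toy_cntx_t *cntx)
{
    assert(cntx != NULL);
    return (cntx->trsm_mc != 0) ? cntx->trsm_mc : cntx->l3_mc;
}
```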
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
adding threshold functions under a macro enables these functions for
all the configs under a family. This can be avoided by adding function
pointers to cntx which can be queried from cntx during run-time
based on the config chosen.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
-- In the existing design, the memory pool supports buffers of only one
size. This size is determined at compile time to support the buffer
needed for the biggest block size and data type. However, the size
calculation is not generic as it considers sizes only for GEMM.
It also assumes that the buffer will not be used for any purpose
other than packing operations.
-- If a new buffer is requested whose size is bigger than the existing
size, the pool is re-initialized to contain buffers of the new size.
However, this is done only if the pool is empty. If the pool is not
empty, the execution is aborted. This is undesirable because in
multithreaded scenarios it is possible that different threads need
buffers of different sizes at the same time (i.e. while another thread
is still in the middle of an operation).
-- This commit removes the restriction of a single buffer size. When a
new buffer is requested whose size is bigger than the existing one, no
re-init is done. A new buffer of the required size is allocated, added
to the pool and returned to the user.
Change-Id: I20acdb60eb06ab2e53366d51713aa83c4b2df0da
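The pool behavior described in the commit above (grow by allocating a new buffer instead of re-initializing or aborting) can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names and a fixed-capacity block table; the actual pool also needs locking and alignment handling, which are omitted here.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_POOL_MAX 16

typedef struct {
    void  *buf;
    size_t size;
    int    in_use;
} toy_block_t;

typedef struct {
    toy_block_t blocks[TOY_POOL_MAX];
    int         count;
} toy_pool_t;

/* Return a buffer of at least `size` bytes, or NULL on failure. */
static void *toy_pool_checkout(toy_pool_t *pool, size_t size)
{
    assert(pool != NULL);
    /* Reuse any free block that is already large enough. */
    for (int i = 0; i < pool->count; ++i) {
        toy_block_t *b = &pool->blocks[i];
        if (!b->in_use && b->size >= size) {
            b->in_use = 1;
            return b->buf;
        }
    }
    /* Otherwise add a new block; blocks held by other users are untouched. */
    if (pool->count == TOY_POOL_MAX)
        return NULL;
    void *buf = malloc(size);
    if (buf == NULL)
        return NULL;
    pool->blocks[pool->count] = (toy_block_t){ buf, size, 1 };
    pool->count++;
    return buf;
}

/* Mark a previously checked-out buffer as free for reuse. */
static void toy_pool_checkin(toy_pool_t *pool, void *buf)
{
    assert(pool != NULL);
    for (int i = 0; i < pool->count; ++i) {
        if (pool->blocks[i].buf == buf) {
            pool->blocks[i].in_use = 0;
            return;
        }
    }
}

/* Release every block owned by the pool. */
static void toy_pool_finalize(toy_pool_t *pool)
{
    assert(pool != NULL);
    for (int i = 0; i < pool->count; ++i)
        free(pool->blocks[i].buf);
    pool->count = 0;
}
```

A request larger than any free block never disturbs buffers that other users still hold, which is the property the commit is after.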
Improved DGEMM performance for smaller sizes. AOCL Dynamic is incorporated at the BLAS interface to enable
calling bli_dgemm_small when the optimum number of threads implied is 1 (for n and k < 10).
Improved smart threading logic for dgemm.
Added conditions at the BLAS interface to invoke bli_dgemm_small.
Removed N > 3 condition from bli_dgemm_small.
Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
Details:
- Added a check condition in GEMMT native path to choose
update_triang routines based on whether the kernels are
row-preferred or column preferred.
- Moved the zen-specific SUP thresholds under BLIS_CONFIG_EPYC
macro. For non-zen architectures, it falls back to L3 SUP
thresholds.
- Modified SUP code path to always choose 2m variant. 1n variant
is not implemented for GEMMT.
Change-Id: Ifdd55815c588f645e337de80b5b9d1864f6b5dd3
Description:
1. Added fflush after fprintf in aocldtl to avoid missing any log
info when the application crashes
2. Removed one redundant ifdef statement
Change-Id: I79a645060fa36bd8c7e5eaca4d9183dc944329ea
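The flush-after-write pattern from item 1 of the commit above can be sketched as follows. `toy_log_line` is a hypothetical helper, not the aocldtl API; only `fprintf` and `fflush` are standard C.

```c
#include <assert.h>
#include <stdio.h>

/* Write one log line and flush immediately, so the line is handed to the
   OS even if the process terminates abnormally right afterwards.
   Returns 0 on success, -1 on failure or when logging is disabled. */
static int toy_log_line(FILE *fp, const char *msg)
{
    assert(msg != NULL);
    if (fp == NULL)
        return -1;              /* no log file open */
    if (fprintf(fp, "%s\n", msg) < 0)
        return -1;
    return (fflush(fp) == 0) ? 0 : -1;
}
```

Flushing on every line costs some throughput, which is usually acceptable for a debug and trace log.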
Description:
1. While processing remainder cases in the bli_trsm_small algorithm,
a few loads and stores were accessing memory beyond the given
matrix buffer because of vectorized instructions.
2. Changed 256-bit vector loads/stores at edges into 128-bit or 64-bit loads/stores
so that no read/write happens beyond the matrix boundary.
AMD-Internal: [CPUPL-1759] [SWLCSG-819]
Change-Id: Iba51d0ed9bb28d1b0948a219755b8dbcc86a7fa9
1. SUP is enabled for the single-instance case.
2. The environment variable BLIS_SINGLE_INSTANCE should be set to 1 to enable single-instance tuning.
AMD-Internal: [CPUPL-1743]
Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec
Details :
- Added packing of Y for incy > 1 cases for dgemv_unf_var2.
- Added packing of X for incx > 1 cases for dgemv_unf_var1.
AMD-Internal: [SWLCSG-735]
Change-Id: Ib395f478ba984a85533e4f79b3521d0b2500c30c
1. Added the compiler flags for the clang-cl compiler to build blis multithreading using openmp library.
2. Updated format of presenting version string.
AMD Internal : [CPUPL-1630]
Change-Id: I979de541fa57415c08c20b0d5b684ae6bd242d19
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
Removed unused function rm-dupls() from common.mk
Removed code from patch-ld-so.py which is not needed for AMD codebase.
AMD-Internal: [CPUPL-1539]
Change-Id: If1812d5aa87c1e3a9d0c4706d571223d56f2fc20
-- Ignore aocl dynamic configuration if multithreading is disabled.
AOCL Dynamic will also be disabled in this case.
-- Added following configuration settings in showconfig output
1. Complex return scheme
2. TRSM preinversion status
3. AOCL dynamic active status
AOCL-Internal: [CPUPL-1565]
Change-Id: Id5a31b233fc08dcd871de4a693aab0b2a5d9f1c4
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to separate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
- Added blas interface for dzgemm. This function will call
native implementation of gemm.
- Mixed datatype support is already present in BLIS. But this
implementation requires alpha_imag value to be 0.
- Modified test_gemm.c to support testing of dzgemm.
Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334
Details:
- Enabling packing of B improves sgemm performance when all of the
m, n and k dimensions are above 240, irrespective of the lda alignment.
- We may extend this optional enablement further to other skinny types
and to multithreaded scenarios.
Change-Id: Icb2a21e458cdcb0f8fdce373d8d0860c51be8d21
Details:
- BLIS has reserved rs = cs = 1 case only for 1x1 scalars.
- For vectors, even though rs = cs = 1 is a valid input, BLIS
adjusts the strides to satisfy the error checking.
- For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m
to indicate that this is a column vector stored in column major
order. Similarly BLIS sets rs = n in case of m = 1 and n > 1.
- So determining the storage scheme based on row-stride could lead to
errors if one of the matrices becomes a vector.
- Modified bench files to determine storage scheme based on
stor_scheme character instead of checking row-strides.
Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527
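The pitfall described in the commit above can be reproduced with a small model. Everything below is a hypothetical sketch: the helper names are invented, the adjustment only mirrors the rules stated in the message, and the stride-based check (`rs == 1` meaning column major) is an assumption about how the bench files previously inferred the scheme.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_COL_MAJOR, TOY_ROW_MAJOR } toy_stor_t;

/* Mirror the adjustment described above for vectors passed with
   rs == cs == 1: a column vector gets cs = m, a row vector gets rs = n. */
static void toy_adjust_vector_strides(long m, long n, long *rs, long *cs)
{
    assert(rs != NULL && cs != NULL);
    if (*rs == 1 && *cs == 1) {
        if (m > 1 && n == 1)
            *cs = m;
        else if (m == 1 && n > 1)
            *rs = n;
    }
}

/* Fragile: infer the scheme from the row stride (assumed old behavior). */
static toy_stor_t toy_scheme_from_stride(long rs)
{
    return (rs == 1) ? TOY_COL_MAJOR : TOY_ROW_MAJOR;
}

/* Robust: take the scheme from the character given in the bench input. */
static toy_stor_t toy_scheme_from_char(char stor_scheme)
{
    return (stor_scheme == 'r' || stor_scheme == 'R') ? TOY_ROW_MAJOR
                                                      : TOY_COL_MAJOR;
}
```

For a row-major input with m = 4 and n = 1, the adjusted strides are rs = 1 and cs = 4, so the stride-based check reports column major even though the input line said row major.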
Details:
1. Added aocl-dynamic for the dtrsm native path.
When (m,n) < 512, better performance is observed with nthreads = 4.
2. Updated the trsm_small threshold so that when (m+n) < 320,
trsm_small performs better than native irrespective of the
number of threads.
Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
Apart from BLIS_NUM_THREADS or OMP_NUM_THREADS, the number of threads can also be set by the application
by calling omp_set_num_threads(int). In the function "bli_thread_init_rntm_from_env()", when the environment variables
are not set, the number of threads is inferred by calling the API omp_get_max_threads().
Now, by default, if neither OMP_NUM_THREADS nor BLIS_NUM_THREADS is set, BLIS will run with omp_get_max_threads() threads.
This feature is only enabled when BLIS is configured with OpenMP parallelization.
Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6
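The resolution order described in the commit above can be sketched without depending on OpenMP. The helpers below are hypothetical; the order between the two environment variables (BLIS_NUM_THREADS first) is an assumption, and `runtime_max` stands in for the value that omp_get_max_threads() would return.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a positive thread count; return -1 if unset or invalid. */
static int toy_parse_nt(const char *s)
{
    if (s == NULL || *s == '\0')
        return -1;
    char *end;
    long v = strtol(s, &end, 10);
    return (*end == '\0' && v > 0 && v <= (1L << 20)) ? (int)v : -1;
}

/* Environment values win; otherwise fall back to the runtime maximum. */
static int toy_resolve_num_threads(const char *blis_nt, const char *omp_nt,
                                   int runtime_max)
{
    assert(runtime_max >= 1);
    int nt = toy_parse_nt(blis_nt);
    if (nt == -1)
        nt = toy_parse_nt(omp_nt);
    if (nt == -1)
        nt = runtime_max;
    return nt;
}
```

In an OpenMP build, a caller could pass `getenv("BLIS_NUM_THREADS")`, `getenv("OMP_NUM_THREADS")` and `omp_get_max_threads()` as the three arguments.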
Details:
- When AOCL Dynamic is enabled, the decision to choose ST vs. MT
to solve SYRK is taken based on the dimensions of the matrices.
- Decisions to choose optimum number of threads will be updated in
the subsequent commits.
- Only local copy of rntm is modified by AOCL Dynamic feature.
global_rntm data structure remains unchanged in order to keep
track of original number of threads set by application.
- Added an early-exit condition in bli_nthreads_optimum when nt = 1
or nt = -1. This ensures that AOCL dynamic feature is not used when
threading is set using BLIS_IC_NT or BLIS_JC_NT.
Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0
Details:
1. Added prefetching of the next micro-panel of A and B in the dgemm block,
which helps reduce load latency and improves performance.
2. Removed unnecessary unrolls in gemm loops, moved the 8x6 and 6x8 core
dgemm into macros and made it more modular.
3. Packing and diagonal packing in main dgemm loops are modularized.
Fringe cases are yet to be modularized.
4. Updated dtrsm small thresholds for single and multi thread cases
5. Updated div/scale based on disable/enable of trsm pre-inversion
6. Code clean up
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
1. Added support in cmake scripts for linking libomp for blis multithreading build.
2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby APIs.
4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS.
5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in the main CMakeLists.txt file.
AMD Internal : [CPUPL-1630]
Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
Details:
When parallelization is enabled in BLIS through the environment variables BLIS_?C_NT or
BLIS_?R_NT, dgemm_ runs single-threaded. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1
irrespective of the BLIS_IC_NT or BLIS_JC_NT values. As a result, the dgemm_ interface assumes a single thread and calls
small_gemm, which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c. It returns 1 if parallelization is enabled either through BLIS_?C_NT values or
BLIS_NUM_THREADS, and zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call the parallel dgemm_ or the sequential one.
Added the same fix for zgemm_.
Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
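The decision made by the new check can be sketched as follows. The struct is a hypothetical stand-in for rntm, and the function is an illustration of the described behavior rather than the actual body of bli_thread_is_parallel(); the use of -1 for unset values follows the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced runtime object. */
typedef struct {
    int num_threads;  /* -1 when no total thread count is set */
    int jc_nt;        /* per-loop ways of parallelism, -1 when unset */
    int pc_nt;
    int ic_nt;
    int jr_nt;
    int ir_nt;
} toy_rntm_t;

/* Return 1 if either the total thread count or any per-loop way
   requests more than one thread, else 0 (sequential). */
static int toy_thread_is_parallel(const toy_rntm_t *r)
{
    assert(r != NULL);
    if (r->num_threads > 1)
        return 1;
    if (r->jc_nt > 1 || r->pc_nt > 1 || r->ic_nt > 1 ||
        r->jr_nt > 1 || r->ir_nt > 1)
        return 1;
    return 0;
}
```

Checking only `num_threads` would miss the case where it is -1 but a per-loop value such as `ic_nt` is 4, which is the situation the commit fixes.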
Details:
1. By adding prefetch in gemm module we observed average gain of 10% in dtrsm right cases.
2. For skinny sizes with sizes m<=2000 and n<=1000, performance is equivalent to MKL.
Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605
Details
- Passing enum rather than char for uplo, transa, and diaga
- Deleting log file, and other temp files, merged in the codebase from amax
AOCL-Internal: [CPUPL-1591]
Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2
-- Fixed issues in printing the values of
side, uploa and diaga parameters for
hemm, hemv, her, her2, her2k, herk,
symm, symv, syr, syr2, syr2k, syrk,
trmm, trmv, trsm, trsv.
-- For the above APIs, logging was called with MKSTR()
for the side, uploa and diaga parameters. MKSTR is
needed only for macro arguments, not for
function arguments.
-- Added space between function name and data type
where it was missing. Bench expects logs in
this format.
AMD-Internal: [CPUPL-1585]
Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d
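Why stringification misbehaves on function arguments can be shown with a short sketch. The `MKSTR` definition below is the usual one-line stringizing macro and is assumed to match the one used by the logging code; the two `toy_log_*` functions are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define MKSTR(s) #s   /* assumed definition: stringize the token as written */

/* Wrong: the preprocessor sees the token `side`, not its runtime value,
   so this always writes the text "side". */
static void toy_log_side_wrong(char *out, size_t n, char side)
{
    assert(out != NULL && n > 0);
    (void)side;
    snprintf(out, n, "%s", MKSTR(side));
}

/* Right: format the runtime value of the function argument. */
static void toy_log_side_right(char *out, size_t n, char side)
{
    assert(out != NULL && n > 0);
    snprintf(out, n, "%c", side);
}
```

Stringizing remains appropriate when the value is a macro argument, because there the token written at the call site is the value itself.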