-- Enabled conditional packing of the B matrix in the zgemmsup path,
which performs better when the B matrix is large.
-- Incorporated decision logic to choose between zgemm_small vs
zgemm sup based on matrix dimensions "m, n and k".
-- Call ZGEMV when matrix dimension m or n is 1.
A significant performance improvement is observed.
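A minimal sketch of the dispatch described above. The enum, the function name and the threshold value are hypothetical placeholders, not the tuned values used by the library:

```c
#include <stddef.h>

/* Hypothetical path identifiers; the names are illustrative only. */
typedef enum { PATH_ZGEMV, PATH_ZGEMM_SMALL, PATH_ZGEMM_SUP } zgemm_path_t;

/* Placeholder threshold: the real cut-over points are tuned empirically. */
#define SKETCH_SMALL_MAX_DIM 64

/* Choose a code path from the problem dimensions m, n and k. */
zgemm_path_t sketch_choose_zgemm_path(size_t m, size_t n, size_t k)
{
    /* A matrix-vector product is enough when C has a single row or column. */
    if (m == 1 || n == 1)
        return PATH_ZGEMV;

    /* Small problems go to the small kernel, everything else to sup. */
    if (m <= SKETCH_SMALL_MAX_DIM && n <= SKETCH_SMALL_MAX_DIM &&
        k <= SKETCH_SMALL_MAX_DIM)
        return PATH_ZGEMM_SMALL;

    return PATH_ZGEMM_SUP;
}
```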
Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d
Problem statement:
Improve the performance of the zgemm kernel for input sizes with k=1 by fine-tuning its previous implementation.
In the previous implementation, exploiting SIMD parallelism along the m and n dimensions instead of the k dimension gave better performance for the zgemm kernel. This code was improved further along the following lines:
- Cases with alpha=0 and beta!=0 (i.e. just scaling of C) are handled separately at the beginning, using the bli_zscalm API.
- Register blocking was further improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added, with the offset added to the pointer chosen empirically. Overall, it gave a mild performance improvement.
- Conditional statements were removed from the kernel loop, using logic devised to allow their removal without affecting the output.
The performance of this single-threaded implementation also competes with that of the default multithreaded implementation, as long as m and n are under 128. A possible improvement to this patch would be to find a suitable feature relating the number of threads to the input size constraints, thereby providing a distinct size constraint for each thread count.
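The alpha=0, beta!=0 shortcut from the first item can be sketched as follows. This is an illustration only: a plain loop stands in for the bli_zscalm call, and all function names are hypothetical.

```c
#include <complex.h>
#include <stddef.h>

/* Scale the m x n column-major matrix C (leading dimension ldc) by beta.
   A plain loop stands in for the library's scaling routine. */
void sketch_zscale_c(size_t m, size_t n, double complex beta,
                     double complex *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            c[i + j * ldc] *= beta;
}

/* Returns 1 if the shortcut was taken, 0 if a compute kernel must still run. */
int sketch_zgemm_alpha_zero_shortcut(size_t m, size_t n,
                                     double complex alpha, double complex beta,
                                     double complex *c, size_t ldc)
{
    /* With alpha == 0 the product A*B contributes nothing: C := beta*C. */
    if (alpha == 0.0 && beta != 0.0) {
        sketch_zscale_c(m, n, beta, c, ldc);
        return 1;
    }
    return 0;
}
```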
AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
The current implementation for handling zgemm exploits SIMD parallelism
along the k dimension. This would give great performance in cases of k
being large. But for input sizes with k=1, it is better to exploit SIMD
parallelism along the m and n dimensions, thereby giving better
performance. This commit does the same through loop reordering, by
loading column vectors from A.
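A scalar sketch of the reordered loops for k=1, assuming column-major storage. The function name is hypothetical and the real kernel uses SIMD intrinsics; the point is only that the inner loop sweeps contiguously along m:

```c
#include <complex.h>
#include <stddef.h>

/* C := beta*C + alpha * a * b for k = 1, where a is m x 1 and b is 1 x n.
   C is column-major with leading dimension ldc. */
void sketch_zgemm_k1(size_t m, size_t n, double complex alpha,
                     const double complex *a, const double complex *b,
                     double complex beta, double complex *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        /* One scalar per column of C, reused for the whole column of A. */
        const double complex alpha_b = alpha * b[j];

        /* Contiguous sweep along m: this is the loop that vectorizes. */
        for (size_t i = 0; i < m; ++i)
            c[i + j * ldc] = beta * c[i + j * ldc] + alpha_b * a[i];
    }
}
```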
AMD-Internal: [CPUPL-2236]
Change-Id: Ibfa29f271395497b6e2d0127c319ecb4b883d19f
- Need to identify new Thresholds for zgemm SUP path to avoid performance regression.
AMD-Internal: [CPUPL-2148]
Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43
- Completed zen4 configuration support on windows
- Enabled AVX512 kernels for AMAXV
- Added zen4 configuration in amdzen for windows
- Moved all zen4 kernels inside kernels/zen4 folder
AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
- Ensured that FMA, AVX2 based kernels are called only on platforms
supporting these instructions, otherwise standard ‘C’ kernels will
be called.
- Code cleanup for optimization and consistency
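The runtime check in the first item might look like the sketch below. It assumes GCC or Clang on x86 (for `__builtin_cpu_supports`); the kernel names are hypothetical, and the "AVX2" kernel is a placeholder that reuses the reference loop so the sketch stays portable:

```c
#include <stddef.h>

typedef void (*daxpy_fn)(size_t n, double alpha, const double *x, double *y);

/* Portable reference kernel in plain C: y := y + alpha * x. */
void sketch_daxpy_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Stand-in for an AVX2/FMA kernel; a real one would use intrinsics. */
void sketch_daxpy_avx2(size_t n, double alpha, const double *x, double *y)
{
    sketch_daxpy_ref(n, alpha, x, y);
}

/* Returns 1 only when the running CPU reports both AVX2 and FMA. */
int sketch_cpu_has_avx2_fma(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return 0; /* Unknown platform: assume the instructions are unavailable. */
#endif
}

/* Pick the optimized kernel only when the hardware can execute it. */
daxpy_fn sketch_select_daxpy(void)
{
    return sketch_cpu_has_avx2_fma() ? sketch_daxpy_avx2 : sketch_daxpy_ref;
}
```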
AMD-Internal: [CPUPL-2126]
Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050
- Previously, zgemm computation failures occurred because
the status variable had no pre-defined initial value,
which caused zgemm to return without being computed
by any kernel. Applied the same change to the
dgemm_ function as well.
- Enabled sup zgemm, as the status variable issue
with the bli_zgemm_small call is fixed.
- Removed the call to the sqp method, as it is disabled.
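A toy reproduction of the status-variable pattern, with hypothetical names and status codes. Initializing the status to a failure value guarantees that the general path runs whenever no fast path reports success:

```c
/* Hypothetical status codes for this sketch only. */
typedef enum { SKETCH_SUCCESS = 0, SKETCH_FAILURE = 1 } sketch_status_t;

/* Stand-in for a small-matrix kernel that declines large problems. */
sketch_status_t sketch_gemm_small(int m, int n, int k, int *computed_by)
{
    if (m > 8 || n > 8 || k > 8)
        return SKETCH_FAILURE;
    *computed_by = 1; /* 1 = small kernel */
    return SKETCH_SUCCESS;
}

/* Returns which path computed the result: 1 = small kernel, 2 = fallback. */
int sketch_gemm(int m, int n, int k, int small_path_enabled)
{
    int computed_by = 0;

    /* Start from FAILURE. Left uninitialized, this could read as SUCCESS
       when the small path is skipped, and the call would return early
       without any kernel having run. */
    sketch_status_t status = SKETCH_FAILURE;

    if (small_path_enabled)
        status = sketch_gemm_small(m, n, k, &computed_by);
    if (status == SKETCH_SUCCESS)
        return computed_by;

    computed_by = 2; /* fall through to the general path */
    return computed_by;
}
```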
Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d
1. Parallelized dtrsm_small across the m-dimension or n-dimension based on the side (Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.
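A sketch of how the work could be split per thread. It assumes the usual mapping (left side splits the independent columns of B along n, right side splits the independent rows along m); the names and the even-split policy are illustrative, not taken from the implementation:

```c
#include <stddef.h>

typedef struct { size_t start, end; } sketch_range_t; /* half-open [start, end) */

/* Split `total` items as evenly as possible among `nt` threads. */
sketch_range_t sketch_thread_range(size_t total, size_t nt, size_t tid)
{
    size_t base = total / nt;
    size_t rem = total % nt; /* the first `rem` threads get one extra item */
    sketch_range_t r;
    r.start = tid * base + (tid < rem ? tid : rem);
    r.end = r.start + base + (tid < rem ? 1 : 0);
    return r;
}

/* Left side  (A*X = B): columns of B are independent, so split n.
   Right side (X*A = B): rows of B are independent, so split m. */
sketch_range_t sketch_trsm_partition(char side, size_t m, size_t n,
                                     size_t nt, size_t tid)
{
    return (side == 'L') ? sketch_thread_range(n, nt, tid)
                         : sketch_thread_range(m, nt, tid);
}
```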
AMD-Internal: [CPUPL-2103]
Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
Details:
- Enable ctrsm small implementation
- Handled Overflow and Underflow Vulnerabilities in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, ctrsm small implementation is
used for all variants.
Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled Overflow and Underflow Vulnerabilities in
ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
sizes 64<= m,n <= 256, by keeping the number of
threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
used for all variants.
AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
1. Removed small gemm call from native path to avoid Single threaded
calls as a part of MultiThreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.
Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
Details:
- Enable ztrsm small implementation
- For small sizes, Right Variants and Left Unit Diag
Variants are using ztrsm_small implementations.
- Optimization of Left Non-Unit Diagonal Variants,
Work In Progress
AMD-Internal: [SWLCSG-1194]
Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support for conjugate and Hermitian transpose.
- Main loop operates in multiples of the 4x3 tile.
- Edge cases are handled separately.
AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
Description:
1. For small dimensions, single-threaded dgemm_small performs
better than the dgemmsup and native paths.
2. Irrespective of the given number of threads, we redirect
to single-threaded dgemm_small.
AMD-Internal:[CPUPL-2053]
Change-Id: If591152d18282c2544249f70bd2f0a8cd816b94e
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.
This commit adds changes needed for windows build.
AMD-Internal: [CPUPL-2052]
Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when the library is built for any of the
zen architectures.
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
Summary:
1. This commit fixes an issue in the gemv and axpy APIs.
2. The BLIS binary with dynamic dispatch feature was
crashing on non-zen CPUs (specifically CPUs without
AVX2 support).
3. The crash was caused by unsupported instructions
in zen optimized kernels. The issue is fixed by calling
only reference kernels if the architecture detected at
runtime is not zen, zen2 or zen3.
Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
Direct calls to zen kernels replaced by architecture
dependent calls for dotv and amaxv kernels. For non-zen
architecture, generic function is called using the BLIS
interface. For zen architecture, direct calls to zen
optimized kernels are made.
Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.
The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Removed direct calling of zen kernels in cblas source itself.
Similar optimizations are done by the function directly invoked from
Cblas layer.
The BLIS binary with dynamic dispatch feature was crashing on non-zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by un-supported
instructions in zen optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
This commit fixes an issue in the gemm and copy APIs.
The BLIS binary with dynamic dispatch feature was crashing on non-zen
CPUs (specifically CPUs without AVX2 support).
The crash was caused by un-supported instructions in zen optimized kernels.
The issue is fixed by calling only reference kernels if the architecture detected at
runtime is not zen, zen2 or zen3.
AMD-Internal: [CPUPL-1930]
Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
Details:
AMD Internal Id: CPUPL-1702
- When A is of 1x1 dimension, for both the left and
right hand side cases, A's only element is conjugate
transposed by negating its imaginary component.
Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
Details:
AMD Internal Id: CPUPL-1702
- When performing trsm with a conjugate transpose,
the sign of A's imaginary part needs to be
flipped.
- So in the conjugate transpose case, A's imaginary
part is negated before doing trsm.
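The negation itself is simple. A minimal sketch, using a hypothetical struct in place of the library's complex type:

```c
#include <stddef.h>

/* Hypothetical stand-in for a double-precision complex element. */
typedef struct { double real, imag; } sketch_dcomplex;

/* Conjugate n elements in place by flipping the sign of the imaginary part.
   For a 1x1 matrix the transpose is a no-op, so this alone gives A^H. */
void sketch_conj_inplace(size_t n, sketch_dcomplex *a)
{
    for (size_t i = 0; i < n; ++i)
        a[i].imag = -a[i].imag;
}
```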
Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
when (m,n)<1000 and for multithread when (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels
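The size thresholds above can be read as the predicate below. This is an interpretation, not the committed code: it assumes "(m,n)<1000" means both m and n are below 1000, and the function name is hypothetical.

```c
#include <stddef.h>

/* Returns 1 when the small ctrsm path would be chosen, 0 otherwise. */
int sketch_use_ctrsm_small(size_t m, size_t n, int nthreads)
{
    if (nthreads <= 1)
        return (m < 1000 && n < 1000); /* single-thread threshold */
    return (m + n) < 320;              /* multithread threshold */
}
```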
Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- modularized all 16 combinations of strsm into 4 kernels
Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
Details
- The axpyf based implementation incurs function (axpyf) calling overhead.
- The new implementation reduces function calling overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
compared to axpyf based implementation
AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
-- Added number of threads used in DTL logs
-- Added support for timestamps in DTL traces
-- Added time taken by API at BLAS layer in the DTL logs
-- Added GFLOPS achieved in DTL logs
-- Added support to enable/disable execution time and
gflops printing for individual APIs. We may not want
it for all APIs. It will also help us migrate APIs
to execution time and gflops logs in stages.
-- Updated GEMM bench to match new logs
-- Refactored aocldtl_blis.c to remove code duplication.
-- Clean up logs generation and reading to use spaces
consistently to separate various fields.
-- Updated AOCL_gettid() to return correct thread id
when using pthreads.
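For reference, a GFLOPS figure for a real GEMM is conventionally derived from the 2*m*n*k operation count. The helper below is a sketch under that assumption; its name is hypothetical and it is not the DTL code itself:

```c
/* GFLOPS for a real GEMM: 2*m*n*k floating-point operations in `seconds`. */
double sketch_gemm_gflops(double m, double n, double k, double seconds)
{
    if (seconds <= 0.0)
        return 0.0; /* avoid dividing by zero for unmeasurably short calls */
    return (2.0 * m * n * k) / (seconds * 1.0e9);
}
```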
AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single thread
when (m,n)<500 and for multithread when (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels
Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
Details:
The basic idea is that the inverse of a singular matrix doesn't
exist, therefore we should return NAN. The BLAS standard and
BLIS optimize by not doing any compute when x[j] == 0. As a
result, BLIS generates finite values for the inverse
calculation of singular matrices, which is not the right
answer. This commit provides a fix to generate NAN/INF
values in case this API is called to compute inverses of
singular matrices. According to the standard, however, this
API shouldn't be called in the first place; the check for
singularity or near singularity should be done by the
calling application.
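The effect of the x[j] == 0 shortcut can be seen in a small upper-triangular solve. This sketch uses a hypothetical function with a flag to toggle the shortcut; it illustrates the behaviour and is not the patched kernel:

```c
#include <stddef.h>

/* Solve U*x = b in place (x holds b on entry) for an n x n upper-triangular,
   column-major U. When skip_zero is nonzero, columns with x[j] == 0 are
   skipped, which hides a zero on the diagonal. */
void sketch_dtrsv_upper(size_t n, const double *u, size_t ldu,
                        double *x, int skip_zero)
{
    for (size_t j = n; j-- > 0; ) {
        if (skip_zero && x[j] == 0.0)
            continue;
        x[j] /= u[j + j * ldu];          /* 0/0 gives NaN, nonzero/0 gives Inf */
        for (size_t i = 0; i < j; ++i)
            x[i] -= x[j] * u[i + j * ldu];
    }
}
```

With U = [[1, 1], [0, 0]] and b = [1, 0], the shortcut returns the finite vector [1, 0], while the version without it propagates NaN into both entries.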
Change-Id: Iccdbc07744de3892626f4066ee4a63eb30bc06cd
1. Induced Method turned off until the path is fully tested for different alpha, beta conditions.
2. Fixed the case of beta = 0 with C = NAN.
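The beta = 0 case matters because 0 * NaN is still NaN under IEEE arithmetic, so C has to be overwritten rather than scaled. A minimal sketch with a hypothetical helper:

```c
#include <stddef.h>

/* c := beta*c for n elements. With beta == 0 the old contents are
   overwritten instead of multiplied, so NaN or Inf in c cannot leak through. */
void sketch_scale_by_beta(size_t n, double beta, double *c)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (beta == 0.0) ? 0.0 : beta * c[i];
}
```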
Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
Details: Wrapper code is enabled when the cmake option
ENABLE_WRAPPER is selected. This commit also fixes the ScaLAPACK
build error on windows.
AMD-Internal: [CPUPL-1848]
Change-Id: I3d687cbc00e7603fdfb45937a00daf86bd07878e
Details:
BLIS currently supports BLAS and CBLAS interfaces with lowercase names.
With this commit, we also support uppercase symbol names with and
without a trailing underscore, and lowercase symbol names without a
trailing underscore.
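The four spellings can be provided as thin wrappers around one implementation. The routine below is a toy with hypothetical names, shown only to illustrate the naming scheme:

```c
/* Hypothetical routine using the lowercase, trailing-underscore convention.
   Arguments are passed by reference, as Fortran-style BLAS interfaces do. */
void sketch_scal_(const int *n, const double *alpha, double *x)
{
    for (int i = 0; i < *n; ++i)
        x[i] *= *alpha;
}

/* Thin wrappers exporting the other three spellings of the same routine. */
void sketch_scal(const int *n, const double *alpha, double *x)
{
    sketch_scal_(n, alpha, x);
}

void SKETCH_SCAL(const int *n, const double *alpha, double *x)
{
    sketch_scal_(n, alpha, x);
}

void SKETCH_SCAL_(const int *n, const double *alpha, double *x)
{
    sketch_scal_(n, alpha, x);
}
```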
Change-Id: Ibb06121821ab937b25d492409625916f542b2135
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still a lot of places where ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of existing amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
Improved DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at the blas interface to enable
calling bli_dgemm_small when the implied optimum number of threads is 1 (for n and k < 10).
Improved smart threading logic for dgemm.
Added additional conditions at the blas interface to invoke bli_dgemm_small.
Removed N > 3 condition from bli_dgemm_small.
Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
1. single instance case sup is enabled.
2. Env BLIS_SINGLE_INSTANCE should be set to 1 to enable single instance tuning.
AMD-Internal: [CPUPL-1743]
Change-Id: Iadb05a6e9313ac41271c0522da243fd47d80abec
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to separate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
- Added blas interface for dzgemm. This function will call
native implementation of gemm.
- Mixed datatype support is already present in BLIS. But this
implementation requires alpha_imag value to be 0.
- Modified test_gemm.c to support testing of dzgemm.
Change-Id: I496fffdede9f0f778b9a33b405eb6861c6dcc334
Details:
1. Added aocl-dynamic for dtrsm native path
When (m,n)<512 better performance observed for nthreads=4
2. Updated trsm_small threshold such that when (m+n)<320
trsm_small is doing better than native irrespective of
number of threads
Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
Details:
1. Added prefetching of the next micro-panel of A and B in the dgemm block,
which helps reduce load latency and improves performance.
2. Removed unnecessary unrolls in gemm loops, moved the 8x6 and 6x8 core
dgemm into macros and made it more modular.
3. Packing and diagonal packing in main dgemm loops are modularized.
Fringe cases are yet to be modularized.
4. Updated dtrsm small thresholds for single and multi thread cases
5. Updated div/scale based on disable/enable of trsm pre-inversion
6. Code clean up
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
Details:
When parallelization is enabled in BLIS through the environment variables BLIS_?C_NT or
BLIS_?R_NT, dgemm_ runs single-threaded. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1
irrespective of the BLIS_IC_NT or BLIS_JC_NT values. As a result, the dgemm_ interface assumes a single thread and calls
small_gemm, which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c. It returns 1 if parallelization is enabled either through BLIS_?C_NT values or
BLIS_NUM_THREADS, and zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call the parallel dgemm_ or the sequential one.
Added the same fix for zgemm_.
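The check described above could be sketched as follows. The struct is a hypothetical stand-in for the runtime threading object, with -1 meaning "not set"; this is not the actual bli_thread_is_parallel() source:

```c
/* Hypothetical stand-in for the runtime threading info; -1 means "not set". */
typedef struct {
    int num_threads;                 /* OMP_NUM_THREADS / BLIS_NUM_THREADS */
    int jc_nt, ic_nt, jr_nt, ir_nt;  /* BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT, BLIS_IR_NT */
} sketch_rntm_t;

/* Returns 1 if any setting requests more than one thread, 0 otherwise. */
int sketch_thread_is_parallel(const sketch_rntm_t *r)
{
    if (r->num_threads > 1)
        return 1;

    /* The global count may be unset while per-loop ways are specified. */
    return r->jc_nt > 1 || r->ic_nt > 1 || r->jr_nt > 1 || r->ir_nt > 1;
}
```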
Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
-- Fixed issues in printing the values of
side, uploa and diaga parameters for
hemm, hemv, her, her2, her2k, herk,
symm, symv, syr, syr2, syr2k, syrk,
trmm, trmv, trsm, trsv.
-- For the above APIs, logging was called with MKSTR()
for side, uploa and diaga parameters. MKSTR is
needed only for macro arguments, not for
function arguments.
-- Added space between function name and data type
where it was missing. Bench expects logs in
this format.
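The MKSTR problem can be reproduced in a few lines, assuming MKSTR is a plain stringification macro (that definition and the logging helpers are assumptions for this sketch):

```c
#include <stdio.h>

#define MKSTR(s) #s /* stringifies the token it is given, not its value */

/* Wrong: `side` is a function parameter, so MKSTR yields its name. */
void sketch_log_side_wrong(char side, char *buf, size_t len)
{
    (void)side; /* the value is never used, which is the bug */
    snprintf(buf, len, "trsm %s", MKSTR(side)); /* always "trsm side" */
}

/* Right: print the runtime value of the parameter. */
void sketch_log_side_right(char side, char *buf, size_t len)
{
    snprintf(buf, len, "trsm %c", side);
}
```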
AMD-Internal: [CPUPL-1585]
Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d
Details:
- Implemented zaxpyf kernel with fuse factor=4 for zgemv.
- Modified BLAS interface call for zgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.
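The two early exits can be sketched on a real-valued gemv for brevity (the commit concerns zgemv). The function names are hypothetical, and A is assumed column-major with no transpose:

```c
#include <stddef.h>

/* Dot product of a strided vector x (stride incx) with a contiguous vector y. */
double sketch_ddotv(size_t n, const double *x, size_t incx, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i * incx] * y[i];
    return sum;
}

/* y := beta*y + alpha*A*x for a column-major m x n matrix A. */
void sketch_dgemv(size_t m, size_t n, double alpha, const double *a, size_t lda,
                  const double *x, double beta, double *y)
{
    /* alpha == 0: only the scaling of y by beta remains, so return early. */
    if (alpha == 0.0) {
        for (size_t i = 0; i < m; ++i)
            y[i] *= beta;
        return;
    }

    /* y has one element: the product is a single dot product of row 0 with x. */
    if (m == 1) {
        y[0] = beta * y[0] + alpha * sketch_ddotv(n, a, lda, x);
        return;
    }

    /* General case: scale y, then accumulate one column of A at a time. */
    for (size_t i = 0; i < m; ++i)
        y[i] *= beta;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[i] += alpha * a[i + j * lda] * x[j];
}
```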
AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
Details:
1. Added optimized dtrsm kernels for all 8 right side cases
Below are a few notable optimizations which improved performance
a. Loading, transposing (for transa cases), packing and reusing
of a01 block required for GEMM operation. The block size
increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
from one end of A to other end of triangular A
b. Packing of 6 diagonal elements in one location helped to utilize
cache line efficiently
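Item b amounts to gathering strided diagonal entries into one contiguous buffer. A minimal sketch, assuming column-major storage and a hypothetical helper name:

```c
#include <stddef.h>

/* Copy the nb diagonal elements of a column-major block (leading dimension
   lda) into a contiguous buffer d. In the matrix the entries are lda + 1
   elements apart; packed, they sit next to each other in memory. */
void sketch_pack_diag(size_t nb, const double *a, size_t lda, double *d)
{
    for (size_t i = 0; i < nb; ++i)
        d[i] = a[i + i * lda];
}
```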
AMD-Internal: [CPUPL-1563]
Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3