- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
zen, skx and a couple of other kernels to cover all
contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
statements.
AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
- This change in made in MAKE build system.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsic then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Previous commit (75df1ef218) contains
similar changes on cmake build system
AMD-Internal: [CPUPL-5544]
Change-Id: I796e89f3fb2116d64c3a78af2069de20ce92d506
- This change in made in CMAKE build system only.
- Removed -fno-tree-loop-vectorize from global kernel flags,
instead added it to lpgemm specific kernels only.
- If this flag is not used , then gcc tries to auto
vectorize the code which results in usages of
vector registers, if the auto vectorized function
is using intrinsics then the total numbers of vector
registers used by intrinsic and auto vectorized
code becomes more than the registers
available in machine which causes read and writes
to stack, which is causing regression in lpgemm.
- If this flag is enabled globally, then the files which
do not use any intrinsic code do not get auto
vectorized.
- To get optimal performance for both blis and lpgemm,
this flag is enabled for lpgemm kernels only.
Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4
- Updated the existing code-path for ?AXPBYV to
reroute the inputs to the appropriate L1 kernel,
based on the alpha and beta value. This is done
in order to utilize sensible optimizations with
regards to the compute and memory operations.
- Updated the typed API interface for ?AXPBYV to include
an early exit condition(when n is 0, or when alpha is
0 and beta is 1). Further updated this layer to query
the right kernel from context, based on the input values
of alpha and beta.
- Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV,
?SCALV, ?SCAL2V and ?COPYV) to be used as part of special
case handling in ?AXPBYV.
- Moved the early return with negative increments from ?SCAL2V
kernels to its typed API interface.
- Updated the zen, zen2 and zen3 context to include function
pointers for all these vector kernels.
- Updated the existing ?AXPBYV vector kernels to handle only
the required computation. Additional cleanup was done to
these kernels.
- Added accuracy and memory tests for AVX2 kernels of ?SETV
?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs
- Updated the existing thresholds in ?AXPBYV tests for complex
types. This is due to the fact that every complex multiplication
involves two mul ops and one add op. Further added test-cases
for API level accuracy check, that includes special cases of
alpha and beta.
- Decomposed the reference call to ?AXPBYV with several other
L1 BLAS APIs(in case of the reference not supporting its own
?AXPBYV API). The decomposition is done to match the exact
operations that is done in BLIS based on alpha and/or beta
values. This ensures that we test for our own compliance.
AMD-Internal: [CPUPL-4861]
Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6
- Implemented AVX512 kernels for handling the calls to ZGEMV
with no-transpose to A matrix.
- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
The set of ZAXPYF kernels include those with fuse-factor 8
(main kernel), 4 and 2(fringe kernels).
- Updated the bli_zgemv_unf_var2( ... ) function to set
the function pointers to these kernels, based on the
configuration. Further added the call to ZSETV at this
layer in case beta is 0.
AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
Updated compiler id in cmake related files from
CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID
AMD-Internal: [CPUPL-2748]
Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371
Some text files were missing a newline at the end of the file.
One has been added.
AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
- Added 2x6 ZGEMM row-preferred kernel.
- Kernel supports prefetch_a, prefetch_b,
prefetch_a_next and prefetch_b_next.
- Multiple Ways to prefetch c are supported.
- prefetch_a and prefetch_c are enabled by
default.
- K loop is divided into multiple subloops for
better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
use of these kernels for TRSM in zen3 and
zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
windows build.
AMD-Internal: [CPUPL-3781]
Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.
AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
Improvements to zen make_defs.mk files:
* Add -znver4 flag for GCC 13 and later.
* Add AVX512 flags or -znver4 as appropriate for upstream LLVM
in config/zen4/make_defs.mk to enable BLIS to be build with
LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
one (zen->zen2->zen3->zen4), when they should be independent
of each other. Correct by including config/zen/amd_config.mk
in all zen make_defs.mk files to reinitialize the compiler
flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
this is already specified in amd_config.mk (and should
be the default setting anyway).
* Tidy files to simplify nested if structures and be more
consistent with one another.
AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
-Certain sections of the f32 avx512 micro-kernel were observed to
slow down when more post-ops are added. Analysis of the binary
pointed to false dependencies in instructions being introduced in
the presence of the extra post-ops. Addition of vzeroupper at the
beginning of ir loop in f32 micro-kernel fixes this issue.
-F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added.
-Alpha scaling (multiply instruction) by default was resulting in
performance regression when k dimension is small and alpha=1 in s32
micro-kernels. Alpha scaling is now only done when alpha != 1.
-s16 micro-kernel performance was observed to be regressing when
compiled with gcc for zen3 and older architecture supporting avx2.
This issue is not observed when compiling using gcc with avx512
support enabled. The root cause was identified to be the -fgcse
optimization flag in O2 when applied with avx2 support. This flag is
now disabled for zen3 and older zen configs.
AMD-Internal: [CPUPL-3067]
Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c
- Added SCAL2V kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero
or incx <= 0 or incy <= 0.
- The kernel takes one among the two available paths based on
conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
- Added the new SCAL2V file from the CMAKE list.
AMD-Internal: [CPUPL-2773]
Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SEE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.
AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
is no need to perform computation hence return early.
- When alpha is zero expert interface of ZSETV is invoked. In this case,
all the elements of the input vector are set 0.
- Invocation of expert interface means that NULL pointer can be passed
to the function in place of context. Expert interface of ZSETV will
query the context and get the approriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
BLAS macro interface only for single complex type and single complex,
float mixed type
AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
-Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics
code) when compiled using gcc. The presence of -fschedule-insns +
-fschedule-insns2 + -ftree-pre in O2 compiler optimization flags
results in the code being optimized to reduce data stalls, and results
in the usage of stack to store intermediate C register output. Disabling
-ftree-pre in gcc fixes the issue, even in the presence of the other
two flags.
AMD-Internal: [CPUPL-2971]
Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463
Corrections for some occurances of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments
AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
-Implemented (r)ow preferential (d)ot product milli-kernels
(m and n variants) for dcomplex datatype along SUP path.
-These computational kernels extend the support for handling RRC and
CRC storage schemes along the SUP path. In case of BLAS api call,
it corresponds to the input cases with transa equal to T and
transb equal to N.
-In case of the B matrix being packed(conditionally), the inputs are
redirected to the existing (r)ow preferential (v)ector load optimized
kernels due to better performance.
-Added macro for vhsubpd assembly instruction, to support the arithmetic
for complex datatype in its interleaved storage.
AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
files
AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
cases are handled serially for cdotxv kernel
- Added declaration in zen contexts
AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context
AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
Details:
- Intrinsic implementation of ZAXPY2V fused kernel for AVX2
- Updated definitions in zen contexts
AMD-Internal: [CPUPL-2023]
Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
one of the three ways
-- It is updated to work across platforms.
-- Added in architecture/feature specific runtime checks.
-- Duplicated in AMD specific files. Build system is updated to
pick AMD specific files when library is built for any of the
zen architecture
AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
- Optimized dotxf implementation for double
and single precision complex datatype by
handling dot product computation in tile 2x6
and 4x6 handling 6 columns at a time, and rows
in multiple of 2 and 4.
- Dot product computation is arranged such a way
that multiple rho vector register will hold the
temporary result till the end of loop and finally
does horizontal addition to get final dot product
result.
- Corner cases are handled serially.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1975]
Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098
- Optimized axpy2v implementation for double
datatype by handling rows in mulitple of 4
and store the final computed result at the
end of computation, preventing unnecessary
stores for improving the performance.
- Optimal and reuse of vector registers for
faster computation.
AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
Details:
- Intrinsic implementation of axpbyv for AVX2
- Bench written for axpbyv
- Added definitions in zen contexts
AMD-Internal: [CPUPL-1963]
Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539
Details :
- Accuracy failures observed when fast math and ILP64 are enabled.
- Disabling the feature with macro BLIS_ENABLE_FAST_MATH .
AMD-Internal: [CPUPL-1907]
Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
1. Added new kernel bli_dnorm2fv_unb_var1 kernel to compute
norm with dot operation.
2. Added vectorization to compute square of 32 double element
block size from vector X.
3. Defined a new Macro BLIS_ENABLE_DNRM2_FAST under config header
to compute nrm2 using new kernel.
4. Dot kernel definitions and implementation have a possibility for
accuracy issues .we can switch to traditional implementation by
disabling the MACRO BLIS_ENABLE_DNRM2_FAST to compute L2-norm
for Vector X .
AMD-Internal: [CPUPL-1757]
Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
-- Added -march=znver3 flag if the library is built for zen3
configuration with gcc compiler version 11 or above.
-- Replaced hardcoded compiler names 'gcc' and 'clang' with
variable $CC so that options are chosen as per the compiler
specified at configure time (instead of compiler in path).
AMD-Internal: [CPUPL-1823]
Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still lot of places were ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of exiting amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
adding threshold functions under a macro enables these functions for
all the configs under a family. This can be avoided by adding function
pointers to cntx which can be queried from cntx during run-time
based on the config chosen.
Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
Details:
1. Added prefetching next micro-panel of A and B in dgemm block,
which are helping in reducing load latency and improved performance.
2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
dgemm into macros and made it more modular
3. Packing and diagonal packing in main dgemm loops are modularized.
Fringe cases are yet to modularize.
4. Updated dtrsm small thresholds for single and multi thread cases
5. Updated div/scale based on disable/enable of trsm pre-inversion
6. Code clean up
Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
- AOCL Dynamic feature is added in BLIS which determines optimal
number of threads for the current problem size.
- This feature can be enabled/disabled by modifying the source
code
- This change adds support to enable/disable this feature during
configuration time by adding a new option in configure script
AOCL-Internal : [CPUPL-1565]
Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, Optimum number of threads to solve DGEMM
is estimated based on the dimensions (M,N,K).
- Range of optimum number of threads will be [1, num_threads],
where "num_threads" is number of threads set by the application.
- Num_threads is derived from either environment variable "OMP_NUM_THREADS
or BLIS_NUM_THREADS' or bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
global_rntm data structure remains unchanged in order to keep track of
original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
number of threads recommended by the application.
AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
1. NC and MC values are tuned for both single-instance and multi-instance run.
2. zen2 and zen3 configs updated.
3. SUP path disabled for zgemm, since tuned native path performed better.
To be re-enabled after setting right threshold for SUP selection.
AMD-Internal: [CPUPL-1442]
Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239
znver3 flag will be enabled if compiler is AOCC Clang version 3.0
and configuration is zen3
Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39
AMD-Internal: [CPUPL-1353]
Block sizes (MC, KC, NC) for DGEMM are determined at runtime
based on following parameters
- Single or multithreaded build
- Processor Architecture (currently support only zen3)
- Number of threads requested while running the library
Change-Id: Ia793484b77adb87486e630d0d3b4c7856ae52094
AMD-Internal: [CPUPL-660, CPUPL-661]
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
Details
- Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
- Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.
AMD-Internal: [CPUPL-1231]
Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
Details:
- Kernel is called directly from API call to avoid framework overhead in case of complex float and complex double precisions.
- Added SIMD code for complex float and complex double and unrolled for loop 5 times to improve performance
AMD-Internal: [CPUPL-1057]
Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
accessible to all zen family architectures.
Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
Some of the SUP kernels now use rbp register.
This register was also used by compiler to support
automatic function call tracing, which was creating
the conflict. Automatic call tracing feature is
removed for now. If needed it can be enabled for
non kernel code.
Change-Id: Ib7ad00875f501ee2ad552cbb2ecdc245002d63b7
AMD-Internal: [CPUPL-1135]
- User can now specify zen3 configuration,
currently it reuses block sizes and kernels from zen2.
- Auto configuration can detect and enable if zen3 config is needed
- Added support for amd64 bundle which contains all zen platforms
- Moved exiting amd bundle to amd64 legacy.
AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957