- The block sizes and micro-kernel dimensions for the F32OF32 group
of APIs are updated in the element-wise operations cntx map.
AMD-Internal: [SWLCSG-3390]
Change-Id: Ic5690b7eb4f7b2559d893f374dd811b00e31e329
- Added early-return checks for A/B transpose cases and column-major
inputs, as these are not currently supported.
- Enabled the JIT kernels for the Zen4 architecture.
AMD Internal: [SWLCSG - 3281]
Change-Id: Ie671676c51c739dd18709892414fd34d26a540df
Description:
Implemented a C reference for the
aocl_gemm_unreorder_bf16bf16f32of32 function.
The implementation works for row-major inputs;
column-major support is yet to be enabled.
AMD-Internal: [ SWLCSG-3279 ]
Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a
- Currently the BF16 kernels use the AVX512 VNNI instructions.
In order to support AVX2 kernels, the BF16 input has to be converted
to F32 and then the F32 kernels have to be executed.
- Added un-pack function for the B-Matrix, which does the unpacking of
the Re-ordered BF16 B-Matrix and converts it to Float.
- Added a kernel to convert the matrix data from BF16 to F32 for the
given input.
- Added a new path to the BF16 5LOOP to work with the BF16 data, where
the packed/unpacked A matrix is converted from BF16 to F32. The
packed B matrix is converted from BF16 to F32 and the re-ordered B
matrix is unre-ordered and converted to F32 before feeding to the
F32 micro kernels.
- Removed AVX512 condition checks in BF16 code path.
- Added the Re-order reference code path to support BF16 AVX2.
- Currently the F32 AVX2 kernels support only F32 bias.
Added BF16 support for the bias post-op in F32 AVX2 kernels.
- Bug fix in the test input generation script.
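The BF16-to-F32 conversion step described above can be illustrated with a
minimal sketch. It assumes the standard bfloat16 layout (the upper 16 bits
of an IEEE-754 binary32 value); the function names are hypothetical and are
not the library's kernels.

```python
import struct

def bf16_bits_to_f32(bits):
    """Widen a 16-bit bfloat16 pattern to float by zero-filling the low 16 bits."""
    word = (bits & 0xFFFF) << 16
    return struct.unpack("<f", struct.pack("<I", word))[0]

def f32_to_bf16_bits_truncate(value):
    """Truncating f32 -> bf16 (drops the low mantissa bits); handy for test inputs."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16

def convert_matrix_bf16_to_f32(matrix_bits):
    """Convert a row-major matrix of bf16 bit patterns to floats."""
    return [[bf16_bits_to_f32(b) for b in row] for row in matrix_bits]
```

Because the widening is a pure bit shift, it is exact; only the reverse
direction loses precision.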
AMD Internal : [SWLCSG - 3281]
Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
Description:
1. Added new output types for the f32 element-wise APIs to support
s8, u8, s32 and bf16 outputs.
2. Updated the base f32 API to support all the post-ops supported in
the gemm APIs.
AMD Internal: [SWLCSG-3384]
Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
- Add missing xmm, ymm and k registers to clobber lists
in bli_dgemmsup_rv_zen4_asm_24x8m.c
- Add missing ymm1 in bli_dgemmsup_rv_zen4_asm_24x8m.c
bli_gemmsup_rv_haswell_asm_d6x8m.c and bli_gemmsup_rd_zen_s6x64.c
- Also change formatting in bli_copyv_zen4_asm_avx512.c
bli_dgemm_avx512_asm_8x24.c and bli_zero_zmm.c to make
automatic processing of clobber lists easier.
AMD-Internal: [CPUPL-5895]
Change-Id: If05a3f00e6c0f9033eeced5de165ba4c3128b3e5
Optionally enable parallelism inside gtestsuite so that we can
check BLIS functions perform correctly when nested parallelism
is in operation. Enable with:
cmake ... -DOPENMP_NESTED={0,1,2,1diff}
where in gtestsuite
- 0 is the default choice with no parallelism.
- 1 and 2 are simple nested parallelism.
- 1diff has one level of parallelism setting different numbers
of threads to be used by BLIS and reference library calls
from each gtestsuite thread.
Note: OMP_NUM_THREADS must be set appropriately to enable or
disable parallelism at each level in the test programs
as desired.
OMP_NUM_THREADS will also define the parallelism used
within the BLIS library (if it is multithreaded), unless
BLIS-specific ways of specifying parallelism have been
used. If a BLIS-specific parallelism option has been set,
the same mechanism will be used in the 1diff option to
vary the number of threads in BLIS per application thread.
AMD-Internal: [CPUPL-3902]
Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771
-The following S16 APIs are removed:
1. aocl_gemm_u8s8s16os16
2. aocl_gemm_u8s8s16os8
3. aocl_gemm_u8s8s16ou8
4. aocl_gemm_s8s8s16os16
5. aocl_gemm_s8s8s16os8
along with the associated reorder APIs and corresponding
framework elements.
AMD-Internal: [CPUPL-6412]
Change-Id: I251f8b02a4cba5110615ddeb977d86f5c949363b
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
Various occurrences of the following compiler warnings have been
fixed:
* Type mismatch
* Misleading code indentation
* Array bounds violation warning in blastest when using gcc 11
without -fPIC flag
AMD-Internal: [CPUPL-5895]
Change-Id: Ia5d5310b76a66e87ad3953a72e8472ed5b01e588
- Currently the BF16 API uses the 5 loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
- In order to address this, a new path without the OMP loop and with
just the NR loop over the micro-kernel is introduced for tiny inputs.
This is only applied when the num threads set for GEMM is 1.
- Only row-major inputs are allowed to proceed with tiny GEMM.
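The conditions for taking the tiny path can be summarized as a single
predicate. This is a sketch only: the function and parameter names are
hypothetical, and the thresholds are the ones quoted in this message.

```python
def use_tiny_gemm_path(m, n, num_threads, row_major, m_max=36, n_max=128):
    """Return True when the input qualifies for the tiny path.

    Assumed conditions, taken from the description: single-threaded GEMM,
    row-major inputs, and m <= 36, n <= 128.
    """
    return row_major and num_threads == 1 and m <= m_max and n <= n_max
```

For example, `use_tiny_gemm_path(16, 64, 1, True)` qualifies, while the same
sizes with more than one thread fall back to the regular 5 loop path.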
AMD-Internal: [SWLCSG-3380, SWLCSG-3258]
Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
Details:
- Added a new python script that can test all microkernels
along with post-ops.
- Modified post_op freeing function to avoid memory leaks.
Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7
- Currently the F32 API uses the 5 loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
- In order to address this, a new path without the OMP loop and with
just the NR loop over the micro-kernel is introduced for tiny inputs.
This is only applied when the num threads set for GEMM is 1.
AMD-Internal: [SWLCSG-3380]
Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
- Updated the format specifiers to have a leading space,
in order to delimit the outputs appropriately in the
output file.
- Further updated every source file to have a leading space
in its format string occurring after the macros.
AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
- Added 32x3n n-biased kernels to directly handle the cases where n=3
which were earlier being handled by the primary n-biased, 32x8n,
kernel.
- Modified the n-biased fringe kernels to further handle the smaller
m-fringe cases. Thus, now the kernels handle the following range of m
for any value of n:
- 16x8n : m = [16, 31]
- 8x8n : m = [8, 15]
- m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
granularity to invoke the smaller fringe cases directly on the basis
of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.
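The m-based dispatch described above can be sketched as follows. The
function name is hypothetical, the returned strings are just the kernel
labels used in this message, and the ranges are assumed to be inclusive so
that every m from 1 to 31 maps to exactly one fringe kernel. The n = 3
special case (32x3n) is left out.

```python
def select_n_biased_kernel(m):
    """Pick the n-biased kernel label for a given m (illustrative only)."""
    if m < 1:
        raise ValueError("m must be positive")
    if m >= 32:
        return "32x8n"      # primary kernel
    if m >= 16:
        return "16x8n"      # m in [16, 31]
    if m >= 8:
        return "8x8n"       # m in [8, 15]
    return "m_leftx8n"      # m in [1, 7]
```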
AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick the efficient code path for ZEN5.
AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
Description:
1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. The s32 accumulator registers are converted to
float at the end of the k loop, and all the post-ops then operate
on f32.
2. Added the s8s8s32ou8 API, which uses the s8s8s32os32 kernels but
stores the output in u8.
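A scalar sketch of the flow described above: the s32 accumulator is
converted to float, post-ops run in float, and the result is stored as u8.
The specific post-ops (scale, bias, ReLU), their order, and the
round-to-nearest-even plus saturation on store are assumptions made for
illustration, not the kernel's actual sequence.

```python
def postop_and_store_u8(acc_s32, scale=1.0, bias=0.0, relu=False):
    """Apply float post-ops to one s32 accumulator value and store it as u8."""
    x = float(acc_s32)          # s32 -> f32 at the end of the k loop
    x = x * scale + bias        # assumed scale and bias post-ops, in float
    if relu:
        x = max(x, 0.0)         # assumed ReLU post-op, in float
    q = round(x)                # Python rounds halves to the nearest even integer
    return min(255, max(0, q))  # saturate to the u8 range
```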
AMD-Internal - SWLCSG-3366
Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
Updated the Makefile and the main CMakeLists.txt to replace the absolute path within library object files with a specified path.
In files where the __FILE__ macro is used, the absolute path was appearing in the library. This is now replaced with a relative path.
AMD-Internal: [CPUPL-5910]
Change-Id: Iac63645348a7f8214123fdcd3675670eedd887e3
- Modified bench to support testing of different types of buffers
for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
type.
Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
- Currently when m is small compared to n, even if MR blks (m / MR) > 1,
and total work blocks (MR blks * NR blks) < available threads, the
number of threads assigned for m dimension (ic ways) is 1. This results
in subpar performance in bandwidth-bound cases. To address this, the
thread factorization is updated to increase ic ways for these cases.
AMD-Internal: [SWLCSG-3333]
Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
Details:
- When using regexes in Python, certain characters need backslash escaping, e.g.:
```python
regex = re.compile( '^[\s]*#include (["<])([\w\.\-/]*)([">])' )
```
However, technically escape sequences like `\s` are not valid and should actually be double-escaped: `\\s`.
Python 3.12 now warns about such escape sequences, and in a later version these warnings will be promoted
to errors. See also: https://docs.python.org/dev/whatsnew/3.12.html#other-language-changes. The fix here
is to use Python's "raw strings" to avoid double-escaping. This issue can be checked for all files in the current
directory with the command `python -m compileall -d . -f -q .`
- Thanks to @AngryLoki for the fix.
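For illustration, the same pattern written as a raw string looks like this.
The helper name `parse_include` is hypothetical and only serves to exercise
the regex.

```python
import re

# Raw string: backslashes reach the regex engine unchanged, so `\s`, `\w`,
# `\.` and `\-` need no double-escaping and trigger no warning.
include_regex = re.compile(r'^[\s]*#include (["<])([\w\.\-/]*)([">])')

def parse_include(line):
    """Return the included header path, or None if the line is not an #include."""
    match = include_regex.match(line)
    return match.group(2) if match else None
```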
AMD-Internal: [CPUPL-5895]
Change-Id: I7ab564beef1d1b81e62d985c5cb30ab6b9a937f2
(cherry picked from commit 729c57c15a)
This reverts commit a028108cbb.
Reason for revert: With libgomp, scalability issues were observed with a
higher number of threads, leading to the use of fewer
threads. However, with different OpenMP libraries
like libomp, this scalability issue was not observed,
and using fewer threads resulted in performance loss.
The AOCL dynamic logic has been updated to select a
higher number of threads, considering the iomp OpenMP
library.
Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f
Added separate package configuration files for the
st and mt libraries in the BLIS Makefile and CMakeLists.txt.
Change-Id: I8d851fac10d63983358e1f4c67fd9451246056bf
- Added a conditional check to invoke the vectorized
DCOPYV kernels directly (fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal for
single-threaded execution. Thus, we avoid the call to
bli_nthreads_l1() function to set the ideal number of threads.
- Used macros to protect the declaration of fast_path_thresh in
DAXPYV API to avoid compiler warnings.
AMD-Internal: [CPUPL-4875][CPUPL-5895]
Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb
- Mixed precision datatypes use a modified cntx.
- For some variants of mixed precision, the complex and real blocksizes
need to be the same. This is achieved by creating a local copy of
cntx and copying complex blocksizes onto real blocksizes.
- When dynamic blocksizes are used, the changes made to the
blocksizes for mixed precision are overwritten by changes made
by dynamic blocksizes.
- This mismatch between complex and real blocksizes causes an issue
where the pack buffer is allocated based on complex blocksizes but
the amount of data packed is based on real blocksizes.
- This makes the pack buffer sizes smaller than the required sizes.
- To fix this, dynamic blocksizes are disabled for mixed precision.
AMD-Internal: [CPUPL-6384]
Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83
- Reduced the blocking size of 'bli_ddotv_zen_int10'
kernel from 40 elements to 20 elements for better
utilization of vector registers
- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
kernel with 'if' conditions to handle remainder
iterations, as only a single iteration is used when the
remainder is less than the primary unroll factor.
- Added a conditional check to invoke the vectorized
DDOTV kernels directly (fast-path), without incurring
any additional framework overhead.
- The fast-path is taken when the input size is ideal
for single-threaded execution. Thus, we avoid the
call to bli_nthreads_l1() function to set the ideal
number of threads.
- Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10'
kernel.
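The 'for'-to-'if' change can be illustrated with a scalar sketch of a
blocked dot product. The primary block of 20 elements comes from this
message; the fringe sizes 16, 8 and 4 and all names are assumptions chosen
so that each fringe step can run at most once.

```python
PRIMARY_UNROLL = 20  # block size from the description

def _dot_span(x, y, start, count):
    """Dot product of `count` consecutive elements starting at `start`."""
    return sum(x[i] * y[i] for i in range(start, start + count))

def ddot_sketch(x, y):
    n = len(x)
    acc = 0.0
    i = 0
    # Primary loop: one full block of 20 elements per iteration.
    while n - i >= PRIMARY_UNROLL:
        acc += _dot_span(x, y, i, PRIMARY_UNROLL)
        i += PRIMARY_UNROLL
    # Fewer than 20 elements remain, so each fringe size applies at most
    # once: an 'if' is enough where a 'for' loop was used before.
    if n - i >= 16:
        acc += _dot_span(x, y, i, 16)
        i += 16
    if n - i >= 8:
        acc += _dot_span(x, y, i, 8)
        i += 8
    if n - i >= 4:
        acc += _dot_span(x, y, i, 4)
        i += 4
    # Scalar tail: at most 3 elements.
    while i < n:
        acc += x[i] * y[i]
        i += 1
    return acc
```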
AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
libFLAME calls DAMAX kernel directly. Now that AVX512 version
has been enabled in BLIS cntx, export this symbol.
AMD-Internal: [CPUPL-5895]
Change-Id: I4c74150578f49eb643b0f68c6cc32ee2bb23bec2
- In the existing code, blocksizes for sizes where M >> K, N >> K and K < 500
were not tuned properly for cases where an application uses more than
one instance of BLIS in parallel.
- Improved DGEMM performance for sizes where M, N >> K by retuning blocksizes.
Such sizes are used by applications like HPL.
AMD-Internal: [SWLCSG-3338]
Change-Id: Iec17ecc53a6fabf50eedacaf208e4e74a4e21418
- Blocksizes for sizes where M >> K, N >> K and K < 500 were tuned by running
blis bench on only one MPI rank. Blocksizes tuned this way are not performing
well for all configurations.
- Retuned the blocksizes so that performance is good for such skinny sizes.
AMD-Internal: [CPUPL-6362]
Change-Id: I89c61889df2443ef6bf0e87bf89263768b5c00c1
- Implemented the feature to benchmark ?ASUMV APIs
for the supported datatypes. The feature allows
benchmarking the BLAS, CBLAS or native BLIS API, based
on the macro definition.
- Added a sample input file to provide examples to benchmark
ASUMV for all its supported datatypes.
AMD-Internal: [CPUPL-5984]
Change-Id: Iff512166545687d12504babda1bd52d71a3a5755
- Corrected the format specifier setting (as a macro) to not
include additional spaces, since this would cause incorrect
parsing of input files (in case they have exactly the expected
number of parameters and not more).
- Updated the inputgemm.txt file to contain some inputs that
have the exact parameters, to validate this fix.
AMD-Internal: [CPUPL-6365]
Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49
Description:
1. Support has been added to scale buffer values using both scalar and
vector scale factors before matrix add or matrix mul post-ops.
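A minimal sketch of scaling the post-op buffer before it is added to or
multiplied with the result. The function names are hypothetical, and the
vector scale factor is assumed to hold one entry per column; the message
does not state which dimension it spans.

```python
def _scaled(buf, scale, n):
    """Scale `buf` by a scalar, or by an assumed per-column vector of length n."""
    factors = [float(scale)] * n if isinstance(scale, (int, float)) else list(scale)
    if len(factors) != n:
        raise ValueError("vector scale factor must have one entry per column")
    return [[f * b for f, b in zip(factors, row)] for row in buf]

def matrix_add_postop(acc, buf, scale):
    """acc + scale * buf, element-wise."""
    scaled = _scaled(buf, scale, len(acc[0]))
    return [[a + s for a, s in zip(ra, rs)] for ra, rs in zip(acc, scaled)]

def matrix_mul_postop(acc, buf, scale):
    """acc * (scale * buf), element-wise."""
    scaled = _scaled(buf, scale, len(acc[0]))
    return [[a * s for a, s in zip(ra, rs)] for ra, rs in zip(acc, scaled)]
```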
AMD-Internal: CPUPL-6340
Change-Id: Ie023d5963689897509ef3d5784c3592791e57125
- Replaced the switch case with if-else: the lookup table for the
switch case is placed at the end of the .text section, which causes
a huge jump.
- Reduced the number of branches for tiny sizes.
- The cpuid query is slow, so added a new if statement which avoids
the cpuid query for tiny sizes (< 200).
- Redirected tiny sizes to the AVX2 kernel.
AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
- This patch reverts the previous changes that removed the enforcement
of dgemm inputs under a certain threshold to be processed by kernels
selected based on architecture ID and handled in single-threaded mode.
- This change is now forcing such small inputs to be computed in tiny
path. Previously when this check was not there, it was routing these
inputs to SUP path and causing performance regression due to framework
overhead.
AMD-Internal: [CPUPL-5927]
Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb
Description:
1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. The s32 accumulator registers are converted to
float at the end of the k loop, and all the post-ops then operate
on f32.
2. Added the u8s8s32ou8 API, which uses the u8s8s32os32 kernels but
stores the output in u8.
AMD-Internal - SWLCSG-3366
Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
- Bug : When configuring our library with the native
BLIS integer size being 32, the bench application
would crash or read an invalid value when parsing
the input file. This is because of a mismatch
in the format specifier, which was hard-coded in the
Makefile.
- Fix : Defined a header that sets the format specifiers
as macros with the right matching, based on how we
configure and build the library. This header is
expected to be included in every source file for
benchmarking.
AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
In case the executable to obtain the BLIS library version fails,
catch and report common errors to help with debugging.
Also correct the test for bli_info_get_info() support to mark
that it is not available in any AOCL version <= 4.1
AMD-Internal: [CPUPL-4500]
Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8
- Guarded the inclusion of thresholds(configuration
headers) using macros, to maintain uniformity in
the design principles.
- Updated the threshold macro names for every
micro-architecture.
AMD-Internal: [CPUPL-5895]
Change-Id: I9fd193371c41469d9ef38c37f9c055c21457b56c
Description:
1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs,
where the inputs are uint8/int8 and the processing is done using
VNNI but the output is stored in f32 and bf16 formats. All the int8
kernels are reused and updated with the new output data types.
2. Added F32 data type support in bias.
3. Updated the bench and bench input file to support validation.
AMD-Internal: SWLCSG-3335
Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize these kernels. This
is currently instantiated for 'Z' datatype (double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, it is extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Using an 'if' condition instead of a 'for' loop to handle fringe
cases. A 'for' loop is redundant for handling remainder iterations,
as only a single iteration is used when the remainder is less than
the primary unroll factor.
AMD-Internal: [CPUPL-5594]
Change-Id: I8cebc037742ee47961869e22e2471e550fcd99e9
- Added support for gemv kernels unit test in gtestsuite.
- Added micro-kernel tests and memory tests for DGEMV
transpose case kernels.
AMD-Internal: [CPUPL-5835]
Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
AVX2 kernels for Zen1/2/3 architectures. These kernels are written
from the ground up and are independent of fused kernels.
- The DGEMV primary kernel processes the calculation in chunks of
8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
kernels, which are invoked by the primary kernel as needed.
- Implemented the kernels by computing the dot product of matrix A
columns with vector x in chunks of 32 elements, storing the results
in accumulator registers. Fringe elements are handled in chunks
of 16, 8, etc. The data in the accumulator registers is then reduced
and added to vector y.
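The blocking scheme described above can be sketched in scalar form. This
is an illustration only: the function name is hypothetical, A is passed as
a list of columns, alpha/beta scaling is omitted, and fringe rows are
handled by a shorter final chunk rather than by dedicated 16- and
8-element steps.

```python
def dgemv_t_sketch(a_cols, x, y, col_block=8, row_block=32):
    """Compute y[j] += dot(A[:, j], x), 8 columns and 32 rows at a time."""
    m = len(x)
    n = len(a_cols)
    for j0 in range(0, n, col_block):
        cols = a_cols[j0:j0 + col_block]   # fewer than 8 columns is the fringe case
        acc = [0.0] * len(cols)            # one accumulator per column
        for i0 in range(0, m, row_block):
            i1 = min(i0 + row_block, m)    # last chunk may be shorter
            for c, col in enumerate(cols):
                acc[c] += sum(col[i] * x[i] for i in range(i0, i1))
        # Reduce step: add the accumulated dot products to y.
        for c, value in enumerate(acc):
            y[j0 + c] += value
    return y
```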
AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
-When A matrix is packed, it is packed in blocks of MRxKC, to form a
whole packed MCxKC block. If the m value is not a multiple of MR, then
the m % MR block is packed in a different manner as opposed to the MR
blocks. Subsequently the strides of the packed MR block and m % MR
blocks are different and the same needs to be updated when calling the
GEMV kernels with packed A matrix.
-Fixes to address compiler warnings.
AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
incurred when the single-threaded path is optimally performant.
AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
Details:
- Disabled intrinsics code of f32obf16 pack function
for gcc < 11.2 as the instructions used in kernels
are not supported by the compiler versions.
- Added early-return check for WOQ APIs when compiling with
gcc < 11.2
- Fixed code to check whether JIT kernels are generated inside
batch_gemm API for bf16 datatype.
AMD Internal: [CPUPL-6327]
Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
We want bli_thread_get_num_threads() and bli_thread_get_*_nt()
to report the threading values modified to reflect what will
be in effect given OpenMP nesting and active levels. This was
lost in commit 0c6d006225 for
bli_thread_get_num_threads() and wasn't previously implemented
in bli_thread_get_*_nt()
AMD-Internal: [CPUPL-6168]
Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd
Description:
Added _mm512_cvtps_epi32 for bf16 to s32 conversion in gemv APIs.
AMD-Internal: SWLCSG-3302
Change-Id: I7e3e6da8f50d1f7177629cb68ac21e3bbce40bee