Display CC, CXX compiler information and version details to help users
identify which compiler was used to build BLIS.
Changes:
- Makefile: Add CC, CXX, and CFLAGS to 'make showconfig' output
- bench_getlibraryInfo.c: Add runtime compiler detection with version
* Properly detect AOCC (AMD Optimizing C/C++ Compiler)
* Detect Clang, GCC, Intel ICC/oneAPI, and MSVC
* Show compiler version information
* Detection order ensures AOCC is identified correctly (not as GCC)
The showconfig output now includes:
CC (C compiler): clang
CXX (C++ compiler): clang++
CFLAGS: <flags>
The bench_getlibraryInfo program now shows:
== Build Information ==
C Compiler (CC): AOCC x.x.x (Clang x.x.x based)
C++ Compiler (CXX): N/A (compiled as C)
This provides both build-time and runtime compiler information for
better build diagnostics and issue reporting.
Issue: CPUPL-7678
Many of the bench applications were not taking n_repeats as a command-line argument (while others were).
Also added the feature to skip the rest of the line (at the end) while reading bench inputs, for all APIs.
---------
Co-authored-by: Rayan <rohrayan@amd.com>
* DTL Log update
Updates logs with nt and AOCL Dynamic selected nt for axpy, scal and dgemv
Modified bench_gemv.c to be able to process the modified DTL logs.
* Updated DTL log for copy routine with actual nt and dynamic nt
* Refactor OpenMP pragmas and clean up code
Removed unnecessary nested OpenMP pragma and cleaned up function end comment.
* Fixed DTL log for sequential build
* Added thread logging in bla_gemv_check for invalid inputs
---------
Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
Some files have copyright statements but not details of the license.
Add this to DTL source code and some build and benchmark related
scripts.
AMD-Internal: [CPUPL-6579]
- Change begin_asm and end_asm comments and unused code in files
kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
to avoid problems in the clobber-checking script.
- Add missing clobbers in files
kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.
AMD-Internal: [CPUPL-6579]
Bug Fix in tpsv and tpmv - integer overflow
When the BLAS integer size is 32 bits, which is the case for LP64 binaries,
the computation of "n * (n+1) / 2" overflows because the intermediate
product "n * (n + 1)" exceeds the 32-bit range. This overflow is now fixed.
Fix: Replace bla_integer with dim_t for variables used for indexing
packed matrix (AP) in TPSV and TPMV functions. This prevents
overflow when computing kk = n*(n+1)/2 for large matrices.
In addition, we added a new file under bench called bench_getlibraryInfo.c, which prints
all the information related to the BLIS binary.
- Fixes ctpsv_, dtpsv_, stpsv_, ztpsv_
- Fixes ctpmv_, dtpmv_, stpmv_, ztpmv_
- Maintains BLAS compatibility
- The A matrix by default isn't expected to be packed for a normal row-stored
case; hence the packing implementation is incomplete.
- But if the user explicitly enables packing, the interface wasn't handling
the condition appropriately, leading to data overwriting inside the incomplete
pack kernels and thereby to accuracy failures.
- As a fix, updated the interface to set the explicit PACK A to UNPACKED and
proceed with GEMM in cases where a transpose of A is not necessary.
- Updated the batch gemm input file with additional test cases covering all the
APIs.
Bug Fixes:
- Fixed the implementation logic for column-major inputs with post-ops disabled
in S8 batch mat-mul. With the existing implementation, column-major inputs wouldn't
be executed in case of of32/os32 inputs.
- Fixed the Scale/ZP calculation in bench for the u8s8s32ou8 condition, which was
leading to accuracy failures.
[AMD-Internal: CPUPL-7283 ]
- Removed duplicate calls to BATCH_GEMM_CHECK().
- Refactored freeing of post-op pointer in bench code and verified the
functionality.
- Modified indexing of the array to take the correct values.
Fixed a bug in some bench applications where the GFLOPS computation ran into integer overflow because explicit type casting to double was not done in the computation.
Removed all multiplies by 1.0 during the GFLOPS computation.
AMD-Internal: CPUPL-7016
---------
Co-authored-by: Rayan <rohrayan@amd.com>
Updated the poly GELU Erf precision to double to keep the error within the 1e-5 limit when compared to the reference gelu_erf; this also increased the compute to 2x compared to float.
AMD-Internal: SWLCSG-3551
* Updated aocl_batch_gemm_ APIs aligning to CBLAS batch API.
- Modified Batch-Gemm API to align with cblas_?gemm_batch_ API,
and added a parameter group_size to the existing APIs.
- Updated bench batch_gemm code to align to the new API definition.
- Modified the hardcoded number in lpgemm_postop file.
- Added necessary early return condition to account for group_count/group_size < 0.
AMD-Internal: [ SWLCSG - 3592 ]
- In U8 GEMV n=1 kernels, the default ZP condition was the S8 ZP type,
which leads to accuracy issues when the u8s8s32u8 API is used.
- A few modifications in the bench code to take the correct path for the
accuracy check.
Details:
- In FP32 GEMM, when threading is disabled, rntm_pack_a and rntm_pack_b
were set to true by default. This leads to perf regression for smaller
sizes. Modified FP32 interface API to not overwrite the packA and
packB variables in rntm structure.
- In FP32 GEMV, removed the decision-making code based on mtag_A/B
and should_pack_A/B for packing. Matrices will be packed only
if their storage format doesn't match the storage format required
by the kernel.
- Changed the control flow to check whether the mtag value is
"reordered", "to-be-packed", or "unpacked", checking for "reorder"
first, followed by "pack". This ensures that packing doesn't happen
when the matrix is already reordered, even if the user forces packing
by setting "BLIS_PACK_A/B".
- Modified the python script to generate testcases based on block sizes
AMD-Internal: SWLCSG-3527
Support for S32 Zero point type is added for aocl_gemm_s8s8s32os32_sym_quant
Support for BF16 scale factors type is added for aocl_gemm_s8s8s32os32_sym_quant
U8 buffer type support is added for matadd, matmul, bias post-ops in all int8 APIs.
AMD-Internal: SWLCSG-3503
Since the GNU extensions were removed, executables in the bench directory cannot be built correctly.
The fix is adding "-D_POSIX_C_SOURCE=200112L" on those targets. When -std=gnu99 was used,
bench worked without this flag, but that is no longer the case since we switched to -std=c99.
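A hypothetical Makefile fragment illustrating the fix (the variable layout is illustrative, not the bench Makefile's actual structure); under -std=c99 the POSIX declarations (e.g. posix_memalign) are hidden unless the feature-test macro is defined explicitly, whereas -std=gnu99 exposed them implicitly:

```make
# Strict C99 hides POSIX APIs; request the POSIX.1-2001 feature set
# explicitly so the bench sources still see those declarations.
CFLAGS := -std=c99 -D_POSIX_C_SOURCE=200112L $(CFLAGS)
```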
Description
1. In the cases of clip, swish, and relu_scale, constants were previously
loaded as float. However, they are of the C type, so handling has been
adjusted: for integers, these constants are first loaded as integers
and then converted to float.
Change-Id: I176b805b69679df42be5745b6306f75e23de274d
- Currently the int8/uint8 APIs do not support multiple ZP types,
but work only with the int8 type or uint8 type.
- The support is added to enable multiple zp types in these kernels
and added additional macros to support the operations.
- Modified the bench downscale reference code to support the updated
types.
AMD-Internal : [ SWLCSG-3304 ]
Change-Id: Ia5e40ee3705a38d09262086d20731e8f0a126987
- Added FP32 RD (dot-product) kernels for both, AVX512 and AVX2 ISAs.
- The FP32 AVX512 primary RD kernel has blocking of dimensions 6x64
(MRxNR) whereas it is 6x16 (MRxNR) for the AVX2 primary RD kernel.
- Updated the f32 framework to accommodate the RD kernels in the case
of B transpose, with thresholds
- Updated data gen python script
TODO:
- Post-Ops not yet supported.
Change-Id: Ibf282741f58a1446321273d5b8044db993f23714
- Updated the S8 main, GEMV, m_, n_ and mn_ fringe kernels to support
multiple scale types for vector and scalar scales
- Updated the U8 main, GEMV, m_, n_, extMR_ and mn_ fringe kernels to
support multiple scale types for vector and scalar scales
- Updated the bench to accommodate multiple scale type input, and
modified the downscale_accuracy_check_ to verify with multiple scale
type inputs.
AMD Internal: [ SWLCSG-3304 ]
Change-Id: I7b9f3ec8ea830d3265f72d18a0aa36086e14a86e
Details:
- Setting post_op_grp to NULL at the start of the post-op creator to
ensure that there is no junk (non-NULL) value, which might lead to
the destroyer trying to free non-allocated buffers.
AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
Details:
- Group quantization is a technique to improve accuracy where the
scale factors used to quantize inputs and weights vary at the
group level instead of at the per-channel and per-tensor level.
- Added new bench files to test GEMM with symmetric static
quantization.
- Added new get_size and reorder functions to account for
storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.
AMD-Internal:[SWLCSG-3274]
Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
Details:
- Fixed the logic to identify an API that has int4 weights in
bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in the pack functions of the
zen4 kernels and replaced them with masked load instructions.
This ensures that the load register is populated with zeroes
at locations where the mask is set to zero.
Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
- Currently the scale factor is loaded without using a mask in the downscale
and matrix add/mul ops in the F32 eltwise kernels. This results in
out-of-bounds memory reads when n is not a multiple of NR (64).
- The loads are updated to masked loads to fix the same.
AMD-Internal: [SWLCSG-3390]
Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
Description
- Zero point support for <s32/s8/bf16/u8> datatype in element-wise
postop only f32o<f32/s8/u8/s32/bf16> APIs.
AMD-Internal: [SWLCSG-3390]
Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
Description
Due to the different datatypes used for the zero point during post-op
creation and the accuracy check, we see an accuracy issue for the
u8/s8s8s32 APIs with output type f32/bf16.
AMD-Internal: [CPUPL-6456]
Change-Id: If8925988841af87cb5687c84aade607967c744fe
Description:
1. When the GCC compiler version is less than 11.2, a few BF16
instructions are not supported by the compiler even though the zen4
and zen5 processor architectures support them.
2. These instructions are now guarded with a macro.
Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
- Currently the BF16 kernels use the AVX512 VNNI instructions.
In order to support AVX2 kernels, the BF16 input has to be converted
to F32 and then the F32 kernels have to be executed.
- Added un-pack function for the B-Matrix, which does the unpacking of
the Re-ordered BF16 B-Matrix and converts it to Float.
- Added a kernel to convert the matrix data from BF16 to F32 for the
given input.
- Added a new path to the BF16 5LOOP to work with the BF16 data, where
the packed/unpacked A matrix is converted from BF16 to F32. The
packed B matrix is converted from BF16 to F32, and the re-ordered B
matrix is un-reordered and converted to F32 before feeding to the
F32 micro kernels.
- Removed AVX512 condition checks in BF16 code path.
- Added the Re-order reference code path to support BF16 AVX2.
- Currently the F32 AVX2 kernels support only F32 BIAS.
Added BF16 support for the BIAS post-op in the F32 AVX2 kernels.
- Bug fix in the test input generation script.
AMD Internal : [SWLCSG - 3281]
Change-Id: I1f9d59bfae4d874bf9fdab9bcfec5da91eadb0fb
Description:
1. Added new output types for the f32 element-wise APIs to support
s8, u8, s32, bf16 outputs.
2. Updated the base f32 API to support all the post-ops supported in
gemm API's
AMD Internal: [SWLCSG-3384]
Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Currently the BF16 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
- In order to address this, a new path without the OMP loop and with
just the NR loop over the micro-kernel is introduced for tiny inputs.
This is only applied when the number of threads set for GEMM is 1.
- Only row-major inputs are allowed to proceed with tiny GEMM.
AMD-Internal: [SWLCSG-3380, SWLCSG-3258]
Change-Id: I9dfa6b130f3c597ca7fcf5f1bc1231faf39de031
Details:
- Added a new python script that can test all microkernels
along with post-ops.
- Modified post_op freeing function to avoid memory leaks.
Change-Id: Iedba84e8233a88ca9261596c4c7e0a65c196b7e7
- Currently the F32 API uses the 5-loop algorithm inside the OMP loop
to compute the results, irrespective of the input sizes. However, it
was observed that for very tiny sizes (n <= 128, m <= 36), this OMP
loop and the NC, MC, KC loops were turning out to be overheads.
- In order to address this, a new path without the OMP loop and with
just the NR loop over the micro-kernel is introduced for tiny inputs.
This is only applied when the number of threads set for GEMM is 1.
AMD-Internal: [SWLCSG-3380]
Change-Id: Ia712a0df19206b57efe4c97e9764d4b37ad7e275
- Updated the format specifiers to have a leading space,
in order to delimit the outputs appropriately in the
output file.
- Further updated every source file to have a leading space
in its format string occurring after the macros.
AMD-Internal: [CPUPL-5895]
Change-Id: If856f55363bb811de0be6fdd1d7bbc8ec5c76c15
Description:
1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. All the post-ops are updated to operate on f32
by converting the s32 accumulator registers to float at the end of
the k loop.
2. Added the s8s8s32ou8 API, which uses the s8s8s32os32 kernels but
stores the output in u8
AMD-Internal - SWLCSG-3366
Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
- Modified bench to support testing of different types of buffers
for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
type.
Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
- Currently, when m is small compared to n, even if the MR blocks (m / MR) > 1
and the total work blocks (MR blks * NR blks) < available threads, the
number of threads assigned to the m dimension (ic ways) is 1. This results
in subpar performance in bandwidth-bound cases. To address this, the
thread factorization is updated to increase ic ways for these cases.
AMD-Internal: [SWLCSG-3333]
Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9
- Implemented the feature to benchmark ?ASUMV APIs
for the supported datatypes. The feature allows to
benchmark BLAS, CBLAS or the native BLIS API, based
on the macro definition.
- Added a sample input file to provide examples to benchmark
ASUMV for all its datatype supports.
AMD-Internal: [CPUPL-5984]
Change-Id: Iff512166545687d12504babda1bd52d71a3a5755
- Corrected the format specifier setting (as a macro) to not
include additional spaces, since this would cause incorrect
parsing of input files (in case they have exactly the expected
number of parameters and not more).
- Updated the inputgemm.txt file to contain some inputs that
have the exact parameters, to validate this fix.
AMD-Internal: [CPUPL-6365]
Change-Id: Ie9a83d4ed7e750ff1380d00c9c182b0c9ed42c49
Description:
1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. All the post-ops are updated to operate on f32
by converting the s32 accumulator registers to float at the end of
the k loop.
2. Added the u8s8s32ou8 API, which uses the u8s8s32os32 kernels but
stores the output in u8
AMD-Internal - SWLCSG-3366
Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
- Bug : When configuring our library with the native
        BLIS integer size being 32, the bench application
        would crash or read an invalid value when parsing
        the input file. This is because of a mismatch in
        the format specifier that we hard-set in the
        Makefile.
- Fix : Defined a header that sets the format specifiers
        as macros with the correct matching, based on how we
        configure and build the library. This header is
        expected to be included in every benchmarking
        source file.
AMD-Internal: [CPUPL-5895]
Change-Id: I9718c36a1a9fe3eba4d5da419823c16097902d89
Description:
1. Added u8s8s32of32,u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs.
Where the inputs are uint8/int8 and the processing is done using
VNNI but the output is stored in f32 and bf16 formats. All the int8
kernels are reused and updated with the new output data types.
2. Added F32 data type support in bias.
3. Updated the bench and bench input file to support validation.
AMD-Internal: SWLCSG-3335
Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
- When the A matrix is packed, it is packed in blocks of MRxKC to form a
whole packed MCxKC block. If the m value is not a multiple of MR, then
the m % MR block is packed in a different manner from the MR blocks.
Consequently, the strides of the packed MR blocks and the m % MR block
differ, and the same needs to be updated when calling the GEMV kernels
with a packed A matrix.
- Fixes to address compiler warnings.
AMD-Internal: [SWLCSG-3359]
Change-Id: I7f47afbc9cd92536cb375431d74d9b8bca7bab44
Details:
- Disabled intrinsics code of f32obf16 pack function
for gcc < 11.2 as the instructions used in kernels
are not supported by the compiler versions.
- Added an early-return check for WOQ APIs when compiling with
  gcc < 11.2
- Fixed code to check whether JIT kernels are generated inside
batch_gemm API for bf16 datatype.
AMD Internal: [CPUPL-6327]
Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
Description:
1. The bias type was previously supported only based on the output data type.
2. An option is added in the pre-ops structure to select the bias data
type (s8/s32/bf16) irrespective of the storage data type in the
u8s8s32/s8s8s32 APIs.
AMD-Internal: SWLCSG-3302
Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
Details:
- The batch matmul performs a series of matmuls, processing
more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
different parameters including dims,post-ops,stor-schemes etc.,
- This operation is optimized for problems where all the
GEMMs in a batch are of the same size and shape.
- For now, the threads are distributed equally among the different
GEMM problems irrespective of their dimensions, which leads to
better performance for batches with identical GEMMs but performs
sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs are in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.
AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
-As it stands the buffer type in matrix add|mul post-ops is expected to
be the same as that of the output C matrix type. This limitation is now
removed and user can specify the buffer type by setting the stor_type
attribute in add|mul post-op struct. As of now int8, int32, bfloat16 and
float types are supported for the buffer in s32 micro-kernels. The same
support is also added for bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in bias post-ops.
AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels
- Masked the bf16s4 kernel call while JIT kernels are enabled, to avoid a compile-time error.
- Added the optional support for B-prefetch in the JIT kernels
- Resolved the visibility issues in global variable jit_krnels_generated
- Modified the array generation for scale and zp values in the bench
Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d