-In BLIS, the CBLAS interface is implemented as a wrapper around
 the BLAS interface. For example, the CBLAS API ‘cblas_dscal’
 internally invokes the BLAS API ‘dscal_’.
-This coupling between the CBLAS and BLAS interfaces prevents the
 application or other libraries from overriding them individually.
-This change separates the CBLAS and BLAS implementations by adding
 an additional level of abstraction. The implementation of the
 API is moved to a new function which is invoked directly from
 the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
-In BLIS, the CBLAS interface is implemented as a wrapper around
 the BLAS interface. For example, the CBLAS API ‘cblas_dgemv’
 internally invokes the BLAS API ‘dgemv_’.
-This coupling between the CBLAS and BLAS interfaces prevents the
 application or other libraries from overriding them individually.
-This change separates the CBLAS and BLAS implementations by adding
 an additional level of abstraction. The implementation of the
 API is moved to a new function which is invoked directly from
 the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
-In BLIS, the CBLAS interface is implemented as a wrapper around
 the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
 internally invokes the BLAS API ‘dgemm_’.
-This coupling between the CBLAS and BLAS interfaces prevents the
 application or other libraries from overriding them individually.
-This change separates the CBLAS and BLAS implementations by adding
 an additional level of abstraction. The implementation of the
 API is moved to a new function which is invoked directly from
 the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
- BLIS uses a callback function to report the progress of an
  operation. The callback is implemented in the user application
  and is invoked by BLIS.
- Updated the callback function prototype to make all arguments const.
  This ensures that any attempt to write through the callback's
  arguments is caught at compile time.
AMD-Internal: [CPUPL-2504]
Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667
- The temporary buffer allocated for the C matrix when downscaling is
enabled was not filled properly. This resulted in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix this.
- The low precision gemm bench is updated for testing/benchmarking downscaling.
AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
Introduced multi-threaded panel-based load balancing for BF16 to
improve the overall MT performance.
AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
Remove the test on is_arch in bli_cpuid_is_zen4() so that it reports true
for all Zen4 models, based on the family and AVX512 feature tests.
AMD-Internal: [CPUPL-2474]
Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72
- BF16 instruction output is accumulated at the higher precision of
FP32, which needs to be converted to the lower precision of bf16
after the GEMM operation. This is required in AI workloads where
both input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.
Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
- Intermediate int16 C matrix values are converted to int32, and
the int32 values are then converted to fp32. Scaling is performed
on these fp32 values.
- The resultant value is downscaled to int8 and stored in a
separate buffer.
AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
Details:
1. On zen4, the dgemm and sgemm native kernels are column-preferring,
while the cgemm and zgemm kernels are row-preferring. zen3 and older
archs use row-preferring kernels for all datatypes. The induced
transpose was carried out based on a kernel preference check in both
the crc and ccr cases; this kernel preference check is not required
to apply the induced transpose.
2. In the ccr case, the B matrix is real. We must induce a
transposition and perform C += A*B (crc), where A (formerly B) is real.
3. In the crc case, the A matrix is real. No induced transpose is
required.
AMD-Internal: [CPUPL-2440] [CPUPL-2449]
Change-Id: I44c53a20c8def7ddbb84797ba20260acec2086a2
Details:
1. Changes are made in the ztrsm small MT path to avoid accuracy
issues reported in libflame tests.
AMD-Internal: [CPUPL-2476]
Change-Id: Ic279106343fb1744e89ff4c920023adbe1d0158a
Functionality - Post-ops are a set of operations performed
element-wise on the output matrix after the GEMM operation. Support
for them is added by fusing the post-ops with the GEMM operation.
- The post-ops Bias, ReLU and Parametric ReLU are added to all the
compute kernels of bf16bf16f32of32
- Modified the bf16 interface files to add a check for bf16 ISA support
Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
-OpenMP-based multi-threading support added for BFloat16 gemm.
Both the gemm and reorder APIs are parallelized.
-Multi-threading support for the u8s8s16 reorder API.
-Typecast issues fixed in the bfloat16 gemm kernels.
AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
Details:
1. In the sgemmsup_zen_rv_?x2 kernels, the "vmovps" instruction
was used to load the B matrix in the k loop and the k last loop,
loading 128 bits into an xmm register rather than the expected
64 bits.
2. Changed the vmovps instruction to vmovsd instructions,
which load only 64 bits into the xmm register.
3. Avoided C memory access by the vfma instruction when multiplying
by beta in corner cases, which required a 128-bit access and could
go out of bounds. Replaced it with a vmovq to first load the
64-bit data, then performed vfma on the xmm register in rv_6x8m
and rv_6x4m.
AMD-Internal: [CPUPL-2472]
Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
only defined in header file if BLIS_ENABLE_OPENMP is defined
AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.
AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn't allow it.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
files
AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn't allow it and
may result in recursion.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
- Performance of u8s8s16os16 came down by 40% after the
introduction of post-ops.
- Analysis revealed that the target compiler assumed a false
dependency and was generating sub-optimal code due to the
post-ops structure.
- Inserted vzeroupper to hint to the compiler that no ISA change
will occur.
AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
- In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
- If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn't allow it and
may result in recursion.
- This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I8380b6468683028035f2aece48916939e0fede8a
->In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
->If the end user wants to use different libraries for CBLAS
and BLAS, the current implementation of BLIS doesn't allow it and
may result in recursion.
->This change separates the CBLAS and BLAS implementations by adding
an additional level of abstraction. The implementation of the
API is moved to a new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
Details:
1. Fixed issues with out-of-bound memory access during load and
store.
2. In the intrinsic code:
i. AVX2 registers can hold 4 double elements.
ii. In the remainder case, the number of elements is less than the
vector register width. Although fewer than 4 elements are required,
we were reading and writing in chunks of 4 elements due to
vectorization, which can cause out-of-bound memory access.
3. Redesigned the code to prevent out-of-bound access by loading and
storing exactly the number of elements required.
AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
- Checking for AVX512 bfloat16 instruction support in the
architecture using CPUID
AMD-Internal: [CPUPL-2446]
Change-Id: I088a8aa46b037af837b2e58a96b59eae70c1dbf0
-Parametric ReLU is a generalization of leaky ReLU in which the
leakage coefficient is tunable. Support for it is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.
AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
Feature Addition: Added a new variant of low-precision GEMM to the
addon - BFloat16. The kernel takes bf16-type inputs and performs
BF16 GEMM operations. The intermediate accumulation and output are
in float.
1. Compute kernels will perform computations only if the B matrix is
reordered in accordance with the usage of the AVX-512 BF16
instruction dpbf16_ps
2. A kernel for packing the B matrix is provided
Change-Id: If5d08213068869eff060c9998596d2d2703a6793
Details:
1. Optimized ztrsm for small sizes up to 500 in multi-threaded
scenarios.
2. Enabled multithreaded execution of the bli_trsm_small
implementation for the double complex datatype.
3. Added decision logic to choose between the native and the
multi-threaded small path for sizes up to 500 and threads up to 8.
AMD-Internal: [CPUPL-2340]
Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1
1. On zen4, the dgemm and sgemm native kernels are column-preferring,
while the cgemm and zgemm native kernels are row-preferring. zen3 and
older archs use row-preferring kernels for all datatypes, hence the
induced transpose was carried out based on a kernel preference check
alone. Added a condition so that the output matrix storage format is
checked along with the kernel preference, to avoid an unnecessary
induced transpose on zen4.
2. Added the functions bli_cntx_l3_vir_ukr_dislikes_storage_of_md and
bli_cntx_l3_vir_ukr_prefers_storage_of_md for checking the output
matrix storage format against the micro-kernel preference for mixed
datatypes.
AMD-Internal: [CPUPL-2347]
Change-Id: Ib77676f4e2152f7876ad7dc91de716547f5ba3a5
Details:
1. For lower and upper, "B" column major storage variants of gemmt,
new kernels are developed and optimized to compute only the
required outputs in the diagonal blocks.
2. In the previous implementation, all 48 outputs of a given
6x8 block of the C matrix were computed and stored into a temporary
buffer. Later, the required elements were copied into the final C
output buffer.
3. Changes are made to compute only the required outputs of the 6x8
block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.
AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
Make BLIS_ARCH_TYPE=0 an error, so that incorrect architecture names
produce an error rather than silently selecting the "skx" code path.
BLIS_ARCH_TYPE=1 is now "generic", so that it should remain constant
as new code paths are added. Thus all other code path enum values
have increased by 2.
Also added new options to BLIS configure program to allow:
1. BLIS_ARCH_TYPE functionality to be disabled, e.g.:
./configure --disable-blis-arch-type amdzen
2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a
specified value, e.g.:
./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen
On Windows, these can be enabled with e.g.:
cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON
or
cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE
This implements changes 2 and 3 in the Jira ticket below.
AMD-Internal: [CPUPL-2235]
Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4
Details:
1. The Runtime Thread Control feature is enhanced to create a
provision for the application to allocate a different number of
threads to BLIS from the number of threads the application uses
for itself.
2. In the previous implementation, if the application set
BLIS_NUM_THREADS to a valid value, BLIS internally called the
omp_set_num_threads() API with the same value. Due to this, the
application could not differentiate between the number of threads
used by the BLIS library and by the application itself.
3. With the current solution, if the application wants to allocate a
different number of threads for BLIS APIs than for itself, it can
use either the BLIS_NUM_THREADS environment variable or the
bli_thread_set_num_threads(nt) API for BLIS,
and OpenMP APIs or environment variables for itself, respectively.
4. If BLIS_NUM_THREADS is set to a valid value, the same value
will be used in subsequent parallel regions unless the
bli_thread_set_num_threads() API is used by the application
to modify the desired number of threads during BLIS API execution.
5. Once the BLIS_NUM_THREADS environment variable or the
bli_thread_set_num_threads(nt) API is used by the application,
BLIS will always give precedence to these values. BLIS APIs will
not consider values set using the OpenMP omp_set_num_threads(nt) API
or the OMP_NUM_THREADS environment variable.
6. If BLIS_NUM_THREADS is not set, and the application is multithreaded
and has issued omp_set_num_threads(nt) with the desired number of threads,
the omp_get_max_threads() API will fetch the number of threads set earlier.
7. If BLIS_NUM_THREADS is not set and omp_set_num_threads(nt) is not
called by the application, but OMP_NUM_THREADS is set,
the omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS.
8. If neither environment variable is set, or if they are set to
invalid values, and omp_set_num_threads(nt) is not issued by the
application, the omp_get_max_threads() API will return the number of
cores in the current context.
9. BLIS will initialize rntm->num_threads with the same value.
However, if omp_set_nested is false, BLIS APIs called from parallel
threads will run sequentially; if nested parallelism is enabled,
then each application thread will launch multithreaded BLIS.
10. Order of precedence used for number of threads:
0. value set using bli_thread_set_num_threads(nt) by the application
1. valid value set for BLIS_NUM_THREADS environment variable
2. omp_set_num_threads(nt) issued by the application
3. valid value set for OMP_NUM_THREADS environment variable
4. Number of cores
11. If nt is not a valid value for the omp_set_num_threads(nt) API,
the number of threads will be set to 1, and the
omp_get_max_threads() API will return 1.
12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled.
AMD-Internal: [CPUPL-2342]
Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6
-Updated the optimal thread count in the zgemm sup path for skinny
matrices.
-Fine-tuned the threshold values for the small and sup paths
to improve overall zgemm performance.
-Zgemm small is selected for inputs with transb = N.
-The redirection of inputs among the small, sup and native paths
was fine-tuned.
AMD-Internal : [CPUPL-1900]
Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f
Details:
1. Optimized the kernels by replacing the macros with
the actual computation of required output elements.
AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
Details:
1. In kernels for non-transpose variants, changes
are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
GTestSuite(Functionality, Valgrind, Integer Tests)
and Netlib Tests.
3. Fixed warnings during the build process.
AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
Implement netlib BLAS style quick return functionality for when no
work is required. Similar functionality was already in HERK and HER2K
routines.
AMD copyrights updated.
AMD-Internal: [CPUPL-2373]
Change-Id: I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c
-sup GEMM has 2 variants: var2m (block-panel) and var1n (panel-block).
We added decision logic to choose between var1n and var2m for
single-threaded SGEMM. var1n is the favorable option when "n" is very
large compared to "m".
-Also fixed a bug related to fetching the "MR" and "NR" values in
bli_gemmsup_int(): we replaced "bli_cntx_get_blksz_def_dt()" (used
for the native path) with "bli_cntx_get_l3_sup_blksz_def_dt()".
AMD-Internal: [CPUPL-2406]
Change-Id: If36529015b1c5f8f87eb40c05ebcf433c471d4d5
- In DGEMM, when m, n and the leading dimensions are large,
packing of the A and B matrices is required for optimal performance.
- Modified the decision logic to choose between the sup and native
paths; now, apart from the matrix dimensions, we also incorporate
the matrix leading dimensions into this decision.
AMD-Internal: [CPUPL-2366]
Change-Id: I255db5f7049d783e22d7c912edf8bbf023e32ed8
Details:
- When work distributed for each thread is larger than caches,
it is advisable to perform packing of B for sup dgemm.
- Work distribution per thread is calculated based on the values
of jc_nt and ic_nt.
- For RRC and CRC cases we want to avoid rd kernels which are not
efficient in performance compared to rv kernels. Therefore we
perform packing of A as well so that rv kernels are invoked for
these cases.
- These changes result in improved DGEMM performance.
- Dynamic packing is done using the API "bli_rntm_set_pack_b( 1, rntm )"
Change-Id: I8344520b4a2591e57518bb54183a15957f60f94b
- We prefetch the next panel while packing an 8xk panel.
- Modified the prefetch offsets for the dgemm native and
dgemm_small kernels.
AMD-Internal: [CPUPL-2366]
Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
1. Fixed out-of-bound memory accesses in corner cases caused by
extra loads, in all datatypes (s,d,c,z).
2. In remainder cases, instead of loading only the required number
of elements, extra elements were loaded, leading to out-of-bound
accesses. Fixed by restricting the loads to the required number
of elements.
AMD-Internal: [CPUPL-2280]
Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.
AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3