- Broke down the KR loop inside the compute kernel into
two pieces
- Added C matrix prefetch between the two decomposed
pieces of KR loop
AMD-Internal: [CPUPL-2693]
Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e
- NC value is adjusted based on the n dimension of the B
matrix.
- KC value is adjusted based on the k dimension input
matrices.
- In order to ensure a handshake between the reorder
function and compute function, the same dynamic block
tuning logic is called in both places.
- Reorder followed by compute scenarios mean that tuning
KC value based on MC value and vice versa is not feasible.
AMD-Internal: [CPUPL-2638]
Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429
-Enabled the sup path for zgemm in case of single thread.
-Fine tuned the thresholds for small path in order to handle
certain spectrum of skinny matrices.
-The inputs are now redirected among small, sup and native
paths in case of single thread.
AMD-Internal: [CPUPL-2632]
Change-Id: I3eca315bcb8bfa8e015349a7b2bc4ebd5448e065
1. Check OpenMP active level against max active levels when setting
number of threads for starting a new parallel region in
./frame/thread/bli_thread.c to ensure the correct number of threads
is used when BLIS is called within nested OpenMP parallelism.
2. In subsequent BLIS calls, threading choices could be incorrectly
set based on values used and stored in global_rntm by a previous
call. This could apply when the OpenMP number of threads differ from
call to call, different nested parallelism is used in different
parts of a user's code, or different threads at the user level
request different numbers of OpenMP threads for BLIS calls.
Keep threading information in both global_rntm and a new Thread
Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
environment (as appropriate) during bli_init_auto() calls in each
BLIS routine. The details are:
* global_rntm is initialized on first BLIS call based on OpenMP and
BLIS threading environment variables.
* global_rntm is updated by any BLIS threading function calls.
* In bli_thread_update_tl(), called by bli_init_auto(), sync with
any BLIS values set or updated in global_rntm. Then, if BLIS
threading control is not used, check OpenMP ICVs and set thread
count and auto_factor appropriately.
* Setting BLIS threading locally (using expert interfaces to pass
a user defined rntm data structure) should work as before.
3. bli_thread_get_is_parallel can now only be called outside of
parallelism within BLIS routines. Change calls in trsm to reflect
this.
4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
if any BLIS_*_NT environment variables are set.
5. Set auto_factor = FALSE when the number of threads is 1.
6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.
7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().
8. For debugging, internal information on the rntm threading data can
be printed by defining "PRINT_THREADING" at the top of bli_rntm.h
9. bli_rntm_print() now also prints the value of blis_mt.
10. Function prototypes in bli_rntm.h moved to top of file, so that
bli_rntm_print() can be used within inline functions defined in
this header file.
11. Comment out bli_init_auto() and bli_finalize_auto() calls in
Fortran interfaces in frame/compat/blis/thread/b77_thread.c
12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
set_pack_b functions outside of the auto_factor if statements.
13. Misc code tidying.
AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
-Disabled the sup path for zgemm with single thread temporarily
until we tune the performance and update the thresholds.
The redirection of inputs in zgemm is now among small and native path
for single thread.
-Fine tuned the input size constraints for entry to small path in zgemm
in case of single thread.
AMD-Internal: [CPUPL-2495]
Change-Id: Ifa19116757ca132d9636c3f2a0661524d2c1d3ec
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]
AMD-Internal: [LWPZENDNN-493]
Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
1. Fixed accuracy issues in dtrsm small multi-thread.
2. Fixed out of bound memory accesses in the patch
"Fixes to avoid Out of Bound Memory Access in TRSM small algorithm"
3. Re-enabled DTRSM small MT with above fixes.
AMD-Internal: [CPUPL-2567]
Change-Id: Ibf2949b25fde4007a92cc635526bed0e0d897800
Description:
- When the value of the result in s8 for u8s8s32 and u8s8s16 are
close to 0. Values are getting ceiled to 1.
- Used nearbyintf to round the downscaled values in bench reference.
This fixed the result mismatch issue between the vectorized kernel
implementation and reference implementation in bench accuracy test.
AMD-Internal: [CPUPL-2617]
Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82
Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in
the netlib LAPACK tests, specifically in:
./xlintstd < ../dtest.in > dtest.out
in the TESTING/LIN directory. Given time constraints, i.e. the need to
finalize code for AOCL 4.0 release, disable calls to AVX512 kernel
(i.e. always use the AVX2 kernel) for now, and aim to correct
bli_damaxv_zen_int_avx512 for AOCL 4.1.
AMD-Internal: [CPUPL-2590]
Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.
AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
Description:
1. n value per thread and offset are defined outside omp loop
which can vary for each thread. Moved them to inside the
omp loop so that the output
2. Fixed few compilation warnings in dnrm2 avx2 implemenation
AMD Internal:[CPUPL-2606]
Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
Address sanitizer reports error when rbp regitser is modified.
Register rbp was stored with rs_a which was used during prefetch
of Matrix A. Usage of rbp is avoided by using rcx register as a
temporary storage register.
Hence rcx is updated with Matrix C address before storing the
computed data.
This fix address the issue reported by GEQP3 API of libflame
AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
- Cleaned up some white space in the AVX2 implementation for DNRM2.
AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
Description:
1. Calculate the optimal number of threads required based on input
dimension.
2. Request dynamic memory from memory pool to store the each
thread output.
3. Create optimal number of threads and divide the given input data
among all threads.
4. Accumulate the results in rho variable.
5. Free the dynamically allocated memory back to memory pool.
AMD-Internal: [CPUPL-2560]
Change-Id: I30982b622d531b29d3570b2f5f7569bd1d951d68
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.
AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
Details:
1. Fixed the memory access paritial overflows for the variables
AlphaVal,ones reported by ASAN.
2. Using 128 bit packed broadcast with the 64 bit data types
after type casting would cause the garbage data to be filled
in the destination register.
3. Fixed this issue by using set_ps instruction instead of broadcast.
4. In cases of n remainder being 1, extra elements were accessed that
could cause out of memory access. Removed the extra element access.
AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.
AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
1. Addressed uninitialized variables reported in coverity for all
datatypes of trsm small algo.
AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
- Added multithreaded support for dscalv.
- Optimal number of threads is calculated based on the input size.
AMD-Internal: [CPUPL-2583]
Change-Id: I0a253c28cb389a4f5214439dbca62304888ca5ae
Details:
- Updated Makefile and common.mk so that the targeted configuration's
kernel CFLAGS are applied to source files that are found in a
'kernels' subdirectory within an enabled addon. For now, this
behavior only applies when the 'kernels' directory is at the top
level of the addon directory structure. For example, if there is an
addon named 'foobar', the source code must be located in
addon/foobar/kernels/ in order for it to be compiled with the target
configurations's kernel CFLAGS. Any other source code within
addon/foobar/ will be compiled with general-purpose CFLAGS (the same
ones that were used on all addon code prior to this commit). Thanks
to AMD (esp. Mithun Mohan) for suggesting this change and catching an
intermediate bug in the PR.
- Comment/whitespace updates.
(cherry picked from commit fd885cf98f)
Change-Id: I9a678f78bde90b23a6293ce90377004876f51067
The local_mem_s is allocated on stack but in the inner scop of the
“if” block, however, it can be accessed through cntl_mem_p outside
the if block. This error is flagged by address sanitizer. Fixed this
issue by moving the variables declaration at the functions scope.
This fix address the issue reported for follwoing libflame APIs
geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd
AMD-Internal: [CPUPL-2587]
Change-Id: I63749c7d406c7339d2b45b0488108ccd3f90a248
1. Correcting the type of alpha, and beta values from C_type
(output type) to accumulation type.
For the downscaled LPGEMM APIs, C_type will be the downscaled
type but the required type for alpha and beta values should
be the accumulation type.
2. BF16 bench integration with the LPGEMM bench for both the BF16
(bf16bf16f32of32 and bf16bf16f32obf16) APIs
AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
The local_mem_s is allocated on stack but in the inner scop of the
“if” block, however, it can be accessed through cntl_mem_p outside
the if block. This error is flagged by address sanitizer. Fixed this
issue by moving the variables declaration at the functions scope.
This fix address the issue reported for follwoing libflame APIs
geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd
AMD-Internal: [CPUPL-2587]
Change-Id: Ib1e9f89bc4b6911e4b7e9b910d1ecc2ba00b286a
1. Corrected B buffer accessing to access by its offset instead of
starting address which is required incase of MT.
3. When num_threads > 1, B buffer is divided in to blocks in m or n
dimension based on side right or left. Hence need to access by its
offset to access starting of the block.
4. Currently B Matrix is divided in to blocks for each thread and
complete matrix A is used by all threads.
Incase of design change in future, modified A buffer accessing by its
offset to support partition of matrix A for MT
AMD-Internal:[CPUPL-2520]
Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
- Updated zen4 configuration to enable AVX512 flags for the
reference kernels
- Reference and vector kernels will use the same compiler flags
AMD-Internal: [CPUPL-2533]
Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8
- Removed some read code from the macros for downscale
- Store permute correction
- Simplified macros for edge cases and corrected intermediate
operation
AMD-Internal:[CPUPL-2171]
Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.
AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
- Removed prototypes for float(sgemm3m) and double(dgemm3m) types,
as BLIS implements this API only for scomplex(cgemm3m) and
dcomplex(zgemm3m)
AMD-Internal: [SWLCSG-1477]
Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.
- Scaling is used to avoid overflow and underflow.
- Works correctly for negative increments.
AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
Details:
1. Changes are made in dtrsm small MT path,to avoid accuracy issues.
AMD-Internal: [SWLCSG-1470]
Change-Id: I65237225892f97b7222fe71f66b02841b5956560
-In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example the CBLAS API ‘cblas_dscal’
internally invokes the BLAS API ‘dscal_’.
-This coupling between CBLAS and BLAS interface prevents the end
user from overriding them individually by the application or
other libraries.
-This change separates the CBLAS and BLAS implementation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
-In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example the CBLAS API ‘cblas_dgemv’
internally invokes the BLAS API ‘dgemv_’.
-This coupling between CBLAS and BLAS interface prevents the end
user from overriding them individually by the application or
other libraries.
-This change separates the CBLAS and BLAS implementation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
-In BLIS, the CBLAS interface is implemented as a wrapper around
the BLAS interface. For example the CBLAS API ‘cblas_dgemm’
internally invokes the BLAS API ‘dgemm_’.
-This coupling between CBLAS and BLAS interface prevents the end
user from overriding them individually by the application or
other libraries.
-This change separates the CBLAS and BLAS implementation by adding
an additional level of abstraction. The implementation of the
API is moved to the new function which is invoked directly from
the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
- BLIS uses callback function to report the progress of the
operation. The callback is implemented in the user application
and is invoked by BLIS.
- Updated callback function prototype to make all arguments const.
This will ensure that any attempt to write using callback’s
argument is prevented at the compile time itself.
AMD-Internal: [CPUPL-2504]
Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667
- The temporary buffer allocated for C matrix when downscaling is
enabled is not filled properly. This results in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix it.
- Low precision gemm bench updated for testing/benchmarking downscaling.
AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
Introduced multi-thread panel based balancing for BF16 to improve the
overall MT performance.
AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
Remove test on is_arch in bli_cpuid_is_zen4() so it will report true for
all Zen4 models, based on family and AVX512 feature tests.
AMD-Internal: [CPUPL-2474]
Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72
- BF16 instructions output is accumulated at a higher precision of
FP32 which needs to be converted to a lower precison of bf16 post
the GEMM operations. This is required in AI workloads where both
input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.
Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
- int16 c matrix intermediate values are converted to int32,
then the int32 values are converted to fp32. On these fp32
values scaling is done
- The resultant value is down scaled to int8 and stored in a
separate buffer
AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.
AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
Details:
1. In zen4 dgemm and sgemm native kernels are column-prefer kernels,
cgemm and zgemm kernels are row-prefer kernels. zen3 and older arch
(uses row-prefer kernels for all datatypes). Induced-transpose carried
out based on kernel preference check in both crc and ccr. we don't
require kernel preference check to apply Induced-transpose.
2. In case of ccr, B matrix is real. We must Induce a transposition and
perform C+=A*B (crc). where A (formerly B) is real.
3. In case of crc, A matrix is real. We don't require any
Induced-transpose.
AMD-Internal: [CPUPL-2440] [CPUPL-2449]
Change-Id: I44c53a20c8def7ddbb84797ba20260acec2086a2
Details:
1. Changes are made in ztrsm small MT path,to avoid accuracy issues
reported in libflame tests.
AMD-Internal: [CPUPL-2476]
Change-Id: Ic279106343fb1744e89ff4c920023adbe1d0158a