amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 01:30:00 +00:00

Author	SHA1	Message	Date
Mithun Mohan	8d8a8e2f19	Light-weight logging framework for LPGEMM. -A light-weight mechanism/framework to log input details and a stringified version of the post-ops structure is added to LPGEMM. Additionally the runtime of the API is also logged. The logging framework logs to a file with filename following the format aocl_gemm_log_<PID>_<TID>.txt. -To enable this feature, the AOCL_LPGEMM_LOGGER_SUPPORT=1 macro needs to be defined when compiling BLIS (with aocl_gemm addon enabled) by passing CFLAGS="-DAOCL_LPGEMM_LOGGER_SUPPORT=1" to ./configure. Additionally AOCL_ENABLE_LPGEMM_LOGGER=1 has to be exported in the environment during LPGEMM runtime. AMD-Internal: [SWLCSG-3280] Change-Id: I30bfb35b2dc412df70044601b335938fc9f49cfb	2025-01-03 11:28:57 +00:00
varshav2	d4e0fa9b4c	Revert duplicate check and fix bug in the check for post-ops - Revert of patch 1110983 - Duplicate check removal and early return for s8s8s32/u8s8s32 - Add fix - Added check to see if post-ops is enabled with col-major storage and return early in that case. Change-Id: Id3b8c97b6d1425dfb06f3b196e5acd60caee8fca	2024-08-29 06:52:14 -04:00
varshav2	e3c434080a	Fix duplicate check and early return in s8s8s32/u8s8s32 - removed the duplicate check for col-major inputs in s8s8s32/u8s8s32 APIs - Fixed the print in bench_lpgemm Change-Id: If40837b89927dd82d8aa6f620d1a7f2c24aed53c	2024-08-23 02:32:20 +05:30
mkadavil	d37c91dffa	Quantization (scale + zero point) support for BF16 LPGEMM api. -Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point) instead of just type conversion in aocl_gemm_bf16bf16f32obf16. -Support for multiple scale/sum/matrix_add/bias post-ops in a single LPGEMM api call. -Post-ops mask related fixes in lpgemv kernels. -Additional scale post-ops sanity checks. AMD-Internal: [SWLCSG-2945] Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb	2024-07-18 05:32:51 -04:00
Nallani Bhaskar	29db6eb42b	Added transB in all AVX512 based int8 APIs Description: --Added support for transB in u8s8s32o<s32|s8> and s8s8s32o<s32|s8> APIs --Updated the bench_lpgemm by adding options to support transpose of B matrix --Updated data_gen_script.py in lpgemm bench according to latest input format. AMD-Internal: [SWLCSG-2582] Change-Id: I4a05cc390ae11440d6ff86da281dbafbeb907048	2024-05-23 03:46:13 +05:30
eashdash	ef134dc49f	Added Trans A feature for all INT8 LPGEMM APIs 1. Added Trans A feature to handle column major inputs for A matrix. 2. Trans A is enabled by on-the-go pack of A matrix. 3. The on-the-go pack of A converts a column storage MCxKC block of A into row storage MCxKC block as LPGEMM kernels are row major kernels. 4. New pack routines are added for conversion of A matrix from column major storage to row major storage. 5. LPGEMM Cntx is updated with pack kernel function pointers. 6. Packing of A matrix: - Converts column major input A to row major in blocks of MCxKC with newly added pack A functions when cs_a > 1. 7. Pack routines are added for AVX512 and AVX2 INT8 LPGEMM APIs. 8. Trans A feature is now supported in: 1. u8s8s32os32/os8 2. u8s8s16os16/os8/ou8 3. s8s8s32os32/os8 4. s8s8s16os16/os8 AMD-Internal: SWLCSG-2582 Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf	2024-01-30 03:40:56 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Meghana Vankadari	77bd9a7f17	Added parameter checking for LPGEMM APIs Change-Id: I6ea89fd0d2516539e5a4e9cd8537570b23194d89	2023-11-09 21:50:55 -05:00
Meghana Vankadari	f8f4343b55	Updated cntx with packA function pointer for AVX512_VNNI support Details: - Modified bench to support testing for sizes where matrix strides are larger than the corresponding dimensions. - Modified early-return checks in all interface APIs to check validity of strides in relation to the corresponding dimension rather than checking if strides are equal to dimensions. Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4	2023-11-03 04:50:00 -04:00
mkadavil	ea0324ab95	Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8> Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. Currently the u8s8s16 flavor of api only supports downscaling to s8 (int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at int16_t. LPGEMM is modified to support downscaling to different data types, like u8, s16, apart from s8. The framework (5 loop) passes the downscale data type to the micro-kernels. Within the micro-kernel, based on the downscale type, appropriate beta scaling and output buffer store logic is executed. This support is only enabled for u8s8s16 flavor of APIs. The LPGEMM bench is also modified to support passing downscale data type for performance and accuracy testing. AMD-Internal: [SWLCSG-2313] Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30	2023-10-12 09:19:56 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
eashdash	d21cd51fde	Accumulation type for alpha, beta values and BF16 bench integration 1. Correcting the type of alpha, and beta values from C_type (output type) to accumulation type. For the downscaled LPGEMM APIs, C_type will be the downscaled type but the required type for alpha and beta values should be the accumulation type. 2. BF16 bench integration with the LPGEMM bench for both the BF16 (bf16bf16f32of32 and bf16bf16f32obf16) APIs AMD-Internal: [CPUPL-2561] Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797	2022-09-23 05:00:49 -04:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
mkadavil	f63e699c08	Fix for segmentation fault in low precision gemm. - Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL when compiled without open mp threading support. Subsequently the code is executed as if it is single-threaded. However, when B matrix needs to be packed, communicators are required (irrespective of single or multi-threaded), and the code accesses lpgemm_thrinfo_t for the same without NULL check. This results in seg fault. For the fix, a non-open mp thread decorator layer is added, which creates a placeholder lpgemm_thrinfo_t object with a communicator before invoking the 5 loop algorithm. This object will be used for packing. - Makefile for compilation of aocl_gemm bench. AMD-Internal: [CPUPL-2304] Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc	2022-07-21 11:51:40 +05:30
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
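
Commit bf4d1da1b9 describes handling column-major inputs by swapping the operands and running the existing row-major kernels on the transposed problem, and commit f8f4343b55 relaxes the stride checks so that a stride only has to be at least as large as the matching dimension. The sketch below illustrates that idea with a plain reference GEMM. It is not the BLIS implementation: the function names `ref_gemm_row_major` and `ref_gemm_col_major` are hypothetical, and the `assert` stands in for whatever early-return checks the real interface performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference kernel: row-major C (m x n) = A (m x k) * B (k x n).
 * lda, ldb and ldc are row strides and may be larger than the row widths. */
void ref_gemm_row_major(size_t m, size_t n, size_t k,
                        const float *a, size_t lda,
                        const float *b, size_t ldb,
                        float *c, size_t ldc)
{
    /* A stride is valid when it is >= the dimension it spans,
     * not only when it is equal to it. */
    assert(lda >= k && ldb >= n && ldc >= n);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = acc;
        }
    }
}

/* Column-major C (m x n) = A (m x k) * B (k x n).
 * A column-major matrix occupies the same memory as its transpose stored
 * row-major, and C^T = B^T * A^T, so the row-major kernel is called with
 * the operands and the m/n dimensions swapped. */
void ref_gemm_col_major(size_t m, size_t n, size_t k,
                        const float *a, size_t lda,
                        const float *b, size_t ldb,
                        float *c, size_t ldc)
{
    ref_gemm_row_major(n, m, k, b, ldb, a, lda, c, ldc);
}
```

Because the swap reinterprets memory and moves no data, the column-major path needs no new kernels in this sketch. The output buffer is written as C^T in row-major order, which is exactly C in column-major order.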

20 Commits
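
Several of the commits above (958c9238ac, ea0324ab95, d37c91dffa) concern downscaling: the GEMM result is accumulated at a higher precision and then converted to a narrower type, optionally with a scale factor and zero point. The following is a minimal scalar sketch of that step for an int32_t accumulator and an int8_t output. The function name `downscale_s32_to_s8`, the rounding mode (ties away from zero), and the saturation behaviour are assumptions made for illustration and are not taken from the LPGEMM kernels.

```c
#include <stdint.h>

/* Hypothetical helper: out = saturate_s8(round(acc * scale) + zero_point),
 * computed in float. `scale` is assumed to be finite. */
int8_t downscale_s32_to_s8(int32_t acc, float scale, int32_t zero_point)
{
    float v = (float)acc * scale + (float)zero_point;

    /* Saturate before converting so the float-to-int cast is always in range. */
    if (v >= 127.0f)
        return INT8_MAX;
    if (v <= -128.0f)
        return INT8_MIN;

    /* Round to nearest, ties away from zero (an assumption of this sketch). */
    return (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}
```

A vectorized kernel would apply the same arithmetic per lane. The scalar form is only meant to make the order of operations (scale, add zero point, saturate, narrow) explicit.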