amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-04 06:21:12 +00:00

Author	SHA1	Message	Date
Meghana Vankadari	bfc512d3e1	Implemented batch_gemm for bf16bf16f32of32\|bf16 Details: - The batch matmul performs a series of matmuls, processing more than one GEMM problem at once. - Introduced a new parameter called batch_size for the user to indicate number of GEMM problems in a batch/group. - This operation supports processing GEMM problems with different parameters including dims,post-ops,stor-schemes etc., - This operation is optimized for problems where all the GEMMs in a batch are of same size and shape. - For now, the threads are distributed among different GEMM problems equally irrespective of their dimensions which leads to better performance for batches with identical GEMMs but performs sub-optimally for batches with non-identical GEMMs. - Optimizations for batches with non-identical GEMMs is in progress. - Added bench and input files for batch_matmul. AMD-Internal: [SWLCSG-2944] Change-Id: Idc59db5b8c5794bf19f6f86bcb8455cd2599c155	2025-01-03 03:28:32 -05:00
Meghana Vankadari	fbb72d047f	Added group quantization and zero-point support for WOQ kernels Description: 1. Added group quantization and zero-point (zp) in aocl_gemm_bf16s4f32o<bf16\|f32> API. 2. Group quantization is a technique to improve accuracy where the scale factors used to dequantize weights vary at group level instead of per-channel or per-tensor level. 3. Added zp and scaling in woq packb kernels so that for large M values zp and scaling are performed at pack-b stage and bf16 kernels are called. 4. Adding zp support and scaling to default path in WoQ kernels created some performance overhead when M value is very small. 5. Added string group_size to lpgemm bench to read group size from bench_input.txt and tested for various combinations of matrix dimensions. 6. The scale factors could be of type float or bf16 and the zero-point values are expected to be in int8 format. AMD-Internal: [SWLCSG-3168, SWLCSG-3172] Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57	2024-12-02 06:46:13 +00:00
Mithun Mohan	097cda9f9e	Adding support for AOCL_ENABLE_INSTRUCTIONS for f32 LPGEMM API. -Currently lpgemm sets the context (block sizes and micro-kernels) based on the ISA of the machine it is being executed on. However this approach does not give the flexibility to select a different context at runtime. In order to enable runtime selection of context, the context initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env variable and set the context based on the same. As part of this commit, only f32 context selection is enabled. -Bug fixes in scale ops in f32 micro-kernels and GEMV path selection. -Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512). This is only for B matrix and helps remove dependency of f32 lpgemm api on the BLIS packing framework. AMD Internal: [CPUPL-5959] Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6	2024-10-30 08:52:22 +00:00
Meghana Vankadari	5120f98e12	Developed all WoQ kernels for bf16s4f32o<f32\|bf16> Description: 1. Written 6x64 main and other fringe kernels for WoQ where scaling s4 weights into bf16 performed in the kernel itself to reduce bandwidth. 2. These kernels are performing better compared to bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289	2024-09-10 12:00:10 +00:00
Mithun Mohan	cf123aa926	Disabling smart threading for small input dimensions. -It has been observed that reduction of threads as part of smart threading for smaller input dimensions hampers the performance of the other inputs with larger dimensions due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (small m and n) fixes this issue. -Bug fixes related to work split in LPGEMV for n < NR and m < MR cases. AMD Internal: [SWLCSG-2948] Change-Id: I0117dc0ea6820a9fac8e14f93374b54a7d80c121	2024-09-06 09:20:42 -04:00
mkadavil	1257eaf72d	Disabling smart threading for bandwidth bound input patterns. For some applications, one of the input dimensions is mostly m < MR or n < NR with the other dimension being small for the most part, with intermittent large ones. Currently in these cases (m < MR or n < NR), the number of threads used is reduced (as part of smart threading) if the other dimension (n or m) is also small. For larger dimensions all the threads are used. However it has been observed that this reduction of threads hampers the performance of the larger inputs due to lower operating frequency of the newly launched threads (apart from the existing ones). Disabling smart threading for these bandwidth bound input patterns (m < MR or n < NR) fixes this issue. AMD Internal: [SWLCSG-2948] Change-Id: I5334860cf4411ea4504d2e6bc598b9904780bbbf	2024-09-02 02:18:45 +05:30
Deepak Negi	6dcf500703	Element wise operations API for float(f32) input matrix in LPGEMM. This API supports applying element wise operations (eg: post-ops) on a float(f32) input matrix to get an output matrix of the same (float(f32)). Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24 AMD Internal: [SWLCSG-2947] Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24	2024-08-27 03:28:52 -04:00
mkadavil	f040ba617f	Element wise operations API for bfloat16 input matrix in LPGEMM. -This API supports applying element wise operations (eg: post-ops) on a bfloat16 input matrix to get an output matrix of the same(bfloat16) or upscaled data type (float). -Benchmarking/testing framework for the same is added. AMD Internal: SWLCSG-2947 Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2	2024-08-05 07:17:08 -04:00
Nallani Bhaskar	c6dd7c1b4b	Added new API in aocl_gemm to support A bf16 data type and B s4 data type Description: 1. Added a new API aocl_gemm_bf16s4f32of32 to support WoQ (Weight-only-Quantization) in LLMs 2. The API supports only reordered B matrix of data size signed 4 bits (S4). 3. Subtracting zero point and multiplying with scale on B matrix is performed in packing B. 4. Zero point and scale data should be passed by user through pre-ops data structure. 5. The API is still in experimental state and NOT tested. AMD-Internal: SWLCSG-2943 Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee	2024-07-25 11:59:03 +00:00
Bhaskar Nallani	21d6ab6a21	Improved thread balancing for aocl_gemm f32 API Description: 1. Updated the thread partition logic for aocl_gemm_f32f32f32of32 for m<MR, n<NR cases and also balanced thread in m, n directions such that each thread gets equal amount of work and not to span thread without any work. 2. Disabled dynamic enabling of packing of a and b matrixes for smaller sizes for genoa architecture. AMD-Internal: [SWLCSG-2353 , SWLCSG-2391] Change-Id: I03b2c50e592c2e9d336ea84c0e0394af63a34cec	2023-11-24 03:45:44 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
mkadavil	ed052c6c44	Smart Threading for LPGEMM <u\|s>8s8s<16\|32>os<8\|16\|32> API. The LPGEMM micro-kernel operates on blocks of dimension MRxKC and KCxNR. Current LPGEMM design involves using all the available threads for computing the output. If the number of threads assigned along ic or jc direction is more than M/MR or N/NR blocks respectively, it could result in threads sleeping due to the lack of MR or NR blocks. This scenario is now handled by reducing the number of threads if there are threads without any work (MR or NR blocks). AMD-Internal: [SWLCSG-2354, SWLCSG-2389, SWLCSG-2267] Change-Id: I74819337c7a0d3ab05ea0e18bb42780f977ea8f6	2023-11-09 00:50:30 -05:00
mkadavil	ea0324ab95	Multi data type downscaling support for u8s8s16 - u8s8s16<u8\|s8> Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. Currently the u8s8s16 flavor of api only supports downscaling to s8 (int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at int16_t. LPGEMM is modified to support downscaling to different data types, like u8, s16, apart from s8. The framework (5 loop) passes the downscale data type to the micro-kernels. Within the micro-kernel, based on the downscale type, appropriate beta scaling and output buffer store logic is executed. This support is only enabled for u8s8s16 flavor of api's. The LPGEMM bench is also modified to support passing downscale data type for performance and accuracy testing. AMD-Internal: [SWLCSG-2313] Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30	2023-10-12 09:19:56 -04:00
mkadavil	e5e9127a68	Fixes for aocl_gemm addon compilation issues Certain functions were updated recently and now takes extra arguments for error handling. Usage of the same are now updated in aocl_gemm. Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c	2023-09-06 16:00:34 +05:30
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
eashdash	a72fff2be9	Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8 1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added. 2. New interface, frame and kernel files are added. 3. Frame and kernel level files added and modified for s8s8s16 4. s8s8s16 type involves design changes of 2 operations - Pack B and Mat Mul 5. Pack B kernel routines to pack B matrix for s16 FMA and compute the sum of every column of B matrix to implement the s8s8s16 operation using the s16 FMA instructions. 5. Mat Mul Kernel files to compute the GEMM output using s16 FMA. Here the A matrix elements are converted from int8 to uint8 (s16 FMA works with A matrix type uint8 only) by adding extra 128 to every A matrix element 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. Final C = C - ( (sum of column of B matrix) * 128 ) This is done to compensate for the addition of extra 128 to every A matrix elements 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s16os16 and s8s8s16os8. 8. All previously added post-ops are supported on s8s8os16/os8 also. AMD-Internal: [CPUPL-3234] Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c	2023-04-21 05:30:38 -04:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
mkadavil	8dff49837d	Lpgemm source restructuring to support amdzen config. -Currently lpgemm can only be built using either zen3 or zen4 config. The lpgemm kernel code is re-structured to support amdzen, and thus multi machine deployment. -The micro-kernel calls (gemm and pack) are currently hardcoded in the lpgemm framework. This is removed and a new lpgemm_cntx based dispatch mechanism is designed to support runtime configurability for micro-kernels. AMD-Internal: [CPUPL-2965] Change-Id: I4bbcb4e5db767def1663caf5481f0b4c988149ef	2023-02-21 08:35:38 -05:00
bhaskarn	91a9968a5e	Developed intrinsic based f32 kernels in lpgemm Description: 1. Developed row variant intrinsic kernels for float32/sgemm which are called from lpgemm api aocl_gemm_f32f32f32of32() 2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are developed using avx512. 3. 6x16m main kernel and respective n fringe and mn fringe are developed based on avx2 and avx 4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic dispatch are planned next 5. When leading dims are greater than dims, bench_lpgemm needs to be updated to test it and this is planned next. Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370	2023-02-20 01:11:22 -05:00
mkadavil	4b5e24d0d9	Column major input support for f32 gemm (sgemm for lpgemm). -The f32 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. AMD-Internal: [CPUPL-2919] Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3	2023-01-16 04:04:21 -05:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
eashdash	e1349c0c71	LPGEMM BF16 MT panel based balancing Introduced multi-thread panel based balancing for BF16 to improve the overall MT performance. AMD-Internal: [CPUPL-2502] Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25	2022-09-07 03:20:50 -04:00
mkadavil	958c9238ac	Output downscaling support for low precision GEMM. - Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. This is required in AI workloads where quantization/dequantization routines are used. - New GEMM APIs are introduced specifically to support this use case. Currently downscaling support is added for s32, s16 and bfloat16 GEMM. AMD-Internal: [CPUPL-2475] Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf	2022-08-30 10:27:19 -04:00
mkadavil	a7d1cc7369	Multi-Threading support for BFloat16 gemm. -OpenMP based multi-threading support added for BFloat16 gemm. Both gemm and reorder api's are parallelized. -Multi-threading support for u8s8s16 reorder api. -Typecast issues fixed for bfloat16 gemm kernels. AMD-Internal: [CPUPL-2459] Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109	2022-08-30 02:54:19 -04:00
eashdash	4e3e00fb7e	Added low precision GEMM - bf16bf16f32of32 Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float. 1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps 2. Kernel for packing B matrix is provided Change-Id: If5d08213068869eff060c9998596d2d2703a6793	2022-08-24 03:27:00 -04:00
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
Harihara Sudhan S	d1eaf65a26	Post-Ops for u8s8s16os16 Functionality - Post-ops is an operation performed on every element of the output matrix after GEMM operation is completed. - Post-ops relu and bias added to all the compute kernels of u8s8s16os16 - Post-ops are done on the value loaded into the register to avoid reloading of C matrix elements - Minor bug fixes in openmp thread decorator of lpgemm - Added test cases to lpgemm bench input file AMD-Internal: [CPUPL-2171] Change-Id: If49f763fdfac19749f6665c172348691165d8631	2022-08-09 14:52:41 +05:30
Harihara Sudhan S	60de0a1856	Multithreading and support for unpacked B matrix in u8s8s16os16 Functionality - When the B matrix is not reordered before the u8s8s16os16 compute kernel call, packing of the B matrix is done as part of the five loop algorithm. The state of B matrix (packed or unpacked) is given as a user input. - Packing of B matrix is done as part of the five loop compute. - Temporary buffer for pack B is allocated in the five loop algorithm - Multithreading for computation kernel - Configuration constants for u8s8s16os16 are part of the lpgemm config AMD-Internal: [CPUPL-2171] Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8	2022-08-05 19:28:37 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
mkadavil	f63e699c08	Fix for segmentation fault in low precision gemm. - Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL when compiled without open mp threading support. Subsequently the code is executed as if it is single-threaded. However, when B matrix needs to be packed, communicators are required (irrespective of single or multi-threaded), and the code accesses lpgemm_thrinfo_t for the same without NULL check. This results in seg fault. For the fix, a non-open mp thread decorator layer is added, which creates a placeholder lpgemm_thrinfo_t object with a communicator before invoking the 5 loop algorithm. This object will be used for packing. - Makefile for compilation of aocl_gemm bench. AMD-Internal: [CPUPL-2304] Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc	2022-07-21 11:51:40 +05:30
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
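
Commits a72fff2be9 and bd8cd763ff above describe computing a signed-by-signed int8 GEMM with instructions that expect an unsigned A operand: every element of A is shifted by 128, and the result is corrected afterwards with Final C = C - ( (sum of column of B matrix) * 128 ). The arithmetic can be checked with a small sketch. This is not the BLIS kernel code; the function names are hypothetical, matrices are plain nested lists, and Python's unbounded integers stand in for the s16/s32 accumulators, so saturation and overflow behaviour are not modelled.

```python
# Illustrative sketch only; names are hypothetical and not part of BLIS.

def matmul(a, b):
    """Plain integer product of a (m x k) and b (k x n), given as nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def s8s8_gemm_with_offset(a_s8, b_s8):
    """Compute A * B for int8 A by first shifting A into the uint8 range."""
    k, n = len(b_s8), len(b_s8[0])

    # Step 1: add 128 to every element of A so that it fits in [0, 255].
    a_u8 = [[x + 128 for x in row] for row in a_s8]

    # Step 2: while "packing" B, record the sum of each column of B.
    col_sum_b = [sum(b_s8[p][j] for p in range(k)) for j in range(n)]

    # Step 3: multiply the shifted A by B. Each output element now carries
    # an extra 128 * (sum of the matching column of B).
    c_shifted = matmul(a_u8, b_s8)

    # Step 4: subtract that extra term to recover the true product.
    return [[c_shifted[i][j] - 128 * col_sum_b[j] for j in range(n)]
            for i in range(len(a_s8))]
```

The correction works because (A + 128) * B = A * B + 128 * (1 * B), and row i, column j of the all-ones product 1 * B is just the sum of column j of B, independent of i. That is why the column sums can be computed once at pack time.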

32 Commits
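
Commits 4b5e24d0d9 and bf4d1da1b9 above describe handling column-major inputs by swapping the input matrices and running the existing row-major kernels on the transposed problem. A minimal sketch of that idea follows, under the assumption that matrices are flat buffers with a leading dimension; the function names and argument order are hypothetical and do not mirror the aocl_gemm API. The key fact is that the column-major buffer of an m x n matrix is byte-for-byte the row-major buffer of its n x m transpose, so C = A * B in column-major can be obtained as C^T = B^T * A^T in row-major without copying anything.

```python
# Illustrative sketch only; names and signatures are hypothetical.

def gemm_row_major(m, n, k, a, lda, b, ldb, c, ldc):
    """C (m x n) = A (m x k) * B (k x n); all buffers are row-major flat lists."""
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i * lda + p] * b[p * ldb + j]
            c[i * ldc + j] = acc


def gemm_col_major(m, n, k, a, lda, b, ldb, c, ldc):
    """Same product, but A, B and C are column-major flat lists.

    Reinterpreted as row-major, the buffers hold A^T (k x m), B^T (n x k)
    and C^T (n x m), so swapping the operands and the m/n dimensions lets
    the row-major routine compute C^T = B^T * A^T in place.
    """
    gemm_row_major(n, m, k, b, ldb, a, lda, c, ldc)


# Usage: A is 2 x 3 stored column-major with a padded leading dimension of 3.
a = [1, 4, 0,  2, 5, 0,  3, 6, 0]   # columns (1,4), (2,5), (3,6) plus padding
b = [7, 9, 11,  8, 10, 12]          # B is 3 x 2, columns (7,9,11), (8,10,12)
c = [0] * 4                         # C is 2 x 2, column-major, ldc = 2
gemm_col_major(2, 2, 3, a, 3, b, 3, c, 2)
```

The padded leading dimension in the usage example corresponds to the "leading dimensions that are greater than matrix widths" case mentioned in the same commits: padding elements are skipped because indexing always goes through lda, ldb and ldc.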