amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-04-20 07:38:53 +00:00

Author	SHA1	Message	Date
mkadavil	864170f5cb	Scalar value support for zero-point and scale-factor. -As it stands, in LPGEMM, users are expected to pass an array of values with length the same as N dimension as inputs for zero point or scale factor. However at times, a single scalar value is used as zero point or scale factor for the entire downscaling operation. The mandate to pass an array requires the user to allocate extra memory and fill it with the scalar value so as to be used in downscaling. This limitation is lifted as part of this commit, and now scalar values can be passed as zero point or scale factor. -LPGEMM bench enhancements along with new input format to improve readability as well as flexibility. AMD-Internal: [SWLCSG-2581] Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f	2024-01-12 04:37:35 +05:30
Meghana Vankadari	6567df7b12	bf16bf16f32o<bf16\|f32> Fix for scaling issue when transA is enabled. Details: - LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout memory when A matrix needs to be packed. This multi-threaded lock overhead becomes prominent when m/n dimensions are relatively small, even when k is large. In order to address this, bli_pba_acquire_m is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is allocated using aligned malloc instead of checking out from memory pool. Experiments have shown malloc costs to be far lower than memory pool guarded by locks, especially for higher thread count. - Deleted few unnecessary instructions from packing kernels. - Replaced bench_input.txt with lesser number of inputs. AMD-Internal: [CPUPL-4329] Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774	2023-12-21 04:53:32 +05:30
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
Meghana Vankadari	77bd9a7f17	Added parameter checking for LPGEMM APIs Change-Id: I6ea89fd0d2516539e5a4e9cd8537570b23194d89	2023-11-09 21:50:55 -05:00
Meghana Vankadari	0c12b72651	LPGEMM bench enhancements Details: - Moved the downscale & postop options from the command line to the input file. - Now the format of the input file is as follows: dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops - In case of no-postops, 'none' has to be passed in the place of postops. - Removed duplication of mat_mul_bench_main function for bf16 APIs. - Added a function called print_matrix for each datatype which can help in printing matrices while debugging. - Added printing of ref, computed and diff values while reporting failure. - Added new functions for memory allocation and freeing. The type of memory allocation is chosen based on the mode in which the bench is running (performance or accuracy mode). Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72	2023-11-09 03:55:10 -05:00
Eashan Dash	c3d1a3878c	Parallelized Pack and Compute Extension APIs 1. OpenMP based multi-threading parallelism is added for BLAS extension APIs of Pack and Compute 2. Both pack and compute APIs are parallelized. 3. Multi-threading of pack and compute APIs done with different number of threads can lead to inconsistent results due to output difference of the full packed matrix buffer when packed with different number of threads. 4. In multi-threaded execution, we ensure output of packed buffer is exactly the same as in single-threaded execution. 5. Similarly for compute API, read of packed buffer in multi-threaded execution is exactly the same as in single-threaded execution. 6. Routines are added to compute the offsets for thread workload distribution for MT execution. 1. The offsets are calculated in such a way that it resembles the reorder buffer traversal in single-threaded reordering. 2. The panel boundaries (KCxNC) remain as it is accessed in single thread, and as a consequence a thread with jc_start inside the panel cannot consider NC range for reorder. 3. It has to work with NC' < NC, and the offset is calculated using prev NC panels spanning k dim + cur NC panel spanning pc loop cur iteration + (NC - NC') spanning current kc0 (<= KC). 7. Routines to ensure the same are added for MT execution 1. frame/base/bli_pack_compute_utils.c 2. frame/base/bli_pack_compute_utils.h AMD-Internal: [CPUPL-3560] Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac	2023-11-03 08:47:17 -04:00
Meghana Vankadari	f8f4343b55	Updated cntx with packA function pointer for AVX512_VNNI support Details: - Modified bench to support testing for sizes where matrix strides are larger than the corresponding dimensions. - Modified early-return checks in all interface APIs to check validity of strides in relation to the corresponding dimension rather than checking if strides are equal to dimensions. Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4	2023-11-03 04:50:00 -04:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match with the reference erf float implementation, which reduced errors at the borders, and also clipped the output to 1.0 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs 3. Updated lpgemm bench [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Meghana Vankadari	ac3e8ff01b	Bug fix and enhancements in bf16bf16f32obf16\|f32 Details: - Updated pack function call in ic loop to accept correct params. - Modified documentation in bench file to reflect updated usage of bench for downscaled APIs. - Modified memory allocation for C panel in BF16 APIs to use BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool. Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5	2023-10-25 23:28:31 +05:30
mkadavil	26d1ab5ebc	<u\|s>8s8s<16\|32>os8 memory allocation fix to circumvent scaling issue. -When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_<A_BLOCK\|B_PANEL\|C_PANEL>, the memory is allocated by checking out a block from an internal memory pool. In order to ensure thread safety, the memory pool checkout is protected using mutex (bli_pba_lock/bli_pba_unlock). When the number of threads trying to checkout memory (in parallel) is high, these locks tend to become a scaling bottleneck, especially when the memory is to be used for non-packing purposes (packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is enabled for temporary C accumulation. This multi-threaded lock overhead becomes prominent when m/n dimensions are relatively small, even when k is large. In order to address this, bli_pba_acquire_m is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is allocated using aligned malloc instead of checking out from memory pool. Experiments have shown malloc costs to be far lower than memory pool guarded by locks, especially for higher thread count. -LPGEMM bench fixes for crash observed when benchmarking with post-ops enabled and no downscale. AMD-Internal: [SWLCSG-2354] Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50	2023-10-23 10:00:32 -04:00
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Meghana Vankadari	eb5ab3f762	LPGEMM: Added transB support for bf16bf16f32o<bf16\|f32> APIs Details: - Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs to allow reordering from column major input matrix. - Added new pack kernels that packs/reorders B matrix from column-major input format. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [CPUPL-2268] Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c	2023-10-12 23:36:18 +05:30
mkadavil	ea0324ab95	Multi data type downscaling support for u8s8s16 - u8s8s16<u8\|s8> Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. Currently the u8s8s16 flavor of api only supports downscaling to s8 (int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at int16_t. LPGEMM is modified to support downscaling to different data types, like u8, s16, apart from s8. The framework (5 loop) passes the downscale data type to the micro-kernels. Within the micro-kernel, based on the downscale type, appropriate beta scaling and output buffer store logic is executed. This support is only enabled for u8s8s16 flavor of api's. The LPGEMM bench is also modified to support passing downscale data type for performance and accuracy testing. AMD-Internal: [SWLCSG-2313] Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30	2023-10-12 09:19:56 -04:00
Meghana Vankadari	4874895a68	LPGEMM: Added transA support for bf16bf16f32o<bf16\|f32> APIs Details: - Added new params(order, trans) to aocl_get_reorder_buf_size_ and aocl_reorder_ APIs. - Added new pack kernels that packs A matrix from either row-major or column major input matrix to pack buffer with row-major format. - Updated cntx with pack kernel function pointers for packing A matrix. - Transpose of A matrix is handled by packing A matrix to row-major format during run-time. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [SWLCSG-2268, SWLCSG-2442] Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4	2023-10-11 07:16:08 -04:00
mkadavil	c3b97559c1	Zero Point support for <u\|s>8s8s<32\|16>os8 LPGEMM APIs -Downscaled / quantized value is calculated using the formula x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels for these APIs only support scaling. Zero point addition is implemented as part of this commit, with it being fused as part of the downscale post-op in the micro-kernel. The zero point input is a vector of int8 values, and currently only vector based zero point addition is supported. -Bench enhancements to test/benchmark zero point addition. AMD-Internal: [SWLCSG-2332] Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea	2023-10-10 12:05:47 +05:30
Kiran Varaganti	db4fbfe9a6	Fix compiler error for "inline" functions in LPGEMM bench Application Functions which are declared as "inline" may trigger the error "undefined function". This linker error is eliminated by using "static" before "inline". Therefore added "static" before all inline functions. Change-Id: I5952fb71112fc4792011c3e29be930ccfbce4562	2023-09-27 02:26:23 -04:00
mkadavil	e5e9127a68	Fixes for aocl_gemm addon compilation issues Certain functions were updated recently and now takes extra arguments for error handling. Usage of the same are now updated in aocl_gemm. Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c	2023-09-06 16:00:34 +05:30
Eleni Vlachopoulou	660cd6d1b2	Adding nrm2 target for benchmarking on Windows. Modifying blis/bench/CMakeLists.txt to include nrm2 target and produce the corresponding executable. AMD-Internal: [CPUPL-3625] Change-Id: I7945416142e07ac99510ed9500a2c620053c7e13	2023-07-10 14:03:05 -04:00
mkadavil	b167e47091	LPGEMM frame and micro-kernel updates to fix gcc9.4 compilation issue. -Micro-kernel: Some AVX512 intrinsics(eg: _mm512_loadu_epi32) were introduced in later versions of gcc (>10) in addition to already existing masked intrinsic(eg: _mm512_mask_loadu_epi32). In order to support compilation using gcc 9.4, either the masked intrinsic or other gcc 9.4 compatible intrinsic needs to be used (eg: _mm512_loadu_si512) in LPGEMM Zen4 micro-kernels. -Frame: BF16 LPGEMM api's (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32) needs to be disabled if aocl_gemm (LPGEMM) addon is compiled using gcc 9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels for BF16 LPGEMM is excluded from compilation based on GNUC macro. AMD-Internal: [CPUPL-3396] Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7	2023-05-11 18:00:18 +05:30
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
eashdash	a72fff2be9	Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8 1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added. 2. New interface, frame and kernel files are added. 3. Frame and kernel level files added and modified for s8s8s16 4. s8s8s16 type involves design changes of 2 operations - Pack B and Mat Mul 5. Pack B kernel routines to pack B matrix for s16 FMA and compute the sum of every column of B matrix to implement the s8s8s16 operation using the s16 FMA instructions. 5. Mat Mul Kernel files to compute the GEMM output using s16 FMA. Here the A matrix elements are converted from int8 to uint8 (s16 FMA works with A matrix type uint8 only) by adding extra 128 to every A matrix element 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. Final C = C - ( (sum of column of B matrix) * 128 ) This is done to compensate for the addition of extra 128 to every A matrix elements 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s16os16 and s8s8s16os8. 8. All previously added post-ops are supported on s8s8os16/os8 also. AMD-Internal: [CPUPL-3234] Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c	2023-04-21 05:30:38 -04:00
mkadavil	3572baa9d3	aocl_softmax_f32 api's for softmax computation as part of lpgemm. -Softmax is often used as the last activation function in a neural network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn))). This step happens after the final low precision gemm computation, and it helps to have the softmax functionality that can be invoked as part of the lpgemm workflow. In order to support this, a new api, aocl_softmax_f32 is introduced as part of aocl_gemm. This api computes element-wise softmax of a matrix/vector of floats. This api invokes ISA specific vectorized micro-kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3247] Change-Id: If15880360947435985fa87b6436e475571e4684a	2023-04-21 05:26:08 -04:00
mkadavil	ffa72f09cc	Support for multiple eltwise post-ops in low precision gemm. -Currently only one eltwise post-op (one of relu/prelu/gelu_tanh/gelu_erf) is supported in the post-op struct along with bias or downscale. This setup was sufficient when only activation functions were supported as eltwise post-ops. But with the introduction of clip post-op (a type of non-activation eltwise operation), it has become necessary to extend the post-ops framework to support multiple eltwise operations, with the multiple eltwise often used in the form activation eltwise op + non-activation eltwise ops. The aocl post-op struct is modified and the post-op parser is updated to support this use case. -The lpgemm_bench is updated to support testing/benchmarking of the multiple eltwise operations use case. The function for accuracy checking is modified to support correctness testing irrespective of the order and count of post-ops. Additionally the help message is updated so as to better describe the capabilities of lpgemm_bench. AMD-Internal: [CPUPL-3244] Change-Id: If4ce8d7261d32073da8fa4757ed4f2ea0e94249f	2023-04-20 07:24:32 -04:00
mkadavil	99d10c3f88	Low precision gemm u8s8s16 downscale optimization. -Similar to downscale optimizations made for u8s8s32 gemm, the following optimizations are made to improve the downscale performance for u8s8s16 gemm: a. The store to temporary s16 buffer can be avoided when k < KC since intermediate accumulation will not be required for the pc loop (only 1 iteration). The downscaled values (s8) are written directly to the output C matrix. b. Within the micro-kernel when beta != 0, the s8 data from the original C output matrix is loaded to a register, converted to s16 and beta scaling applied on it. The previous design of copying the s8 value to the s16 temporary buffer inside jc loop and using the same in beta scaling is removed. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s16 micro-kernels. Alpha scaling is now only done when alpha != 1. AMD-Internal: [CPUPL-3237] Change-Id: If25f9d1de8b9b8ffbe1bd7bce3b7b0b5094e51ef	2023-04-19 06:40:06 -04:00
mkadavil	e23765010d	aocl_gelu_<tanh\|erf>_f32 api's for gelu computation as part of lpgemm. -Currently in aocl_gemm, gelu (both tanh and erf based) computation is only supported as a post-op as part of low precision gemm api call (done at micro-kernel level). However gelu computation alone without gemm is required in certain cases for users of aocl_gemm. -In order to support this, two new api's - aocl_gelu_tanh_f32 and aocl_gelu_erf_f32 are introduced as part of aocl_gemm. These api's compute element-wise gelu_tanh and gelu_erf respectively of a matrix/vector of floats. Both the api's invoke ISA specific vectorized micro-kernels (vectorized only when incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used to dispatch to the appropriate kernel. AMD-Internal: [CPUPL-3218] Change-Id: Ifebbaf5566d7462288a9a67f479104268b0cc704	2023-04-17 05:15:56 -04:00
eashdash	12c97021a1	Added New Post-Op - Custom Clipping for LPGEMM and SGEMM 1. Custom Clip is an element-wise post-op which is used to clip the accumulated GEMM output within a certain range. 2. The Clip Post-Op is used in downscaled and non-downscaled LPGEMM APIs and SGEMM. 3. Changes are done at frame and microkernel level to implement this post-op. 4. Different versions are implemented - AVX-512, AVX-2, SSE-2 to enable custom clipping for various LPGEMM types and SGEMM AMD-Internal: [CPUPL-3207] Change-Id: I71c60be69e5a0dc47ca9336d58181c097b9aa0c6	2023-04-17 04:38:20 -04:00
mkadavil	5e510727a9	Masked load/store to replace copy macros in u8s8s32 micro-kernels. -As part of an earlier optimization, the memcpy function call in k fringe ((k % 4) != 0 case, to utilize vpdpbusd instruction) and n fringe (n < 16 - beta scale and C store) were replaced with copy macros specifically optimized for less than 4 and 16 elements each. However upon further analysis it was observed that masked load/broadcast and masked store performed better on average than the copy macros. The copy macros contained more if conditions, which resulted in more branching and thus resulting in perf variations. It was also noted that code generation varied a lot based on the compilers when using the copy macros due to the extra conditional code. -As part of this change, the copy macros are completely replaced with masked load/broadcast/store. Performance was observed to be better and less prone to variations for the k fringe and n fringe (< 16) cases. AMD-Internal: [CPUPL-3173] Change-Id: I73e6e65302ecf02e1397541b4a32b2a536f19503	2023-04-13 09:17:26 -04:00
eashdash	bd8cd763ff	Added NEW LPGEMM TYPE- S8S8S32/S8 1. New LPGEMM type - S8S8S32/S8 is added. 2. New interface, frame and kernel files are added. 3. Frame and kernel files added/modified for S8S8S32/S8 have 2 operations - Pack B and Mat Mul 4. Pack B kernel routines to pack B matrix for VNNI and compute the sum of every column of B matrix to implement the S8S8S32 operation using the VNNI instructions. 5. Mat Mul Kernel files to compute the GEMM output using the VNNI. Here the A matrix elements are converted from int8 to uint8 (VNNI works with A matrix type uint8 only). 6. Post GEMM computation, additional operations are performed on the accumulated outputs to get the correct results. 7. With this change, two new LPGEMM APIs are introduced in LPGEMM - s8s8s32os32 and s8s8s32os8. 8. All previously added post-ops are supported on S8S8S32/S8 also. AMD-Internal: [CPUPL-3154] Change-Id: Ib18f82bde557ea4a815a63adc7870c4234bfb9d3	2023-03-31 05:44:54 -04:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
eashdash	e36f699939	Implemented ERF Based GeLU Activation for LPGEMM and SGEMM 1. Implemented efficient AVX-512, AVX-2 and SSE-2 version of the error function - ERF 2. Added error function based GeLU activation post-ops for the S32, S16 and BF16 (LPGEMM) and SGEMM APIs. 3. Changes for this include frame and micro-kernel level changes in addition to adding the macro-based function definitions of the ERF function in the math-utils and gelu header files. AMD-Internal: [CPUPL-3036] Change-Id: Ie50f6dcabf8896b7a6d30bbc16aa44392cc512be	2023-03-13 06:10:31 -04:00
mkadavil	1f2447f800	Post-ops support for f32 gemm(aocl_gemm). - Bias add, relu, parametric relu and gelu post-ops support added in all f32 gemm micro-kernels. These post-ops are implemented for both AVX512 and AVX2 ISA based on the micro-kernel flavor. The support is added for both row and column major cases. - Lpgemm bench updates to support f32 post-ops. AMD-Internal: [CPUPL-3032] Change-Id: Ie6840b9d4e52d2086c1b5ff2e1de80dc0cad5476	2023-02-23 18:58:59 +05:30
mkadavil	63ee4c5e4c	Remove memcpy usage in u8s8s32 lpgemm micro kernels. -As of now, memcpy is used in u8s8s32 micro-kernel for copying in k fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for small k/n dimensions, memcpy invocation has high overhead. -This issue is fixed by replacing memcpy with a MACRO based implementation of copy routine, specifically optimized for the sizes that will be encountered in fringe cases (k < 4, NR' < 16). AMD-Internal: [CPUPL-3008] Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc	2023-02-16 05:52:19 -05:00
eashdash	672544bc04	GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16 1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16 2. Changes are done at frame and micro-kernel level to implement this post-op. 3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF functions are implemented for the GeLU post-operation. 4. TANH and EXPF math functions are efficiently implemented in macro-based fashion to exploit register level fusion of GeLU with GEMM operations for improved performance 5. LPGEMM bench is changed to pass GeLU post-op as input and support accuracy check to verify functional correctness AMD-Internal: [CPUPL-2978] Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28	2023-02-02 08:25:04 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
mkadavil	4b5e24d0d9	Column major input support for f32 gemm (sgemm for lpgemm). -The f32 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. AMD-Internal: [CPUPL-2919] Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3	2023-01-16 04:04:21 -05:00
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded instead of from the temporary buffer to a register, converted to s32 and beta scaling applied on it. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
Harihara Sudhan S	492555785a	Fixed bench accuracy issue in LPGEMM Description: - When the values of the result in s8 for u8s8s32 and u8s8s16 are close to 0, they are getting ceiled to 1. - Used nearbyintf to round the downscaled values in bench reference. This fixed the result mismatch issue between the vectorized kernel implementation and reference implementation in bench accuracy test. AMD-Internal: [CPUPL-2617] Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82	2022-10-07 09:48:21 +00:00
eashdash	d21cd51fde	Accumulation type for alpha, beta values and BF16 bench integration 1. Correcting the type of alpha, and beta values from C_type (output type) to accumulation type. For the downscaled LPGEMM APIs, C_type will be the downscaled type but the required type for alpha and beta values should be the accumulation type. 2. BF16 bench integration with the LPGEMM bench for both the BF16 (bf16bf16f32of32 and bf16bf16f32obf16) APIs AMD-Internal: [CPUPL-2561] Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797	2022-09-23 05:00:49 -04:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
Eleni Vlachopoulou	a5891f7ead	Adding AVX2 support for DNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [CPUPL-2551] Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c	2022-09-20 06:05:01 -04:00
mkadavil	9bc59cc500	Low Precision GEMM framework fixes for downscaling. - The temporary buffer allocated for C matrix when downscaling is enabled is not filled properly. This results in wrong gemm accumulation when beta != 0, and thus wrong output after downscaling. The C panel iterators used for filling the temporary buffer are updated to fix it. - Low precision gemm bench updated for testing/benchmarking downscaling. AMD-Internal: [CPUPL-2514] Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb	2022-09-13 07:42:29 -04:00
mkadavil	584069bf74	Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM. -Parametric ReLU is the generalization of leaky ReLU in which the leakage coefficient is tunable. The support for the same is added following the register-level fusion technique. -Low precision bench enhancement to check accuracy/performance of low precision gemm with PReLU. -Bug fixes in low precision gemm kernels. AMD-Internal: [CPUPL-2442] Change-Id: I81336405b185a994297d122b2d868b758ae6dad5	2022-08-25 13:33:02 +05:30
eashdash	4e3e00fb7e	Added low precision GEMM - bf16bf16f32of32 Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float. 1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps 2. Kernel for packing B matrix is provided Change-Id: If5d08213068869eff060c9998596d2d2703a6793	2022-08-24 03:27:00 -04:00
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
Harihara Sudhan S	d1eaf65a26	Post-Ops for u8s8s16os16 Functionality - Post-ops is an operation performed on every element of the output matrix after GEMM operation is completed. - Post-ops relu and bias added to all the compute kernels of u8s8s16os16 - Post-ops are done on the value loaded into the register to avoid reloading of C matrix elements - Minor bug fixes in openmp thread decorator of lpgemm - Added test cases to lpgemm bench input file AMD-Internal: [CPUPL-2171] Change-Id: If49f763fdfac19749f6665c172348691165d8631	2022-08-09 14:52:41 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
Harihara Sudhan S	e5d4fc2a70	Added low precision GEMM (u8s8s16os16) Feature Addition : Added low precision GEMM to addon. The kernel takes unsigned int8 and signed int8 as inputs and performs GEMM operation. The intermediate accumulation and output are in signed int16. - The compute kernel will perform computation only if B matrix reordered to suit the usage of AVX2 instruction vpmaddubsw. - Kernel for packing the B matrix is provided. - LPGEMM bench code was modified to test the performance and accuracy of the new variant. AMD-Internal: [CPUPL-2171] Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b	2022-08-02 02:20:00 -04:00
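
Commits c3b97559c1, 63864d7dfb, 492555785a and 864170f5cb above describe the downscale step: x' = (x / scale_factor) + zero_point, with the result saturated to [-128, 127], round-to-nearest in the bench reference, and scale factor or zero point given either as a scalar or as a vector of length N. The following is a minimal reference sketch of that arithmetic for one row of accumulated output. It is not the library's kernel code: the function name `downscale_s8`, its argument layout, and the choice to round before adding the zero point are assumptions made for illustration.

```python
def downscale_s8(acc_row, scale_factor, zero_point):
    """Downscale one row of accumulated values to the int8 range.

    Hypothetical helper for illustration only (not an aocl_gemm API).
    scale_factor and zero_point may each be a scalar or a sequence
    whose length equals the row length (the N dimension).
    """
    n = len(acc_row)

    def broadcast(value):
        # A scalar applies to every column; a sequence must match N.
        if isinstance(value, (list, tuple)):
            if len(value) != n:
                raise ValueError("per-column values must have length N")
            return list(value)
        return [value] * n

    scales = broadcast(scale_factor)
    zero_points = broadcast(zero_point)

    out = []
    for x, scale, zp in zip(acc_row, scales, zero_points):
        # Python's round() rounds halves to even, matching nearbyintf
        # under the default floating-point rounding mode.
        # Assumption: rounding is applied before the zero point is added.
        q = round(x / scale) + zp
        # Saturate to the signed 8-bit range.
        out.append(max(-128, min(127, q)))
    return out
```

Passing a scalar, as in `downscale_s8(row, 2.0, 3)`, avoids allocating and filling a length-N array, which is the convenience commit 864170f5cb describes.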


182 Commits
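
The s8s8 commits in the log above (a72fff2be9 and bd8cd763ff) describe computing a signed-by-signed GEMM with instructions that expect an unsigned A operand: every element of A is shifted by +128 into the uint8 range, and the result is corrected afterwards with C = C - 128 * (column sum of B). The sketch below checks that identity with plain Python integers. The function names are hypothetical, and it ignores packing, blocking, and accumulator saturation entirely.

```python
def gemm_reference(a, b):
    """Plain integer matrix product, used as ground truth."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_s8s8_via_u8s8(a, b):
    """Signed x signed GEMM computed through an unsigned-A product.

    Hypothetical illustration of the compensation described in the
    commit messages; names and structure are not taken from the library.
    """
    m, k, n = len(a), len(b), len(b[0])

    # Column sums of B, which the commits say are computed during packing.
    col_sum = [sum(b[p][j] for p in range(k)) for j in range(n)]

    # Shift A from [-128, 127] into [0, 255].
    a_u8 = [[a[i][p] + 128 for p in range(k)] for i in range(m)]

    # Product with the shifted A: equals A*B + 128 * col_sum per column.
    c = gemm_reference(a_u8, b)

    # Remove the extra 128 * col_sum contribution.
    return [[c[i][j] - 128 * col_sum[j] for j in range(n)]
            for i in range(m)]
```

The correction works because (a + 128) * b summed over k equals the original dot product plus 128 times the sum of the corresponding column of B, so subtracting that term recovers the signed result exactly.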