amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 01:59:59 +00:00

Author	SHA1	Message	Date
mkadavil	3870792e62	Low precision gemm s32 downscale optimization. -The post operations attributes are moved to a new struct lpgemm_post_op_attr, and an object of this struct is passed to the low precision gemm kernels in place of the multiple parameters. -The u8s8s32s8 api (downscale api) performance is low when the k value is small (k < KC). Two scenarios are observed here: a. beta = 0: Currently, for downscale api, a temporary buffer is used to accumulate intermediate s32 output, so that it can be used in later iterations of pc loop (k dim). The usage of this buffer (store) can be avoided if k < KC. Here intermediate accumulation is not required, since after the first iteration of the pc loop, the output can be downscaled and stored. b. beta != 0: In this case the existing values of the original s8 C output matrix need to be converted to s32 and beta scaled. Currently the s8 values are converted to s32 and stored in temporary buffer in pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer is passed to the micro kernel and beta scaling is applied on this. However the mxNC block copy is costly and can be avoided if a new condition is introduced for beta scaling in the micro kernel, whereby the original s8 data is loaded instead of from the temporary buffer to a register, converted to s32 and beta scaling applied on it. AMD-Internal: [CPUPL-2884] Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c	2023-01-10 13:15:22 +05:30
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
eashdash	63864d7dfb	Added clipping while downscaling for u8s8s32os8 and u8s8s16os8. Clipping is done during the downscaling of the accumulated result from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8, to saturate the final output values between [-128,127] AMD-Internal: [LWPZENDNN-493] Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f	2022-10-11 07:28:06 -04:00
Harihara Sudhan S	492555785a	Fixed bench accuracy issue in LPGEMM Description: - When the s8 result values for u8s8s32 and u8s8s16 are close to 0, the values were getting ceiled to 1. - Used nearbyintf to round the downscaled values in bench reference. This fixed the result mismatch issue between the vectorized kernel implementation and reference implementation in bench accuracy test. AMD-Internal: [CPUPL-2617] Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82	2022-10-07 09:48:21 +00:00
eashdash	d21cd51fde	Accumulation type for alpha, beta values and BF16 bench integration 1. Correcting the type of alpha, and beta values from C_type (output type) to accumulation type. For the downscaled LPGEMM APIs, C_type will be the downscaled type but the required type for alpha and beta values should be the accumulation type. 2. BF16 bench integration with the LPGEMM bench for both the BF16 (bf16bf16f32of32 and bf16bf16f32obf16) APIs AMD-Internal: [CPUPL-2561] Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797	2022-09-23 05:00:49 -04:00
mkadavil	bf4d1da1b9	Column major input support for BFloat16 gemm. -The bf16 gemm framework is modified to swap input column major matrices and compute gemm for the transposed matrices (now row major) using the existing row-major kernels. The output is written to C matrix assuming it is transposed. -Framework changes to support leading dimensions that are greater than matrix widths. -Bench changes to test low precision gemm for column major inputs. AMD-Internal: [CPUPL-2570] Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495	2022-09-22 02:50:46 -04:00
Eleni Vlachopoulou	a5891f7ead	Adding AVX2 support for DNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [CPUPL-2551] Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c	2022-09-20 06:05:01 -04:00
mkadavil	9bc59cc500	Low Precision GEMM framework fixes for downscaling. - The temporary buffer allocated for C matrix when downscaling is enabled is not filled properly. This results in wrong gemm accumulation when beta != 0, and thus wrong output after downscaling. The C panel iterators used for filling the temporary buffer are updated to fix it. - Low precision gemm bench updated for testing/benchmarking downscaling. AMD-Internal: [CPUPL-2514] Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb	2022-09-13 07:42:29 -04:00
mkadavil	584069bf74	Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM. -Parametric ReLU is the generalization of leaky ReLU in which the leakage coefficient is tunable. The support for the same is added following the register-level fusion technique. -Low precision bench enhancement to check accuracy/performance of low precision gemm with PReLU. -Bug fixes in low precision gemm kernels. AMD-Internal: [CPUPL-2442] Change-Id: I81336405b185a994297d122b2d868b758ae6dad5	2022-08-25 13:33:02 +05:30
eashdash	4e3e00fb7e	Added low precision GEMM - bf16bf16f32of32 Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and performs BF16 GEMM operations. The intermediate accumulation and output are in float. 1. Compute kernels will perform computations only if the B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps 2. Kernel for packing B matrix is provided Change-Id: If5d08213068869eff060c9998596d2d2703a6793	2022-08-24 03:27:00 -04:00
mkadavil	6fbdfc3cf2	Low precision gemm refactoring and bug fixes. -The micro-kernel function signatures follow a common pattern. These functions can be represented as an instantiation of a MACRO as is done in BLIS, and thus the number of micro-kernel header files can be brought down. A new single header file containing all the MACRO definitions with the instantiation is added, and the existing unnecessary header files are removed. -The bias addition in micro-kernel for n remaining < 16 reads the bias array assuming it contains 16 elements. This can result in seg-faults, since out of bound memory is accessed. It is fixed by copying required elements to an intermediate buffer and using that buffer for loading. -Input matrix storage type parameter is added to lpgemm APIs. It can be either row or column major, denoted by r and c respectively. Currently only row major input matrices are supported. -Bug fix in s16 fringe micro-kernel to use correct offset while storing output. AMD-Internal: [CPUPL-2386] Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3	2022-08-14 17:39:00 +05:30
Harihara Sudhan S	d1eaf65a26	Post-Ops for u8s8s16os16 Functionality - Post-ops is an operation performed on every element of the output matrix after GEMM operation is completed. - Post-ops relu and bias added to all the compute kernels of u8s8s16os16 - Post-ops are done on the value loaded into the register to avoid reloading of C matrix elements - Minor bug fixes in openmp thread decorator of lpgemm - Added test cases to lpgemm bench input file AMD-Internal: [CPUPL-2171] Change-Id: If49f763fdfac19749f6665c172348691165d8631	2022-08-09 14:52:41 +05:30
mkadavil	828d3cd3d3	Post operations support for low precision gemm. - Low precision gemm is often used in ML/DNN workloads and is used in conjunction with pre and post operations. Performing gemm and ops together at the micro kernel level results in better overall performance due to cache/register reuse of output matrix. The provision for defining the post-operations and invoking the micro-kernel with it from the framework is added as part of this change. This includes adding new data structures/functions to define the post-ops to be applied and an extensible template using which new post-ops can easily be integrated. As for the post-operations, RELU and Bias Add for u8s8s32 is implemented in this first cut. - aocl_gemm bench modifications to test/benchmark RELU and Bias Add. AMD-Internal: [CPUPL-2316] Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18	2022-08-05 11:53:05 +05:30
Harihara Sudhan S	e5d4fc2a70	Added low precision GEMM (u8s8s16os16) Feature Addition : Added low precision GEMM to addon. The kernel takes unsigned int8 and signed int8 as inputs and performs GEMM operation. The intermediate accumulation and output are in signed int16. - The compute kernel will perform computation only if the B matrix is reordered to suit the usage of AVX2 instruction vpmaddubsw. - Kernel for packing the B matrix is provided. - LPGEMM bench code was modified to test the performance and accuracy of the new variant. AMD-Internal: [CPUPL-2171] Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b	2022-08-02 02:20:00 -04:00
Kiran Varaganti	6054b888fb	Fixed Bug in bench_trsm.c When bli_trsm() API is called, we make sure the "side" argument is "side_t" and not f77_char and argument is passed by value and not by its address. Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9	2022-07-25 15:38:30 +00:00
mkadavil	f63e699c08	Fix for segmentation fault in low precision gemm. - Low precision gemm sets thread meta data (lpgemm_thrinfo_t) to NULL when compiled without open mp threading support. Subsequently the code is executed as if it is single-threaded. However, when B matrix needs to be packed, communicators are required (irrespective of single or multi-threaded), and the code accesses lpgemm_thrinfo_t for the same without NULL check. This results in seg fault. For the fix, a non-open mp thread decorator layer is added, which creates a placeholder lpgemm_thrinfo_t object with a communicator before invoking the 5 loop algorithm. This object will be used for packing. - Makefile for compilation of aocl_gemm bench. AMD-Internal: [CPUPL-2304] Change-Id: Id505235c8421792240b84f93942ca62dac78f3dc	2022-07-21 11:51:40 +05:30
Chandrashekara K R	ff2ee0ae3f	AOCL-WINDOWS: Added the windows build system to build bench folder on windows. 1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf. Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2	2022-06-27 22:32:39 -04:00
mkadavil	6c112632a7	Low precision gemm integrated as aocl_gemm addon. - Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t). AVX512_vnni based micro-kernel for int8 gemm. Parallelization supported along m and n dimensions. - Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix is packing entire B matrix upfront before sgemm. It allows sgemm to take advantage of packed B matrix without incurring packing costs during runtime. - Makefile updates to addon make rules to compile avx512 code for selected files in addon folder. - CPU features query enhancements to check for AVX512_VNNI flag. - Bench for int8 gemm and sgemm with B matrix reorder. Supports performance mode for benchmarking and accuracy mode for testing code correctness. AMD-Internal: [CPUPL-2102] Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182	2022-06-09 10:28:38 -04:00
Nallani Bhaskar	2acb3f6ed0	Tuned aocl dynamic for specific range in dgemm Description: 1. Decision logic to choose optimal number of threads for given input dgemm dimensions under aocl dynamic feature was retuned based on latest code. 2. Updated code in a few files to avoid compilation warnings. 3. Added a min check for nt in bli_sgemv_var1_smart_threading function AMD-Internal: [ CPUPL-2100 ] Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02	2022-05-17 18:10:39 +05:30
mkurumel	ab06f17689	DGEMMT : Tuning SUP threshold to improve ST and MT performance. Details : - SUP Threshold change for native vs SUP - Improved the ST performances for sizes n<800 - Introduce PACKB in SUP to improve ST performance between 320<n<800 - 16T SUP Tuning for n<1600. AMD-Internal: [CPUPL-1981] Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda	2022-05-17 18:09:22 +05:30
Arnav Sharma	3190e547b0	Optimized AXPBYV Kernel using AVX2 Intrinsics Details: - Intrinsic implementation of axpbyv for AVX2 - Bench written for axpbyv - Added definitions in zen contexts AMD-Internal: [CPUPL-1963] Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539	2022-01-05 04:19:11 -05:00
Dipal M Zambare	8f310c3384	AOCL DTL - Added thread and execution time details in logs -- Added number of threads used in DTL logs -- Added support for timestamps in DTL traces -- Added time taken by API at BLAS layer in the DTL logs -- Added GFLOPS achieved in DTL logs -- Added support to enable/disable execution time and gflops printing for individual API's. We may not want it for all API's. Also it will help us migrate API's to execution time and gflops logs in stages. -- Updated GEMM bench to match new logs -- Refactored aocldtl_blis.c to remove code duplication. -- Clean up logs generation and reading to use spaces consistently to separate various fields. -- Updated AOCL_gettid() to return correct thread id when using pthreads. AMD-Internal: [CPUPL-1691] Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e	2021-11-12 08:58:54 +05:30
Nageshwar Singh	faeb79f2b9	Trsm bench utility mismatch between DTL logs and bench AOCL-Internal: [CPUPL-1585] Change-Id: I2896d695e6bb40ec39a4f840240499927de16962	2021-11-12 08:58:52 +05:30
Meghana Vankadari	1944de1cfa	Fixed a bug in Level-3 bench files Details: - BLIS has reserved rs = cs = 1 case only for 1x1 scalars. - For vectors, even though rs = cs = 1 is a valid input, BLIS adjusts the strides to satisfy the error checking. - For an mxn matrix, if m > 1 and n = 1, BLIS sets cs = m to indicate that this is a column vector stored in column major order. Similarly BLIS sets rs = n in case of m = 1 and n > 1. - So determining storage-scheme based on row-stride could lead to errors if one of the matrices becomes vector. - Modified bench files to determine storage scheme based on stor_scheme character instead of checking row-strides. Change-Id: Id2dc0ea11f0e549ce8e49eb2c393442b33851527	2021-06-22 10:38:11 +05:30
Nageshwar Singh	3002239f83	Added bench utility for swapv API AMD-Internal: [CPUPL-1591] Change-Id: I5619d402db49d1f325e4293f3be7a8bc0dde6f15	2021-06-09 17:05:00 +05:30
Nageshwar Singh	6ca50e1b72	Added bench utility for copyv API AOCL-Internal: [CPUPL-1591] Change-Id: I00ddad565cb87cd9371d7b1df2b57394fef437e0	2021-06-09 12:29:49 +05:30
Nageshwar Singh	6842c2a30e	Bench trsv logging error Details - Passing enum rather than char for uplo, transa, and diaga - Deleting log file, and other temp files, merged in the codebase from amax AOCL-Internal: [CPUPL-1591] Change-Id: Ife85a388b45659aa608a552d18a65fe828b046b2	2021-06-08 11:54:55 +05:30
Nageshwar Singh	61b7584580	Bench addition for amaxv API AOCL-Internal: [CPUPL-1591] Change-Id: Ia9754dfed1a7302d5c267858f9005c8f64e28b46	2021-06-04 17:45:04 +05:30
Nageshwar Singh	ecfbdd16a8	Added bench utility for trsv API AOCL-Internal: [CPUPL-1591] Change-Id: I5953e13e9c75f620987ea92d92d1b1d7b5bfd043	2021-06-04 08:05:37 -04:00
Meghana Vankadari	3804e301c9	Fixed a bug in Level-3 bench files where ldc = 1 Details: - To determine whether matrices are col-stored, we were checking ldc == 1. This is incorrect as a matrix can be col-stored with ldc = 1 if dimension is 1. - Modified the condition to check row_stride instead of col stride. if row-stride != 1, we can assume that matrices are not col-stored and ignore those inputs by printing an error message. Change-Id: Id4d5b971104eb11cbcdd6d22c5c620febefd3a87	2021-06-01 10:57:18 +05:30
Kiran Varaganti	492f54fb5e	Fix a bug in bench_gemm.c When op(A) or op(B) = transpose - the leading dimensions of these matrices were altered. Commented out the statements "if(transa) lda = ..." similarly for matrix B and corrected this mistake in both column and row storages. Added a provision to call BLIS interfaces when row-major inputs are used. Change-Id: Id2041af219a64567471c14190f283274d1df2f7f	2021-05-24 12:59:28 +05:30
Dipal M Zambare	5f53d14971	Added bench utility for dotv and scalv APIs. - Added bench utility for dotv and scalv API's - Corrected logging for scalv to handle complex types - Corrected logging to remove transpose field from dotv logs AOCL-Internal: [CPUPL-1577] Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07	2021-05-21 10:00:32 +05:30
Dipal Madhukar Zambare	dac15bdb3f	Merge "Added bench utility for ger API." into amd-staging-milan-3.1	2021-05-19 08:17:09 -04:00
Dipal M Zambare	413814fe70	Fixed crash issue in bench utility for gemv API - incx and incy were not considered while allocating memory for x and y vectors. - Updated test data set AMD-Internal: [CPUPL-1578] Change-Id: I374a75aaa66f951f0f8353434d94c135d09b2f05	2021-05-19 14:21:09 +05:30
Dipal M Zambare	0e82783f1c	Added bench utility for ger API. AOCL-Internal: [CPUPL-1577] Change-Id: Icc7a4590f605d7273077a7d2a42d4ecbafed2248	2021-05-19 14:05:01 +05:30
Nallani Bhaskar	a59796ef16	Updated leading dimensions for transpose case in gemm bench 1. Updated lda, ldb based on trans flags 2. Updated deriving storage type using leading dimension 3. Cleanup and alignment 4. Included transpose and row major cases in inputgemm.txt Change-Id: I25f5cd522eb64f212445d98f4682132bf5a330b6	2021-05-14 15:26:20 +05:30
Meghana Vankadari	a3600d395d	Added bench app for syrk - input is a log file generated from AOCL_DTL Change-Id: I25dd695dea267a89a5c666d66abc4b91a57956c8	2021-05-11 14:57:51 +05:30
Dipal Madhukar Zambare	2b80e8824a	Merge "Added bench utility for gemv API." into amd-staging-milan-3.1	2021-05-11 01:09:22 -04:00
Dipal M Zambare	08424e8896	Added bench utility for gemv API. AMD-Internal: [CPUPL-1558] Change-Id: Iaba1aa164fa589fa7f5047f314b26a24c4c2c3a7	2021-05-10 15:01:47 +05:30
Nageshwar Singh	a88cb82cec	Revert "Adding trans h support in bench_gemm.c" This reverts commit `791903b31c`. Change-Id: I24403cced67ea9e851adb58a8bf01a3e17bb4e85	2021-05-07 04:11:30 -04:00
Kiran Varaganti	433f17b6cd	bench_gemmt Bug Fix Fix reading input parameters Interchange the reading of n and k, first n appears and then k appears in the logs. Added comments to explain the format of the input gemmt log. Change-Id: I44c6081d4449ba210728bc089c4215d5eef18834	2021-05-06 14:54:15 +05:30
Meghana Vankadari	713ca659b5	Added bench app for gemmt - input is a log file generated from AOCL DTL Change-Id: Ia3390b529244f529d9741c86a6f8dc35a589f714	2021-04-19 09:40:24 +05:30
Nageshwar Singh	791903b31c	Adding trans h support in bench_gemm.c Change-Id: If340d515c38a593df26d5075e29685ef044601a5	2021-03-02 02:33:06 +05:30
Kiran Varaganti	80a516382e	Fixed wrong dimensions check in bench/bench_gemm.c application Verifying the valid values of m, n, k, lda, ldb and ldc is removed. Since the bench app is run on logs collected from AOCL traces. The correct way of checking should consider transpose parameter and storage order. Change-Id: If0fbf733c2650c6f328661293eb99d062685d638	2020-11-20 20:39:20 +05:30
bhaskarn	008fe49df6	Added bench application for trsm Description: Added bench_trsm.c to read inputs from AOCL DTL logs to benchmark Added sample input file Change-Id: I6806e42244bf775cbed457553ca07fb0222ef597	2020-11-09 13:06:39 -05:00
Kiran Varaganti	60642d98a3	Benchmark using AOCL Logs as input Added benchmark application for gemm - input is a log file generated from AOCL DTL from BLIS. Change-Id: I2ac7a3c48d5a37c5b24ec0f0cff7e7886dad0b99	2020-11-06 14:31:53 +05:30
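
Commits 63864d7dfb and 492555785a above describe two parts of the s32-to-s8 downscale: rounding the scaled value with nearbyintf and clipping the result to [-128, 127]. The following is a minimal Python sketch of that idea, not the BLIS kernel or bench code. The function name and the multiplicative `scale` parameter are assumptions made for illustration; the commit messages do not say how the scale is applied. Python's round() is used as a stand-in for nearbyintf in its default rounding mode (round to nearest, ties to even), and it operates on double precision rather than float.

```python
def downscale_s32_to_s8(acc: int, scale: float) -> int:
    """Scale an s32 accumulator, round it, and saturate it to the s8 range.

    Hypothetical helper: `scale` is an assumed multiplicative factor.
    """
    scaled = acc * scale
    # round() rounds to nearest with ties to even, which is what nearbyintf
    # does under the default floating-point rounding mode.
    rounded = round(scaled)
    # Clip (saturate) to the signed 8-bit range instead of wrapping around.
    return max(-128, min(127, rounded))


# Small values near zero round to 0 instead of being ceiled to 1.
near_zero = downscale_s32_to_s8(1, 0.25)
# Large magnitudes saturate at the s8 limits.
too_big = downscale_s32_to_s8(1000, 0.5)
too_small = downscale_s32_to_s8(-1000, 0.5)
```

Rounding to nearest in the reference path matters because a ceiling would turn every small positive value into 1, which is the mismatch that commit 492555785a reports between the bench reference and the vectorized kernel.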

46 Commits
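
Commit bf4d1da1b9 above describes handling column-major inputs by swapping the input matrices and reusing the row-major kernels on the transposed problem, with leading dimensions allowed to exceed the matrix widths. The sketch below illustrates that identity (C^T = B^T A^T) in plain Python with integer data. It is an illustration under assumptions, not the bf16 framework code: both function names and the flat-list buffer layout are hypothetical.

```python
def gemm_row_major(m, n, k, a, lda, b, ldb, c, ldc):
    """C[m x n] = A[m x k] * B[k x n], all stored as flat row-major lists."""
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i * lda + p] * b[p * ldb + j]
            c[i * ldc + j] = acc


def gemm_col_major(m, n, k, a, lda, b, ldb, c, ldc):
    """Column-major GEMM computed with the row-major routine.

    A column-major buffer read as row-major is the transposed matrix, so
    swapping A and B (and m and n) computes C^T = B^T * A^T, which lands in
    the C buffer exactly as column-major C.
    """
    gemm_row_major(n, m, k, b, ldb, a, lda, c, ldc)


# A is 3x2, column-major, with lda = 4 (one padding element per column).
a = [1, 3, 5, 0,
     2, 4, 6, 0]
# B is 2x2, column-major, ldb = 2.
b = [7, 9,
     8, 10]
# C is 3x2, column-major, ldc = 3.
c = [0] * 6
gemm_col_major(3, 2, 2, a, 4, b, 2, c, 3)
```

No data is copied or transposed in memory here; only the argument order changes, which is why the existing row-major kernels can be reused as they are.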