amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-12 18:15:37 +00:00

Author	SHA1	Message	Date
Vignesh Balasubramanian	84faccdd7d	Enabling the vectorized path for SNRM2_ - Enabled the vectorized AVX-2 code-path for SNRM2_. The framework queries the architecture ID and calls the vectorized kernel based on the architecture support. - In case of not having the architecture support, we use the default path based on the sumsqv method. AMD-Internal: [CPUPL-3277] Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f	2023-11-03 17:45:56 +05:30
Edward Smyth	d8b8f68066	Improvements to xerbla functionality (2) Improvements to functionality introduced in commit `6d0444497f`: - Call to bli_init_auto() before calling PASTEBLACHK macro in gemv caused significant runtime overhead. Initialize stored info_value directly. - Add similar code in frame/compat/f2c routines. AMD-Internal: [CPUPL-3520] Change-Id: I2df201aed7dbceb4cbe66d6c81b5a03e8092de89	2023-11-03 06:38:40 -04:00
Meghana Vankadari	f8f4343b55	Updated cntx with packA function pointer for AVX512_VNNI support Details: - Modified bench to support testing for sizes where matrix strides are larger than the corresponding dimensions. - Modified early-return checks in all interface APIs to check validity of strides in relation to the corresponding dimension rather than checking if strides are equal to dimensions. Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4	2023-11-03 04:50:00 -04:00
Vignesh Balasubramanian	ef545b928e	Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel - The call to the bli_saxpyf_zen_int_6( ... ) is explicitly present in the bli_gemv_unf_var2_amd.c file, as part of the bli_sgemv_unf_var2( ... ) function. This was changed to bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor from 6 to 5 ), in accordance to the function pointer present in the zen3 and zen4 context files. - Changed the accumulator type to double from float, inside the fringe loop for unit-strides(vectorized path) and non-unit strides (scalar code). AMD-Internal: [CPUPL-4028] Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e	2023-11-03 01:37:43 -04:00
Harihara Sudhan S	106342f402	ZGEMV optimization for special cases in beta - Avoiding scaling of y vector by beta when beta is 1. AMD-Internal: [CPUPL-3829] Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1	2023-11-02 08:21:46 -04:00
mkadavil	d1844678f4	LPGEMM <u\|s>8s8s16ou8 fixes for incorrect zero point addition. -The zero point data type is different based on the downscale data type. For int8_t downscale type, zero point type is int8_t whereas for uint8_t downscale type, it is uint8_t. During downscale post-op, the micro-kernels upscales the zero point from its data type (int8_t or uint8_t) to that of the accumulation data type and then performs the zero point addition. The accumulated output is then stored as downscaled type in a later storage phase. For the <u\|s>8s8s16 micro-kernels, the upscaling to int16_t (accumulation type) is always performed assuming the zero point is int8_t using the _mm256_cvtepi8_epi16 instruction. However this will result in incorrect upscaled zero point values if the downscale type is uint8_t and the associated zero point type is also uint8_t. This issue is corrected by switching between the correct upscale instruction based on the zero point type. AMD-Internal: [SWLCSG-2500] Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076	2023-11-02 01:30:48 -04:00
Nallani Bhaskar	b3391ef5da	Updated ERF threshold and packa changes in bf16 Description: 1. Updated ERF function threshold from 3.91920590400 to 3.553 to match with the reference erf float implementation which reduced errors a the borders and also clipped the output to 1.0 2. Updated packa function call with pack function ptr in bf16 api to avoid compilation issues for non avx512bf16 archs 3. Updated lpgemm bench [AMD-Internal: SWLCSG-2423 ] Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d	2023-10-29 23:55:46 +05:30
Edward Smyth	248dc2af9a	Implement AOCL_ENABLE_INSTRUCTIONS environment variable Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE. The details are: 1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both supported, with BLIS_ARCH_TYPE taking precedence if both are set. 2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4" code paths respectively in AMD focused builds, or for "skx" and "haswell" respectively in Intel focused builds. These names are not case-sensitive. 3. BLIS_ARCH_TYPE specifies the code path to use. If this is unsupported, e.g. zen4 code path on a Milan or earlier system, that code path is still executed, likely resulting in an illegal instruction error. 4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on the system (for AVX2 and AVX512), and try a "lower" ISA option if the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic. 5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set. AMD-Internal: [CPUPL-4105] Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315	2023-10-27 14:59:33 -04:00
Edward Smyth	834bf604c1	Option to use shared library for BLIS tests Current BLIS makefile always uses the static library on Linux for all BLIS test programs. This commit adds the option to use the shared library instead by specifying e.g. make checkblis USE_SHARED=yes Executables are generated in different sub-directories for static and shared libraries. AMD-Internal: [CPUPL-4107] Change-Id: I3ab5d505cfbc5f6ef47aa28fcbb846c52d56c3f2	2023-10-27 11:32:00 -04:00
Shubham Sharma	d45d1d68c6	Reset ZMM Registers before exiting, in L3 APIs - Register ZMM16 to ZMM31 are zeroed after L3 api calls. - This change is done only for ZEN4 code path. - bli_zero_zmm function is added which resets these registers. AMD-Internal: [CPUPL-3882] Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8 (cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)	2023-10-27 00:51:04 -04:00
Harsh Dave	7bcb701b79	Fixed functionality failure for dgemm tiny kernel. - For k > KC, C matrix is getting scaled by beta on each iteration. It should be scaled only once. Fixed the scaling of C matrix by beta in K loop. - Corrected A and B matrix buffer offsets, for cases where k > KC. AMD-Internal: [CPUPL-4078] AMD-Internal: [CPUPL-4079] AMD-Internal: [CPUPL-4081] AMD-Internal: [CPUPL-4080] AMD-Internal: [CPUPL-4087] Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa	2023-10-26 15:11:59 +05:30
Meghana Vankadari	ac3e8ff01b	Bug fix and enhancements in bf16bf16f32obf16\|f32 Details: - Updated pack function call in ic loop to accept correct params. - Modified documentation in bench file to reflect updated usage of bench for downscaled APIs. - Modified memory allocation for C panel in BF16 APIs to use BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool. Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5	2023-10-25 23:28:31 +05:30
mkadavil	26d1ab5ebc	<u\|s>8s8s<16\|32>os8 memory allocation fix to circumvent scaling issue. -When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_ <A_BLOCK\|B_PANEL\|C_PANEL>, the memory is allocated by checking out a block from an internal memory pool. In order to ensure thread safety, the memory pool checkout is protected using mutex (bli_pba_lock/ bli_pba_unlock). When the number of threads trying to checkout memory (in parallel) are high, these locks tend to become a scaling bottleneck, especially when the memory is to be used for non-packing purposes (packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is enabled for temporary C accumulation. This multi-threaded lock overhead becomes prominent when m/n dimensions are relatively small, even when k is large. In order to address this, bli_pba_acquire_m is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is allocated using aligned malloc instead of checking out from memory pool. Experiments have shown malloc costs to be far lower than memory pool guarded by locks, especially for higher thread count. -LPGEMM bench fixes for crash observed when benchmarking with post-ops enabled and no downscale. AMD-Internal: [SWLCSG-2354] Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50	2023-10-23 10:00:32 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
Arnav Sharma	c1612f6838	Gtestsuite Framework and Unit Tests for Pack and Compute Extension APIs - Added framework for unit testing of BLAS and CBLAS interfaces for the Pack and Compute Extension APIs. - These test the integrated functionality of the trio of ?gemm_pack_get_size(), ?gemm_pack() and ?gemm_compute() APIs. - Note: Only MKL can be used as reference for now. AMD-Internal: [CPUPL-3560] Change-Id: I801654447a716da06c9ccf9db01d553817871571	2023-10-16 09:35:42 -04:00
Edward Smyth	6d0444497f	Improvements to xerbla functionality The following improvements have been implemented: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but don't stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 Implementation details: - Values of the environment variables are stored and retrieved from global_rntm. - Info value is stored and retrieved from tl_rntm. It is set to 0 during initialization for all calls and updated by xerbla if an error has occurred. - Call to bli_init_auto before calling PASTEBLACHK macro (which calls xerbla) will reinitialize info_value to 0 via call to bli_thread_update_rntm_from_env AMD-Internal: [CPUPL-3520] Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d	2023-10-16 08:48:51 -04:00
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Vignesh Balasubramanian	81161066e5	Multithreading the DNRM2 and DZNRM2 API - Updated the bli_dnormfv_unb_var1( ... ) and bli_znormfv_unb_var1( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Added the logic to distribute the job among the threads such that only one thread has to deal with fringe case(if required). The remaining threads will execute only the AVX-2 code section of the computational kernel. - Added reduction logic post parallel region, to handle overflow and/or underflow conditions as per the mandate. The reduction for both the APIs involve calling the vectorized kernel of dnormfv operation. - Added changes to the kernel to have the scaling factors and thresholds prebroadcasted onto the registers, instead of broadcasting every time on a need basis. - Non-unit stride cases are packed to be redirected to the vectorized implementation. In case the packing fails, the input is handled by the fringe case loop in the kernel. - Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... ) and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe cases of size = 2 ( and ) size = 1 or non-unit strides respectively. AMD-Internal: [CPUPL-3916][CPUPL-3633] Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d	2023-10-16 07:26:27 -04:00
Harsh Dave	7a4f84fbac	Optimized dgemm for tiny input sizes. - This commit focused on enhancing the performance of dgemm for matrices for very small dimenstions. - blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing the conventional SUP framework code path. As SUP framework code path requires the creation and initilization of blis objects, accessing all the needed meta-information from objects, querying contexts which adds performance penaulty while computing for matrices with very small dimensions. - To avoid such performance penaulty blis_dgemm_tiny function implements a lightweight support code so that it can re-use dgemm SUP kernels such a way that it directly operates on input buffers. It avoids framework overhead of creating and intializing blis objects, context intialization, accessing other large framework data structures. - blis_dgemm_tiny function checks for threshold condition to match before picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500. While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is invoked. -blis_dgemm_tiny function supports single threaded computation as of now. AMD-Internal: [CPUPL-3574] Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f	2023-10-16 05:52:49 -04:00
Harsh Dave	edbbfd9a86	Optimized AVX512 DGEMM SUP edge kernels - For edge kernels which handles the corner cases and specially for cases where there is really small amount of computation to be done, executing FMA efficiently becomes very crucial. - In previous implementation, edge kernels were using same, limited number of vector register to hold FMA result, which indirectly creates dependency on previous FMA to complete before CPU can issue new FMA. - This commit address this issue by using different vector registers that are available at disposal to hold FMA result. - That way we hold FMA results in two sets of vector registers, so that sub-sequent FMA won't have to wait for previous FMA to complete. - At the end of un-rolled K loop these two sets of vector registers are added together to store correct result in intended vector registers. - Following kernels are modified: bli_dgemmsup_rv_zen4_asm_24x4m, bli_dgemmsup_rv_zen4_asm_24x3m, bli_dgemmsup_rv_zen4_asm_24x2m, bli_dgemmsup_rv_zen4_asm_24x1m, bli_dgemmsup_rv_zen4_asm_24x1, bli_dgemmsup_rv_zen4_asm_16x1, bli_dgemmsup_rv_zen4_asm_8x1, bli_dgemmsup_rv_zen4_asm_24x2, bli_dgemmsup_rv_zen4_asm_16x2, bli_dgemmsup_rv_zen4_asm_8x2, bli_dgemmsup_rv_zen4_asm_24x3, bli_dgemmsup_rv_zen4_asm_16x3, bli_dgemmsup_rv_zen4_asm_8x3, bli_dgemmsup_rv_zen4_asm_16x4, bli_dgemmsup_rv_zen4_asm_8x4, bli_dgemmsup_rv_zen4_asm_16x5, bli_dgemmsup_rv_zen4_asm_8x5, bli_dgemmsup_rv_zen4_asm_16x6, bli_dgemmsup_rv_zen4_asm_8x6, bli_dgemmsup_rv_zen4_asm_8x7, bli_dgemmsup_rv_zen4_asm_8x8 AMD-Internal: [CPUPL-3574] Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44	2023-10-16 04:03:56 -04:00
Eleni Vlachopoulou	46459a958d	Updating BLIS C++ interface trsm test. - Making A diagonally dominant to ensure that the problem at hand is solvable. AMD-Internal: [CPUPL-3575] Change-Id: I27cc76a212d4d10aacce880895e1e0d7532e4eb7	2023-10-16 03:59:52 -04:00
Shubham Sharma	9a2a4151ac	Added improved ZTRSM AVX2 kernels - Added 2x6 ZGEMM row-preferred kernel. - Kernel supports prefetch_a, prefetch_b, prefetch_a_next and prefetch_b_next. - Multiple Ways to prefetch c are supported. - prefetch_a and prefetch_c are enabled by default. - K loop is divided into multiple subloops for better c prefetch. - Added 2x6 ZTRSM row-preferred lower and upper kernels using AVX2 ISA. - These kernels are used for ZTRSM only, zgemm still uses 3x4 kernel. - Kernels support row/col/gen storage. - Updated the zen3 and zen4 config to enable use of these kernels for TRSM in zen3 and zen4 path. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73	2023-10-13 07:43:33 -04:00
Harihara Sudhan S	105de694cf	Optimized ZGEMV variant 1 - Added an explicit function definition for ZGEMV var 1. This removes the need to query the context for Zen architectures. - Added a new INSERT_GENTFUNC to generate the definition only for scomplex type. - Rewrote ZDOTXF kernel and added the function name for ZDOTV instead of querying it. - With this change fringe loop is vectorized using SSE instructions. AMD-Internal:[CPUPL-3997] Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3	2023-10-13 05:03:53 -04:00
Meghana Vankadari	eb5ab3f762	LPGEMM: Added transB support for bf16bf16f32o<bf16\|f32> APIs Details: - Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs to allow reordering from column major input matrix. - Added new pack kernels that packs/reorders B matrix from column-major input format. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [CPUPL-2268] Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c	2023-10-12 23:36:18 +05:30
mkadavil	ea0324ab95	Multi data type downscaling support for u8s8s16 - u8s8s16<u8\|s8> Downscaling is used when GEMM output is accumulated at a higher precision and needs to be converted to a lower precision afterwards. Currently the u8s8s16 flavor of api only supports downscaling to s8 (int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at int16_t. LPGEMM is modified to support downscaling to different data types, like u8, s16, apart from s8. The framework (5 loop) passes the downscale data type to the micro-kernels. Within the micro-kernel, based on the downscale type, appropriate beta scaling and output buffer store logic is executed. This support is only enabled for u8s8s16 flavor of api's. The LPGEMM bench is also modified to support passing downscale data type for performance and accuracy testing. AMD-Internal: [SWLCSG-2313] Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30	2023-10-12 09:19:56 -04:00
Vignesh Balasubramanian	a6a67fea2d	ZAXPBYV optimizations for handling unit and non-unit strides - Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational logic. The kernel performs two different sets of compute based on the value of alpha, for both unit and non-unit strides. There are no constraints on beta scaling of the 'y' vector. - Updated the logic to support 'x' conjugate in the computation. The kernel supports conjugate/no conjugate operation through the usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... ) intrinsics. - Updated the early return condition in the kernel to adhere to the standard compliance. - Updated the scalar computation with vector computation(using 128 bit registers), in case of dealing with a single element(fringe case) in unit-stride or vectors with non-unit strides. A single dcomplex element occupies 128 bits in memory, thereby providing scope for this optimization. - Added accuracy and extreme value testing with sufficient sizes and initializations, to test the required main and fringe cases of the computation. AMD-Internal: [CPUPL-3623] Change-Id: I7ae918856e7aba49424162290f3e3d592c244826	2023-10-12 06:31:08 -04:00
Meghana Vankadari	3a71550bc3	Enabling SUP blocksizes & kernels for generic config Details: - pack and compute extension APIs derive blocksizes(MR, NR...) from SUP cntx. - SUP blocksizes are not set for generic/skx configs. As a result pack and compute APIs cause floating point exceptions. - To fix these issues, we have enabled non-zero SUP blocksizes for generic config and zen4 SUP blocksizes for skx config. - However, these changes will not enable SUP path for skx/generic config as thresholds are set to zero. - To enable SUP path for skx config, more work is needed like non-zero thresholds and modifications to build system. Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9	2023-10-12 05:27:10 -04:00
bhaskarn	5fd24c27a7	Updated expf max min precission fix nan issue in Tanh Description: The expf_max and expf_min have more precission than the computation which is leading to corss the clipping at the edge case which is causing nan's in the tanh output. Updated the thresholds to less precission to clip the edge cases to avoid nan's in the tanh output. AMD-Internal: [SWLCSG-2423 ] Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29	2023-10-12 01:04:59 -04:00
Meghana Vankadari	4874895a68	LPGEMM: Added transA support for bf16bf16f32o<bf16\|f32> APIs Details: - Added new params(order, trans) to aocl_get_reorder_buf_size_ and aocl_reorder_ APIs. - Added new pack kernels that packs A matrix from either row-major or column major input matrix to pack buffer with row-major format. - Updated cntx with pack kernel function pointers for packing A matrix. - Transpose of A matrix is handled by packing A matrix to row-major format during run-time. - Updated Early-return check conditions to account for trans parameters. - Updated bench file to test/benchmark transpose support. AMD-Internal: [SWLCSG-2268, SWLCSG-2442] Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4	2023-10-11 07:16:08 -04:00
Shubham Sharma	25bab76f58	Changed threshold to use DTRSM Small MT - Threshold to use DTRSM small MT code path is lowered. AMD-Internal: [CPUPL-3781] Change-Id: Ie1f232aa6d216b839df23657b54edb0448a64267	2023-10-11 01:33:00 -04:00
Chandrashekara K R	6132194468	Updated "HEADER_PATH" in main CMakeLists.txt file. - Updated "HEADER_PATH" cmake variable to make sure blis header file is not missing function declarations on Windows. AMD-Internal: [CPUPL-3881] Change-Id: Id71ec16c800411cd727fc78e3f772ea1b751f971	2023-10-10 08:44:07 -04:00
Vignesh Balasubramanian	9828039030	Bugfix : Inversion of sign bit with early return in SNRM2_ - The bli_snormfv_unb_var1( ... ) function returns early in case of n = 1, and uses the blis macro bli_fabs( ... ) to set the norm to the absolute value of the element. - This macro inverts the sign bit even if the element is 0.0. A check is added to re-invert the sign bit in this case, so that the norm is set to 0.0 instead of -0.0. - Added the same early exit condition on bli_dnormfv_unb_var1( ... ) when n = 1. AMD-Internal: [CPUPL-3923] Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33	2023-10-10 04:21:09 -04:00
mkadavil	c3b97559c1	Zero Point support for <u\|s>8s8s<32\|16>os8 LPGEMM APIs -Downscaled / quantized value is calculated using the formula x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels for these APIs only support scaling. Zero point addition is implemented as part of this commit, with it being fused as part of the downscale post-op in the micro-kernel. The zero point input is a vector of int8 values, and currently only vector based zero point addition is supported. -Bench enhancements to test/benchmark zero point addition. AMD-Internal: [SWLCSG-2332] Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea	2023-10-10 12:05:47 +05:30
Edward Smyth	85f2bf6c4a	Fix for x86_64 builds Configuration x86_64 includes all Intel and AMD sub-configurations. Fixes to enable this to work correctly again are: - In config_registry use amdzen rather than amd64 in x86_64 family. - Copy settings from config/amdzen/bli_family_amdzen.h to config/x86_64/bli_family_x86_64.h - Modify configure to set enable_aocl_zen=yes for x86_64, but not for amd64_legacy. - Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are enabled. Note: sub-configurations knl and bulldozer use instructions that are not supported on most x86_64 processors. AMD-Internal: [CPUPL-3838] Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04	2023-10-09 07:24:21 -04:00
jagar	5d578684ea	GtestSuite: Update in source code to make it compatible on MSVC(windows) AMD-Internal: [CPUPL-2732] Change-Id: Ifd9372bf9b0f00c2bf24442ea8519bfcf4e5db5b	2023-10-09 04:43:29 -04:00
jagar	712a84d50f	Gtestsuite: Update in cmake to search reflib in given path AMD-Internal: [CPUPL-2732] Change-Id: Ide2b98a95f81f394c7c01cc3a3b5ae6fa0403a82	2023-10-05 05:39:27 -04:00
eashdash	30bdeecbcc	Added BLAS Extension APIs - Get Size and Pack API 1. 4 new APIs are added to support packed compute GEMM operations 1. dgemm_pack_get_size 2. sgemm_pack_get_size 3. dgemm_pack 4. sgemm_pack 2. Pack_get_size API 1. Returns size in bytes required for packing of input 2. Requires identifier to identify the input matrix to be packed 3. Additionally requires 3 integer parameters for input dimensions 3. Packed buffer is allocated using the pack size computed 4. Pack API: 1. Performs full matrix packing of the input 2. Additionally, performs the alpha scaling 3. Packed buffer created contains the full packed matrix 5. The GEMM compute calls are required to be operated on the packed buffer with alpha = 1 since alpha scaling is already done by the Pack API 6. GEMM Pack API eliminate the cost of packing the input matrixes by avoiding on the go pack in the GEMM 5 loop. Packing of input matrixes are done when there is resue of matrixes across different GEMM calls. AMD-Internal: [CPUPL-3560] Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3	2023-10-04 06:43:59 -04:00
Edward Smyth	24e4d58f92	Tidy zen bli_cntx_init and bli_family files Tidy formatting of config/zen/bli_cntx_init_zen.c and config/zen/bli_family_.c files to make them more consistent with each other and improve readability. AMD-Internal: [CPUPL-3519] Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b	2023-10-04 05:14:39 -04:00
Harsh Dave	df80f40ccd	Fixed incorrect ymm registers usage in FMA operation. - Incorrect ymm registers were used in dgemm SUP edge kernel, while computing FMA operation. - Due to incorrect vector register, it resulted into incorrect result. - Corrected vector registers usage for FMA operation. AMD-Internal: [CPUPL-3964] Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df	2023-10-02 03:20:44 -04:00
Arnav Sharma	f0416cff08	SGEMM SUP Panel Stride Bug Fix - The AVX512 SGEMM SUP rv m and n kernels did not accomodate for the use of panel strides in case of packed matrices, thus resulting in incorrect matrix strides when packing was explicitly enabled using BLIS_PACK_A=1, BLIS_PACK_B=1 or both. - The kernels are updated to use panel strides for traversing both A and B matrix buffers accurately. [AMD-Internal]: CPUPL-3673 Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb	2023-09-27 03:02:24 -04:00
Kiran Varaganti	db4fbfe9a6	Fix compiler error for "inline" functions in LPGEMM bench Application Functions which are declared as "inline" may trigger compiler error "undefined function" This linker error is eliminated by use "static" before "inline". Therefore added "static" before all inline functions. Change-Id: I5952fb71112fc4792011c3e29be930ccfbce4562	2023-09-27 02:26:23 -04:00
jagar	29711dd5a3	Gtestsuite: Updated testings_basics.* to print matrix/vector name AMD-Internal: [CPUPL-2732] Change-Id: I89b4ffc97ea852e66f42b82058af67c16144fbf6	2023-09-26 08:27:19 -04:00
jagar	f96e20b894	Update in CMakeLists.txt to install on windows Updated CMakeLists.txt to copy library and headers into folder mentioned during cmake configuration. Steps to install 1. cmake .. -G ........ -DCMAKE_INSTALL_PREFIX=path_to_install 2. cmake --build . --config Release 3. cmake --install . (install lib and headers) Change-Id: Ic2728209a2e1d181cc92bab08b82a748bec583d4	2023-09-26 07:56:03 -04:00
Harsh Dave	e437469a99	Optimized AVX2 DGEMM SUP edge kernels - For edge kernels which handles the corner cases and specially for cases where there is really small amount of computation to be done, executing FMA efficiently becomes very crucial. - In previous implementation, edge kernels were using same, limited number of vector register to hold FMA result, which indirectly creates dependency on previous FMA to complete before CPU can issue new FMA. - This commit address this issue by using different vector registers that are available at disposal to hold FMA result. - That way we hold FMA results in two sets of vector registers, so that sub-sequent FMA won't have to wait for previous FMA to complete. - At the end of un-rolled K loop these two sets of vector registers are added together to store correct result in intended vector registers. AMD-Internal: [CPUPL-3574] Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54	2023-09-22 03:43:47 -04:00
Edward Smyth	ccb8dd26fd	Compiler warnings when using --int-size=32 Correct compiler warnings when building with configure --int-size=32 - bla_imatcopy.c: Cast ints to longs to match %ld format specification in error printf statement and change this to fprintf to stderr. Also copy this additional fprintf statement to other variants of this function. - bli_type_defs.h: siz_t should always be the same size as a pointer. This corrects an issue in bli_malloc.c when casting from a pointer to a siz_t integer value. AMD-Internal: [CPUPL-3519] Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab	2023-09-19 06:08:19 -04:00
Edward Smyth	15f9a747af	Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore bli_gemmt_sup_var1n2m.c as of commit `10ca8710f0` as variant for non-AMD codepath builds. AMD-Internal: [CPUPL-3838] Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f	2023-09-13 16:03:53 +05:30
Edward Smyth	0d16d952dc	BLIS: DTL enhancements Several improvements to BLIS DTL functionality - For APIs that report performance statistics, test for time=0.0 before dividing by time when calculating GFLOPS. - Call AOCL_DTL_TRACE_EXIT in the parameter checking functions inlined from ./frame/compat/check/bla_*_check.h - Correct flop count for complex routines. AMD-Internal: [CPUPL-3736] Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768	2023-09-11 08:42:11 -04:00
orequest	09e34fd2bd	Added optimised CGEMM function pointers in zen4 cntx 1. Two CGEMM function pointers are added for different storage schemes 1. bli_cgemmsup_rv_zen_asm_3x8m 2. bli_cgemmsup_rv_zen_asm_3x8n 2. In previous commit: (Level-3 triangular routines now use different block sizes and kernels Commit Id: `79e174ff0a`) 1. bli_cntx_set_l3_sup_tri_kers cntx function was created 2. Function holds optimised function pointers for GEMMT/SYRK API's 3. It avoids over riding default block sizes which improves the performance 4. This function did not include optimised CGEMM function pointers leading to regression as reference kernels were invoked 3. With this commit, 2 optimized CGEMM function pointers are added in bli_cntx_set_l3_sup_tri_kers 1. This fixes the regression as optimized CGEMM functions are invoked AMD-Internal: [CPUPL-3831] [CPUPL-3830] Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e	2023-09-11 06:38:31 -04:00
Edward Smyth	9d19effec5	Fix for non-x86 builds: cpuid query functions Ensure functions bli_cpuid_query_id() and bli_cpuid_query_model_id() are defined for all architectures in bli_cpuid.c AMD-Internal: [CPUPL-3838] Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e	2023-09-11 04:19:14 -04:00
Vignesh Balasubramanian	32104c400c	GTestSuite : Designing test cases for ZGEMM - Designed test cases for unit testing of ZGEMM compute kernel for handling inputs when k == 1. The design uses value-parameterized testing for checking accuracy, and verifying the mandate in case of exception values on the inputs/output. - The design uses type-parameterized testing for verifying BLAS standard for invalid input cases, and also for early return scenarios. - Added the function template set_ev_mat( ... ) as part of testinghelpers. This function is used as a helper for inducing exception values onto indices specified as arguments to the test_gemm( ... ) interface. - Abstracted the function definition of getValueString( ... ) from the NRM2 testing interface to testinghelpers(renamed as get_value_string( ... ) for naming consistency), in order to use it as a helper function across all APIs in case of exception value testing. AMD-Internal: [CPUPL-3823] Change-Id: I0fea21f9c8759bbbdc88ba0a016202753e28f2a7	2023-09-08 17:36:57 +05:30

1 2 3 4 5 ...

3063 Commits