amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-29 18:57:23 +00:00

Author	SHA1	Message	Date
Edward Smyth	eee3fe1b54	GTestSuite: nested parallelism tests Optionally enable parallelism inside gtestsuite so that we can check BLIS functions perform correctly when nested parallelism is in operation. Enable with: cmake ... -DOPENMP_NESTED={0,1,2,1diff} where in gtestsuite - 0 is the default choice with no parallelism. - 1 and 2 are simple nested parallelism. - 1diff has one level of parallelism setting different numbers of threads to be used by BLIS and reference library calls from each gtestsuite thread. Note: OMP_NUM_THREADS must be set appropriately to enable or disable parallelism at each level in the test programs as desired. OMP_NUM_THREADS will also define the parallelism used within the BLIS library (if it is multithreaded), unless BLIS-specific ways of specifying parallelism have been used. If a BLIS-specific parallelism option has been set, the same mechanism will be used in the 1diff option to vary the number of threads in BLIS per application thread. AMD-Internal: [CPUPL-3902] Change-Id: I89f9edb4125c64ef03e025a9f6ccb84960ba8771	2025-02-07 08:49:25 -05:00
Edward Smyth	1f0fb05277	Code cleanup: Copyright notices (2) More changes to standardize copyright formatting and correct years for some files modified in recent commits. AMD-Internal: [CPUPL-5895] Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12	2025-02-07 05:41:44 -05:00
Arnav Sharma	5a4739d288	DGEMV NO_TRANSPOSE Optimizations and Unit Tests - Added 32x3n n-biased kernels to directly handle the cases where n=3 which were earlier being handled by the primary n-biased, 32x8n, kernel. - Modified the n-biased fringe kernels to further handle the smaller m-fringe cases. Thus, now the kernels handle the following range of m for any value of n: - 16x8n : m = [16, 31) - 8x8n : m = [8, 15) - m_leftx8n : m = [1, 7] - Updated the function pointer map for n-biased kernels with added granularity to invoke the smaller fringe cases directly on the basis of m-dimension. - Added micro-kernel unit tests for all the dgemv_n kernels. AMD-Internal: [CPUPL-6231] Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119	2025-02-06 18:52:32 +05:30
Shubham Sharma	f8c83fedb6	Added new ZTRSM small code path for ZEN5 - Added new ZTRSM kernels for right and left variants. - Kernel dimensions are 12x4. - 12x4 ZGEMM SUP kernels are used internally for solving GEMM subproblem. - These kernels do not support conjugate transpose. - Only column major inputs are supported. - Tuned thresholds to pick efficient code path for ZEN5. AMD-Internal: [CPUPL-6356] Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e	2025-02-06 18:01:10 +05:30
Hari Govind S	3d2653f1ab	DDOTV Optimization for ZEN3 Architecture - Reduced the blocking size of 'bli_ddotv_zen_int10' kernel from 40 elements to 20 elements for better utilization of vector registers. - Replaced redundant 'for' loops in 'bli_ddotv_zen_int10' kernel with 'if' conditions to handle remainder iterations, as only a single iteration is used when the remainder is less than the primary unroll factor. - Added a conditional check to invoke the vectorized DDOTV kernels directly (fast-path), without incurring any additional framework overhead. - The fast-path is taken when the input size is ideal for single-threaded execution. Thus, we avoid the call to bli_nthreads_l1() function to set the ideal number of threads. - Updated gtestsuite ukr tests for 'bli_ddotv_zen_int10' kernel. AMD-Internal: [CPUPL-4877] Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2	2025-02-04 06:01:04 -05:00
Edward Smyth	d9cabce0ba	GTestSuite: Catch BLIS version executable errors In case the executable to obtain the BLIS library version fails, catch and report common errors to help with debugging. Also correct the test for bli_info_get_info() support to mark that it is not available in any AOCL version <= 4.1 AMD-Internal: [CPUPL-4500] Change-Id: Ie8f728b49faa60e0469562dbf77d67f86b415cd8	2025-01-28 16:54:05 +05:30
Vignesh Balasubramanian	fb6dcc4edb	Support for Tiny-GEMM interface(ZGEMM) - As part of AOCL-BLAS, there exists a set of vectorized SUP kernels for GEMM, that are performant when invoked in a bare-metal fashion. - Designed a macro-based interface for handling tiny sizes in GEMM, that would utilize these kernels. This is currently instantiated for 'Z' datatype(double-precision complex). - Design breakdown : - Tiny path requires the usage of AVX2 and/or AVX512 SUP kernels, based on the micro-architecture. The decision logic for invoking tiny-path is specific to the micro-architecture. These thresholds are defined in their respective configuration directories(header files). - List of AVX2/AVX512 SUP kernels(lookup table), and their lookup functions are defined in the base-architecture from which the support starts. Since we need to support backward compatibility when defining the lookup table/functions, they are present in the kernels folder(base-architecture). - Defined a new type to be used to create the lookup table and its entries. This type holds the kernel pointer, blocking dimensions and the storage preference. - This design would only require the appropriate thresholds and the associated lookup table to be defined for the other datatypes and micro-architecture support. Thus, it is extensible. - NOTE : The SUP kernels that are listed for Tiny GEMM are m-var kernels. Thus, the blocking in framework is done accordingly. In case of adding the support for n-var, the variant information could be encoded in the object definition. - Added test-cases to validate the interface for functionality(API level tests). Also added exception value tests, which have been disabled due to the SUP kernel optimizations. AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799] Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956	2025-01-24 12:59:26 -05:00
Hari Govind S	106a2b1fe1	Gtestsuite: UKR test for GEMV kernels - Added support for gemv kernels unit test in gtestsuite. - Added micro-kernel tests and memory tests for DGEMV transpose case kernels. AMD-Internal: [CPUPL-5835] Change-Id: I7d2d3cdbfea436f6c9b2cce9f2e85bfc5c51f201	2025-01-24 05:09:33 -05:00
Vignesh Balasubramanian	a80436ab21	Standardizing the EVT compliance of {S/D}AMAXV API - Updated the existing AVX2 {S/D}AMAXV kernels to comply with the standard when having exception values. This makes them exhibit the same behaviour as their AVX512 variants. Provided additional optimizations with loop unrolling. - Removed redundant early return checks inside the kernels, since they have been abstracted to a higher layer. - Updated the unit-tests(micro-kernel) and exception value tests for appropriate code-coverage. Also re-enabled the exception value tests. AMD-Internal: [CPUPL-4745] Change-Id: I36c793220bd4977a00281af9737c51cd1e5c60d9	2025-01-13 06:56:31 -05:00
Edward Smyth	0ae5a0492f	GTestSuite: fix to ukr tests for dgemm avx512 8x24 kernels - Restore test for old bli_dgemm_zen4_asm_8x24 kernel, so that we can test this if linking with older AOCL versions. - Move K_bli_dgemm_avx512_asm_8x24 definition from AOCL_42 list to AOCL_50 list. AMD-Internal: [CPUPL-4500] Change-Id: Id522f4bc5b89e86f77c4e1d26c75e261736ab450	2025-01-10 12:33:15 -05:00
Vignesh Balasubramanian	cdaa2ac7fd	Bugfix and optimizations for AVX512 AMAXV micro-kernels - Bug : The current {S/D}AMAXV AVX512 kernels produced incorrect results with multiple absolute maximums. They returned the last index when having multiple occurrences, instead of the first one. - Implemented a bug-fix to handle this issue on these AVX512 kernels. Also ensured that the kernels are compliant with the standard when handling exception values. - Further optimized the code by decoupling the logic to find the maximum element and its search space for index. This way, we use lower-latency instructions to compute the maximum first. - Updated the unit-tests, exception value tests and early return tests for the API to ensure code-coverage. AMD-Internal: [CPUPL-4745] Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f	2025-01-07 22:56:20 +05:30
harsdave	54b46ec1ed	Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop efficiency and edge kernel performance. The following technical improvements have been implemented: 1. IR Loop Optimization: - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated with `begin_asm` and `end_asm` calls, resulting in more efficient execution. 2. JR Loop Integration: - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead of stack frame management for each JR iteration, thereby enhancing loop performance. 3. Kernel Decomposition Strategy: - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1. - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently. 4. Interleaved Scaling by Alpha: - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline and reduce latency. 5. Efficient Mask Preparation: - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary, minimizing unnecessary overhead. 6. Broadcast Instruction Optimization: - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse, the broadcast instruction is replaced with `mem_1to8`. - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding dependency chains and improving execution efficiency. 7. C Matrix Update Optimization: - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers. This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating performance bottlenecks and enhancing throughput. These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication operations. This patch also involves changes for tiny gemm interface. A light interface for calling kernels and removing calls to avx2 dgemm kernels as we use avx512 dgemm kernels for all the sizes for zen4 and zen5. For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have the support to handle such inputs and thus such inputs are routed to gemm_small path. AMD-Internal: [CPUPL-6054] Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a	2024-12-13 00:03:00 -05:00
Arnav Sharma	25e59fcbb9	DGEMV Optimizations for NO_TRANSPOSE Cases - AVX512 specific DGEMV native kernels are added for Zen4/5 architectures to handle the NO_TRANSPOSE cases and are independent of the AXPYF fused kernels. - The following set of kernels biased towards the n-dimension perform beta scaling of y vector within the kernel itself and handle cases where n is less than 5: - bli_dgemv_n_zen_int_32x8n_avx512( ... ) - bli_dgemv_n_zen_int_32x4n_avx512( ... ) - bli_dgemv_n_zen_int_32x2n_avx512( ... ) - bli_dgemv_n_zen_int_32x1n_avx512( ... ) - The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the m-dimension and for this kernel beta scaling is handled beforehand within the framework. - Added unit-tests for the new kernels. - AVX2 path for Zen/2/3 architectures still follows the old approach of using fused kernel, namely AXPYF, to perform the GEMV operation. AMD-Internal: [CPUPL-5560] Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79	2024-12-12 10:26:50 -05:00
Shubham Sharma	f2320a1fef	Enabled DGEMM row major kernel for ZEN4 - Merged ZEN4 and ZEN5 DGEMM 8x24 kernel. - Replaced 32x6 kernel with 8x24. Now same kernel is used for ZEN4 and ZEN5. - Blocksizes have been tuned for genoa only. - DGEMM kernel for DTRSM native code path is replaced with 8x24 kernel. - Enabled alpha scaling during packing for ZEN4. - ZEN4 8x24 kernel has been removed. AMD-Internal: [CPUPL-5912] Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754	2024-11-29 08:18:48 +00:00
Edward Smyth	971c890fc6	GTestSuite: Select ukr tests by BLIS version Add definitions in gtestsuite header to list available kernel by AOCL BLIS version. Check these definitions in ukr test programs to avoid missing symbol errors when testing with an older version of BLIS. Currently AOCL_41, AOCL_42, AOCL_50 and AOCL_DEV are supported, with AOCL_DEV inferred from the version being later than the value of AOCL_BLAS_LATEST_VERSION set in CMakeLists.txt. Thanks to Eleni Vlachopoulou for the cmake functionality to automatically detect the version from the library. AMD-Internal: [CPUPL-4500] Change-Id: I40ffd3d3789324fbb1dabfbf5e1dd4e0c94d54d9	2024-11-15 10:07:29 -05:00
Edward Smyth	0249f57022	GTestSuite: Correct blis_impl calls for gemm_compute gemm_compute currently has differences in the interface to the blis_impl layer compared to the top-level API. Modify gtestsuite wrapper to account for this. AMD-Internal: [CPUPL-4500] Change-Id: Ie96c9ac3b23128ae8e03af34ad11e65910dec594	2024-11-12 06:57:59 -05:00
Edward Smyth	ef5cbf7c9a	GTestSuite: More threshold changes Various changes from testing code paths with both gcc and aocc. AMD-Internal: [CPUPL-4378] Change-Id: I8964d8ab4e1f5669026af606598c2eb3dfddde16	2024-11-11 12:34:05 -05:00
Vignesh Balasubramanian	06d776b025	AVX512 ZGEMM SUP Inner product kernels - Implemented a set of column preferential dot-product based ZGEMM kernels(main and fringe) in AVX512(for SUP code-path). These kernels perform matrix multiplication as a sequence of inner products(i.e, dot-products). - These standalone kernels are expected to strictly handle the CRC storage scheme for C, A and B matrices. RRC is also supported through operation transpose, at the framework level. - Added unit-tests to test all the kernels(main and fringe), as well as the redirection between these kernels. AMD-Internal: [CPUPL-5949] Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38	2024-11-06 04:18:57 -05:00
Edward Smyth	9ce2696fc9	GTestSuite: Fix builds testing against MKL Correction to CMakeLists.txt to fix problem building executables when testing against MKL. AMD-Internal: [CPUPL-5928] Change-Id: Ie427fff0afb48be6ce6d940b1db2c9d1c7a40e5b	2024-10-29 06:32:27 -04:00
Edward Smyth	cffb501e00	GTestSuite: ILP64 build fix Cast literal 0 to match integer size in std::max tests. AMD-Internal: [CPUPL-4500] Change-Id: I330aafd8669884c5e1900b95742b5d1e4ce8ddfa	2024-10-29 06:10:49 -04:00
Eleni Vlachopoulou	d6a411d6b6	GTestSuite: Reorganizing some tests - Breaking tests to smaller executables. - Removing some redundant tests. AMD-Internal: [CPUPL-4500] Change-Id: I6288c3fcf5194ccb5de3485ca1ad95a20414208c	2024-10-02 11:48:18 -04:00
Eleni Vlachopoulou	72536e56ba	GTestSuite: Reducing gemm tests. Since there is thorough kernel testing, we reduce the number of "Black Box" test cases so that CI is faster. AMD-Internal: [CPUPL-4500] Change-Id: Ie57eeccff8103c0051eb1904162d6447da0ef102	2024-09-19 12:17:20 -04:00
Edward Smyth	6330ac6a52	GTestSuite: Misc changes - Correct matsize and NumericalComparison functions for tests with first matrix dimension <= 0. - BLAS1: - Fix for BLAS vs CBLAS differences in amaxv IIT_ERS tests. - Threshold adjustments in ddotxf and zaxpy. - Break axpyv and scalv into separate executables for each data type. - BLAS2: - Threshold adjustments in symv and hemv. - Break ger into separate executables for each data type. - UKR: - Break gemm and trsm ukr test into separate executables for each data type. - Threshold adjustments in daxpyf - Disable {z,c}trsm ukr tests when BLIS_INT_ELEMENT_TYPE is used, as matrix generator is not currently suitable for this. AMD-Internal: [CPUPL-4500] Change-Id: I1d9e7acc11025f1478b8b511c14def5517ef0ae6	2024-09-19 10:17:36 -04:00
Eleni Vlachopoulou	c7a5d04d4d	GTestSuite: Disabling failing tests. Those can be run if --gtest_also_run_disabled_tests is used. Bugs will be addressed and resolved in the future. AMD-Internal: [CPUPL-4500] Change-Id: I7a5443606ea8ef20f18ff8beec14bece5f6ee661	2024-09-18 13:12:35 +01:00
Edward Smyth	54f8fb951e	GTestSuite: BLAS2 test case selection Various changes to BLAS2 test cases: - GEMV: Reduce number of tests to make runtime more reasonable. - TRSV: - Standardize tests across different data types, including adding memory testing for all variants. - Improve scaling when making matrix A diagonally dominant and avoid singular matrix when BLIS_INT_ELEMENT_TYPE is used. - TRMV: Copy TRSV generic tests. - Expand set of tests for HEMV, HER, HER2, SYMV, SYR, SYR2 and make lda contribution to test names consistent with other routines. - Various adjustments to thresholds added. Update gtestsuite documentation to describe using GTEST_FILTER environment variable to select tests to run or exclude. This works particularly well when using ctest, as we do not enumerate all the tests at this level and so need to pass the selection down to the individual executables. AMD-Internal: [CPUPL-4500] Change-Id: Ifcb6410455b7f91e58b555f94b9fd7920d7ad9d9	2024-09-17 09:35:29 -04:00
Edward Smyth	61c6f1ad78	GTestSuite: Fix alpha and beta input argument tests Check if alpha and beta are null before testing values. This avoids possible seg faults if alpha or beta have not been defined in IIT tests. AMD-Internal: [CPUPL-4500] Change-Id: Ibbf2d6a8fb38d9a95033f3fec3d06c3441e98689	2024-09-17 09:00:09 -04:00
Edward Smyth	8d4881c4fd	GTestSuite: add option to test blis_impl layer Add BLAS_TEST_IMPL option for TEST_INTERFACE to test the wrapper layer underneath BLAS and CBLAS interfaces. This is particularly useful if building a BLIS library with these interfaces disabled, e.g. ./configure --disable-blas amdzen or cmake . -DENABLE_BLAS=OFF -DBLIS_CONFIG_FAMILY=amdzen The ?_blis_impl wrappers should have the same arguments as the BLAS interfaces, thus we define TEST_BLAS_LIKE as an additional definition for convenience when selecting tests and options in the C++ files. AMD-Internal: [CPUPL-5650] Change-Id: I0275a387563f3efc2b40029950c8569956f2df7b	2024-09-16 09:53:56 -04:00
Edward Smyth	a07e041b1f	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiply all vector elements by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae	2024-09-16 07:10:28 -04:00
Edward Smyth	3a6d367f9c	GTestSuite: Fix TRSM ukr tests in non-zen builds Add guards around bli_trsm_small kernel tests to only call them if BLIS_ENABLE_SMALL_MATRIX_TRSM is defined. This fixes missing symbol errors in tests of non-zen builds, e.g. generic or skx. AMD-Internal: [CPUPL-4500] Change-Id: I7a822a41b5f686b5e38b0c63dd1871963e990407	2024-08-21 07:45:06 -04:00
Chandrashekara K R	545f9ee44e	CMake: Updated the minimum supported cmake version to 3.22.0 to maintain uniformity across all AOCL libraries. AMD Internal : [CPUPL-5616] Change-Id: Ic53532ff9883b1bba39e859ea2523c20c1ac383b	2024-08-21 12:09:24 +05:30
Vignesh Balasubramanian	93631410a3	Bugfix : Fixed memory accesses in AVX512 SGEMMSUP RD kernels - Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction that strictly needed YMMs. - Further updated the existing micro-kernel test cases to simulate these issues and validate the fix. AMD-Internal: [CPUPL-5353] Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1	2024-08-19 17:18:31 +05:30
Arnav Sharma	a67c8f05fb	Gtestsuite: Fix for GEMM_COMPUTE IIT_ERS Test - The IIT_ERS test for GEMM_COMPUTE where alpha = 0 and beta = 0 was failing since neither of the matrices was being packed and thus, missing the scaling by alpha resulting in a non-zero output for C matrix (C := A * B). - Enabled packing of A matrix for the ZeroAlpha_ZeroBeta IIT_ERS test which handles the alpha scaling. AMD-Internal: [CPUPL-5598] Change-Id: Id9179ec6150d1bc5a0274edce727ce6cc4172213	2024-08-13 17:24:27 +05:30
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	b964308e50	GTestSuite: option to check input arguments Add tests to check input arguments have not been modified by BLIS routine. These tests add a large runtime overhead, so they are disabled by default. To enable them, configure gtestsuite with: cmake -DTEST_INPUT_ARGS=ON ... and run desired tests as normal. Also: - Correct testinghelpers::chktrans to handle upper case values of argument trns. - Change testinghelpers::matsize to return size 0 if m, n or leading dimension are 0, or if leading dimension is too small. AMD-Internal: [CPUPL-4379] Change-Id: I9494af800f9383195272ce99f622104a38fd0ed8	2024-08-05 09:58:17 -04:00
Edward Smyth	6393cb9d7c	GTestSuite: misc corrections 3 - Set threshold to epsilon for early return cases where we are just scaling a matrix. - Add this threshold to IIT_ERS files for appropriate tests. - In IIT_ERS for gemm_compute, remove tests on null A and B when we are expecting to set or scale C. More thought is required in gemm_compute tests to handle these cases and look at cases where A or B has been packed. AMD-Internal: [CPUPL-4500] Change-Id: Ia649cc340ca1df6511388f9c43a31e53296cb2bf	2024-08-05 09:31:18 -04:00
Arnav Sharma	0a5c057475	DGEMV Optimizations for Tiny Sizes - Added reference kernel for dgemv that handles computation for tiny sizes (m < 8 && n < 8). - The reference kernel, bli_dgemv_zen_ref( ... ), supports both row/column storage schemes as well as transpose and no transpose cases. - Added additional unit-tests for functional verification. AMD-Internal: [CPUPL-5098] Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada	2024-08-05 12:19:42 +05:30
Ruchika Ashtankar	bdb94fb218	GTestSuite: Added tests for DGEMM SUP kernel - Added dgemmGenericSUP test for the new 24x8 DGEMM SUP kernel for zen5. AMD-Internal: [CPUPL-4404] Change-Id: I150ca310655a495bdcf5ea9d5a16746483a17b68	2024-08-02 11:37:29 -04:00
Edward Smyth	75f21182bd	GTestSuite: IIT and ERS test improvements Various improvements: - Where appropriate, test both: - with nullptr for suitable arguments that should never be touched. - with all arguments correct except the one we want to test, to check we are not returning early because another argument is a nullptr. - Test incorrect values for order argument in CBLAS calls. - Test early exits with limited data changes, e.g. set C to 0 or scale C in GEMM when alpha = 0. - Bugfix in gemmt test when alpha is 0 and beta is 1. - Use reference library gemmt for comparison when library is not netlib BLAS. AMD-Internal: [CPUPL-4500] Change-Id: Ibde7eaba5a484a87674044ca44855c6f6ee4ff4b	2024-07-31 15:36:01 -04:00
Edward Smyth	b90e12dfa4	GTestSuite: copyright notice Standardize format of copyright notice. AMD-Internal: [CPUPL-4500] Change-Id: I6bde64c15ff639492dd0de95423c660112a37e2c	2024-07-26 15:34:41 -04:00
Edward Smyth	ea286cf6f6	GTestSuite: whitespace at end of lines Unnecessary whitespace (spaces, tabs) at the end of lines has been removed. AMD-Internal: [CPUPL-4500] Change-Id: Ice5f5504232cb22460c14ac47e6a3a43309cba22	2024-07-26 12:12:56 -04:00
Edward Smyth	4183efa722	GTestSuite: No newline at end of file Add missing newline at the end of these files. AMD-Internal: [CPUPL-4500] Change-Id: I835cc73de0008b66ae3cf77fbb3daa1c8fcaaa7f	2024-07-26 11:42:57 -04:00
Edward Smyth	46fe3f3dcb	GTestSuite: dos2unix file conversion Source and other files in some directories were a mixture of Unix and DOS file formats. Convert all relevant files to Unix format for consistency. AMD-Internal: [CPUPL-4500] Change-Id: Ia3e479643b0bed4ae8a9107bde6e2cddf32d5bd8	2024-07-26 11:09:06 -04:00
Arnav Sharma	9583ee2e23	DGEMV Optimizations for NO_TRANSPOSE cases - Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases. - Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16. - Added a wrapper for DAXPYF kernels for redirection to kernels with a smaller fuse factor than 32. - Also added UKR tests for the new fused kernels. AMD-Internal: [CPUPL-5098] Change-Id: I0b102b67c6c068873393bac0494284f379c253f2	2024-07-24 15:59:36 +05:30
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel unit tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
Vignesh Balasubramanian	02da190560	AVX512 optimizations for DNRM2 - Implemented bli_dnorm2fv_unb_var1_avx512( ... ) AVX512 computational kernel for DNRM2 API. - Updated the header to include this kernel signature, as well as the framework layer to use this function in case of ZEN4 and ZEN5 configurations. - Updated the tipping points for ideal thread setting in DNRM2 for ZEN5 micro-architecture. These thresholds are specific to the library's linkage to LLVM's OpenMP or GNU's OpenMP. - Further abstracted the AOCL-DYNAMIC logic to separate functions for ?NRM2 APIs that currently support it(namely, DNRM2 and ZNRM2). - Further updated the ?NRM2 framework to accommodate the necessary changes to invoke the newer AOCL-DYNAMIC functions and the AVX512 kernel, when needed. - Added micro-kernel and memory tests for this kernel in GTestsuite, to validate accuracy and out-of-bounds read and write. AMD-Internal: [CPUPL-5265] Change-Id: I4fc0d0f1e6906bf27d46562ca387c338cc4d2049	2024-06-24 08:50:36 -04:00
Vignesh Balasubramanian	6165001658	Bugfix and optimizations for ?AXPBYV API - Updated the existing code-path for ?AXPBYV to reroute the inputs to the appropriate L1 kernel, based on the alpha and beta value. This is done in order to utilize sensible optimizations with regard to the compute and memory operations. - Updated the typed API interface for ?AXPBYV to include an early exit condition(when n is 0, or when alpha is 0 and beta is 1). Further updated this layer to query the right kernel from context, based on the input values of alpha and beta. - Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV, ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special case handling in ?AXPBYV. - Moved the early return with negative increments from ?SCAL2V kernels to its typed API interface. - Updated the zen, zen2 and zen3 context to include function pointers for all these vector kernels. - Updated the existing ?AXPBYV vector kernels to handle only the required computation. Additional cleanup was done to these kernels. - Added accuracy and memory tests for AVX2 kernels of ?SETV, ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs. - Updated the existing thresholds in ?AXPBYV tests for complex types. This is due to the fact that every complex multiplication involves two mul ops and one add op. Further added test-cases for API level accuracy check, that includes special cases of alpha and beta. - Decomposed the reference call to ?AXPBYV with several other L1 BLAS APIs(in case of the reference not supporting its own ?AXPBYV API). The decomposition is done to match the exact operations that are done in BLIS based on alpha and/or beta values. This ensures that we test for our own compliance. AMD-Internal: [CPUPL-4861] Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6	2024-06-20 16:22:07 +05:30
Mangala V	90fe795c46	Gtestsuite: Enabled memory test for ZGEMM for k=0 AMD_Internal: [CPUPL-4657] Change-Id: Ic5f4d24184f05e0f57634845b4fb3312b3a416f6	2024-06-20 02:51:47 -04:00

186 Commits