amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-24 18:34:40 +00:00

Author	SHA1	Message	Date
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Shubham Sharma	16c56e0101	Added 24x8 triangular kernels for DGEMMT SUP - In order to reuse 24x8 AVX512 DGEMM SUP kernels, 24x8 triangular AVX512 DGEMMT SUP kernels are added. - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal pattern repeats every 24x24 block of C. To cover this 24x24 block, 3 kernels are needed for one variant of DGEMMT. A total of 6 kernels are needed to cover both upper and lower variants. - In order to maximize code reuse, the 24x8 kernels are broken into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8 diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8 full GEMM part is computed by 24x8 DGEMM SUP kernel. - Changes are made in framework to enable the use of these kernels. AMD-Internal: [CPUPL-5338] Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a	2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel units tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Shubham Sharma	75df1ef218	Removed -fno-tree-loop-vectorize from kernel flags - This change in made in CMAKE build system only. - Removed -fno-tree-loop-vectorize from global kernel flags, instead added it to lpgemm specific kernels only. - If this flag is not used , then gcc tries to auto vectorize the code which results in usages of vector registers, if the auto vectorized function is using intrinsics then the total numbers of vector registers used by intrinsic and auto vectorized code becomes more than the registers available in machine which causes read and writes to stack, which is causing regression in lpgemm. - If this flag is enabled globally, then the files which do not use any intrinsic code do not get auto vectorized. - To get optimal performance for both blis and lpgemm, this flag is enabled for lpgemm kernels only. Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4	2024-07-19 00:49:52 -04:00
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1, comparsion for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparsion to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
Hari Govind S	627bf0b1ba	Implemented Multithreading and Enabled AVX512 Kernel for ZAXPY API - Replaced 'bli_zaxpyv_zen_int5' kernel with optimised 'bli_zaxpyv_zen_int_avx512' kernel for zen4 and zen5 config. - Implemented multithreading support and AOCL-dynamic for ZAXPY API. - Utilized 'bli_thread_range_sub' function to achieve better work distribution and avoid false sharing. AMD-Internal: [CPUPL-5250] Change-Id: I46ad8f01f9d639e0baa78f4475d6e86458d8069b	2024-07-09 01:29:53 -04:00
Edward Smyth	8de8dc2961	Merge commit '81e10346' into amd-main * commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (#560) Fix insufficient pool-growing logic in bli_pool.c. (#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_* Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803	2024-06-25 05:48:46 -04:00
mkadavil	a5c4a8c7e0	Int4 B matrix reordering support in LPGEMM. Support for reordering B matrix of datatype int4 as per the pack schema requirements of u8s8s32 kernel. Vectorized int4_t -> int8_t conversion implemented via leveraging the vpmultishiftqb instruction. The reordered B matrix will then be used in the u8s8s32o<s32\|s8> api. AMD-Internal: [SWLCSG-2390] Change-Id: I3a8f8aba30cac0c4828a31f1d27fa1b45ea07bba	2024-06-24 07:55:34 -04:00
Vignesh Balasubramanian	6165001658	Bugfix and optimizations for ?AXPBYV API - Updated the existing code-path for ?AXPBYV to reroute the inputs to the appropriate L1 kernel, based on the alpha and beta value. This is done in order to utilize sensible optimizations with regards to the compute and memory operations. - Updated the typed API interface for ?AXPBYV to include an early exit condition(when n is 0, or when alpha is 0 and beta is 1). Further updated this layer to query the right kernel from context, based on the input values of alpha and beta. - Added the necessary L1 vector kernels(i.e, ?SETV, ?ADDV, ?SCALV, ?SCAL2V and ?COPYV) to be used as part of special case handling in ?AXPBYV. - Moved the early return with negative increments from ?SCAL2V kernels to its typed API interface. - Updated the zen, zen2 and zen3 context to include function pointers for all these vector kernels. - Updated the existing ?AXPBYV vector kernels to handle only the required computation. Additional cleanup was done to these kernels. - Added accuracy and memory tests for AVX2 kernels of ?SETV ?COPYV, ?ADDV, ?SCALV, ?SCAL2V, ?AXPYV and ?AXPBYV APIs - Updated the existing thresholds in ?AXPBYV tests for complex types. This is due to the fact that every complex multiplication involves two mul ops and one add op. Further added test-cases for API level accuracy check, that includes special cases of alpha and beta. - Decomposed the reference call to ?AXPBYV with several other L1 BLAS APIs(in case of the reference not supporting its own ?AXPBYV API). The decomposition is done to match the exact operations that is done in BLIS based on alpha and/or beta values. This ensures that we test for our own compliance. AMD-Internal: [CPUPL-4861] Change-Id: Ia6d48f12f059f52b31c0bef6c75f47fd364952c6	2024-06-20 16:22:07 +05:30
Shubham Sharma.	580282e655	DGEMM optimizations for Turin Classic - Introduced new 8x24 row preferred kernel for zen5. - Kernel supports row/col/gen storage schemes. - Prefetch of current panel of A and C are enabled. - Prefetch of next panel of B is enabled. - Kernel supports negative offsets for A and B matrices. - Cache block tuning is done for zen5 core. AMD-Internal: [CPUPL-5262] Change-Id: I058ea7e1b751c20c516d7b27a1f27cef96ef730f	2024-06-17 05:18:49 -04:00
Mangala V	64d9c96d45	ZGEMMT SUP: AVX512 GEMMT code for Upper variant 1. Enabled AVX512 path for - Upper variant - Different storage schemes for upper and lower variant 2. Modified mask value to handle all fringe cases correctly AMD_Internal: [CPUPL-5091] Change-Id: I4bf8aca24c1b87fff606deb05918b8e6216b729e	2024-05-15 13:08:32 +05:30
Hari Govind S	61d0f3b873	Additional optimisations on COPYV API - Reduced number of jump operations in AVX512 assembly kernel for SCOPYV, DCOPYV and ZCOPYV. - Fixed memory test failure for bli_zcopyv_zen_int_avx512 kernel. - Replaced existing AVX2 COPYV intrinsic kernels in bli_cntx_init_zen5.c with AVX512 assembly kernels. Change-Id: Idc11601b526d6d82cfbdf63af2fd331918b31159	2024-05-10 07:22:04 -04:00
Arnav Sharma	cb27fad49c	ZSCALV AVX512 Kernel - Implemented ZSCALV kernel utilizing AVX512 intrinsics. - Gtestsuite: Added ukr tests for the new kernel. AMD-Internal: [CPUPL-5012] Change-Id: I75c7f4448ddd60b0f9afa53936eed37f5f99eeb2	2024-05-08 11:55:13 -04:00
Arnav Sharma	1dbeee4d19	ZDOTV AVX512 Kernel with MT Support - Added AVX512 kernel for ZDOTV. - Multithreaded both ZDOTC and ZDOTU with AOCL_DYNAMIC support. AMD-Internal: [CPUPL-5011] Change-Id: I56df9c07ab3b8df06267a99835b088dcada81bd8	2024-05-08 04:54:05 -04:00
Mangala V	e6cc2a3e22	ZGEMMT SUP Optimizations for AVX512 Existing Design: - GEMM AVX2 kernel performs computation and updates temporary C buffer - Portion of temporary C buffer is copied to output C buffer based on UPLO parameter - For diagonal blocks, using GEMM kernels is not efficient New Design: Implemented in current patch when UPLO='L' - GEMMT kernel used for computation, temporary buffer is not required. - Only required elements are computed using mask load store for all fringe cases - Exception: AVX2 code path is used when storage format is RRC, CRR, CRC - AOCL-Dynamic is added based on dimension - Check for AVX platform is added in SUP interface, It returns to native implementation if hardware doesnot support AVX platform - SUP ref_var2m is expanded for dcomplex datatype to avoid condition check which exists for double datatype AMD_Internal: [CPUPL-5006] Change-Id: I3e21404b732b8f2df9cbdba394303752fdf36286	2024-05-07 23:00:29 +05:30
Shubham Sharma	b70347d0d4	DGEMMT SUP Optimizations for AVX512 - In DGEMMT SUP AVX2 code path, traingular kernels are added in order to avoid temporary C buffer. - Since these kernels did not exist for AVX512, AVX2 kernels were being used in GEMMT. - AVX512 triangular GEMM kernel has been added to make sure that AVX512 kernels can be used without creating a temporary buffer. - This kernel is added only for Lower variant of GEMMT, for upper variant of DGEMMT, temporary C buffer is created, full GEMM kernel is called on temporary C and traingular region from temporary C is copied to C buffer. AMD-Internal: [CPUPL-4881] Change-Id: Id70645f79ae078ab9a7006e83d328505f1fae8a9	2024-05-03 05:11:11 -04:00
Vignesh Balasubramanian	53cb83d0cc	AVX512 optimizations for ZGEMV API with no-transpose case - Implemented AVX512 kernels for handling the calls to ZGEMV with no-transpose to A matrix. - This includes the ZAXPYF, ZAXPYV and ZSETV kernels. The set of ZAXPYF kernels include those with fuse-factor 8 (main kernel), 4 and 2(fringe kernels). - Updated the bli_zgemv_unf_var2( ... ) function to set the function pointers to these kernels, based on the configuration. Further added the call to ZSETV at this layer in case beta is 0. AMD-Internal: [CPUPL-4974] Change-Id: Iee4b724719e49023138bb16479765be44d677cd9	2024-05-03 07:04:47 +00:00
Hari Govind	9c26de1a18	Optimisiation COPYV APIs - Implemented AVX512 kernels for scopyv_, dcopyv_ and zcopyv_ using respective AVX512 intrinsics including masked load and store operations. - Implemented AVX512 kernels for scopy_, dcopy_ and zcopy_ using assembly language to prevent loss of performance during the translation of intrinsics. - Updated the dcopy_blis_impl( ... ) and zcopy_blis_impl( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Implemented OpenMP parallelization for dcopyv_ and zcopyv_ APIs, while scopyv_ and ccopyv_ only support single thread. AMD-Internal: [CPUPL-4854] Change-Id: I5fbd0bcca4e59001fbe2b1168b624d0c33242b3e	2024-05-01 00:23:01 +05:30
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Eleni Vlachopoulou	020b9ff7f0	CMake: Enable builds for both static and shared builds for Linux. - Added BUILD_STATIC_LIBS option which is on by default, only on Linux. - Added TEST_WITH_SHARED option which is off by default, only on Linux. - If only shared or static lib is being built, that's the one that will be used for testing. - If both are being built, TEST_WITH_SHARED determins which library wil be used for testing. - Set linux workflows so that they build both static and shared libs, and use linux-static and linux-shared to denote which one should be used for testing. - Set -fPIC for both static and shared builds to fix issues faced when building blis using AOCC 4.0.0 and gtestsuite using gcc 9.4.0. AMD-Internal: [CPUPL-2748] Change-Id: I4227bab97ff31ecddfe218e18499f33b4e4ee63e	2024-03-14 10:32:51 -04:00
jagar	e2de45b454	CMake:Added support for ADDON(aocl_gemm) on Windows CMakelists.txt is updated to support aocl_gemm on windows. On windows, BLIS library(blis+aocl_gemm) is built successfully only with AOCC Compiler. (Clang has an issue with optimizing VNNI instructions). $cmake .. -DENABLE_ADDON="aocl_gemm" .... AMD-Internal: [CPUPL-2748] Change-Id: I9620878ab6934233fadc9ddc5d5e82ad85be9209	2024-03-14 07:57:02 -04:00
jagar	bbfa4a88ec	CMake: Updated compiler ID in cmake files Updated compiler id in cmake related files from CMAKE_CXX_COMPILER_ID to CMAKE_C_COMPILER_ID AMD-Internal: [CPUPL-2748] Change-Id: Ib0e2a2e3ec8fafeb423fe56b9842a93db0115371	2024-03-14 07:24:04 -04:00
Kiran Varaganti	0784679d4d	Fix gcc 7.5 compilation error for zen4 and above configs For gcc greater than or equal to 7.0 version added AVX512 compiler flags in makde_defs.mk and make_defs.cmake. AVX512VNNI compiler flag is only supported from gcc version 8 or greater. So added another else condition for gcc version greater than or equal to 7 - enabling avx512 flags. This enables compilation of AVX512 assembly code paths with gcc 7.5 version. Change-Id: I2cda00e578010db5e5a515b506c0b99f685307e0	2024-02-26 05:20:35 -05:00
mkadavil	01b7f8c945	Matrix Add post-operation support for integer(s16\|s32) LPGEMM APIs. -This post-operation computes C = (betaC + alphaA*B) + D, where D is a matrix with dimensions and data type the same as that of C matrix. -For clang compilers (including aocc), -march=znver1 is not enabled for zen kernels. Have updated CKVECFLAGS to capture the same. AMD-Internal: [SWLCSG-2424] Change-Id: Ie369f7ea5c80ab69eea3f3e03a8d9546e14f5c09	2024-02-12 23:51:36 +05:30
Edward Smyth	f93ccb0cea	BLIS: zen5 cpuid and arch changes Implement initial support for Zen5 systems: - Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B instructions. - Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5 will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but different hardware model will still be detected. AMD-Internal: [CPUPL-3518] Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600	2024-01-17 11:41:15 -05:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Shubham Sharma	ffa8f584be	Added ZTRSM AVX512 native path kernels - Added 4x12 ZGEMM row-preferred kernel. - Added 4x12 ZTRSM row-preferred lower and upper kernels using AVX512 ISA. - These kernels are used for ZTRSM only, zgemm still uses 12x4 kernel. - Kernels support row/col/gen storage. - Kernels support A prefetch, B prefetch, A_next prefetch, B_next prefetch and c prefetch. - B prefetch, B_next prefetch and C prefetch are enabled by default. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a	2023-11-03 09:42:24 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
Shubham Sharma	9a2a4151ac	Added improved ZTRSM AVX2 kernels - Added 2x6 ZGEMM row-preferred kernel. - Kernel supports prefetch_a, prefetch_b, prefetch_a_next and prefetch_b_next. - Multiple Ways to prefetch c are supported. - prefetch_a and prefetch_c are enabled by default. - K loop is divided into multiple subloops for better c prefetch. - Added 2x6 ZTRSM row-preferred lower and upper kernels using AVX2 ISA. - These kernels are used for ZTRSM only, zgemm still uses 3x4 kernel. - Kernels support row/col/gen storage. - Updated the zen3 and zen4 config to enable use of these kernels for TRSM in zen3 and zen4 path. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73	2023-10-13 07:43:33 -04:00
Meghana Vankadari	3a71550bc3	Enabling SUP blocksizes & kernels for generic config Details: - pack and compute extension APIs derive blocksizes(MR, NR...) from SUP cntx. - SUP blocksizes are not set for generic/skx configs. As a result pack and compute APIs cause floating point exceptions. - To fix these issues, we have enabled non-zero SUP blocksizes for generic config and zen4 SUP blocksizes for skx config. - However, these changes will not enable SUP path for skx/generic config as thresholds are set to zero. - To enable SUP path for skx config, more work is needed like non-zero thresholds and modifications to build system. Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9	2023-10-12 05:27:10 -04:00
Edward Smyth	85f2bf6c4a	Fix for x86_64 builds Configuration x86_64 includes all Intel and AMD sub-configurations. Fixes to enable this to work correctly again are: - In config_registry use amdzen rather than amd64 in x86_64 family. - Copy settings from config/amdzen/bli_family_amdzen.h to config/x86_64/bli_family_x86_64.h - Modify configure to set enable_aocl_zen=yes for x86_64, but not for amd64_legacy. - Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are enabled. Note: sub-configurations knl and bulldozer use instructions that are not supported on most x86_64 processors. AMD-Internal: [CPUPL-3838] Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04	2023-10-09 07:24:21 -04:00
Edward Smyth	24e4d58f92	Tidy zen bli_cntx_init and bli_family files Tidy formatting of config/zen/bli_cntx_init_zen.c and config/zen/bli_family_.c files to make them more consistent with each other and improve readability. AMD-Internal: [CPUPL-3519] Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b	2023-10-04 05:14:39 -04:00
orequest	09e34fd2bd	Added optimised CGEMM function pointers in zen4 cntx 1. Two CGEMM function pointers are added for different storage schemes 1. bli_cgemmsup_rv_zen_asm_3x8m 2. bli_cgemmsup_rv_zen_asm_3x8n 2. In previous commit: (Level-3 triangular routines now use different block sizes and kernels Commit Id: `79e174ff0a`) 1. bli_cntx_set_l3_sup_tri_kers cntx function was created 2. Function holds optimised function pointers for GEMMT/SYRK API's 3. It avoids over riding default block sizes which improves the performance 4. This function did not include optimised CGEMM function pointers leading to regression as reference kernels were invoked 3. With this commit, 2 optimized CGEMM function pointers are added in bli_cntx_set_l3_sup_tri_kers 1. This fixes the regression as optimized CGEMM functions are invoked AMD-Internal: [CPUPL-3831] [CPUPL-3830] Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e	2023-09-11 06:38:31 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Edward Smyth	6911d2dd21	zen config make_defs.mk improvements Improvements to zen make_defs.mk files: * Add -znver4 flag for GCC 13 and later. * Add AVX512 flags or -znver4 as appropriate for upstream LLVM in config/zen4/make_defs.mk to enable BLIS to be build with LLVM rather than AOCC. * zen make_defs.mk files were inheriting settings from the previous one (zen->zen2->zen3->zen4), when they should be independent of each other. Correct by including config/zen/amd_config.mk in all zen make_defs.mk files to reinitialize the compiler flags. * Update zen2 and zen3 make_defs.mk for recent AOCC compiler releases, rather than rely on LLVM settings. * Remove -mfpmath=sse flag in config/zen4/make_defs.mk as this is already specified in amd_config.mk (and should be the default setting anyway). * Tidy files to simplify nested if structures and be more consistent with one another. AMD-Internal: [CPUPL-3399] Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac	2023-05-22 07:51:41 -04:00
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Harihara Sudhan S	ada88e3695	Mismatch in fuse factor and kernel fuse - In Zen 4 context, there was a mismatch between the fuse factor initialized in the block size parameter and fuse factor of the corresponding kernel initialized. AMD-Internal: [SWLCSG-2051] Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d	2023-04-21 06:30:11 -04:00
Arnav Sharma	4aace5f524	Smart Threading for SGEMM SUP for Zen4 Architecture - Added Smart Threading logic for AVX-512 based SGEMM SUP. - Calculating ic and jc for optimal work distribution to the allocated threads based on logic similar to Zen3. - Zen4 Architecture specific Native-to-SUP check has been added to redirect few Native inputs to the SUP path based on the fact that in a multi-threaded environment some Native cases perfom better as SUP. - For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have been increased from 512 and 200 to 682 and 512, respectively. - Further optimizations to the work distribution logic will be added subsequently. AMD-Internal: [CPUPL-3248] Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca	2023-04-21 12:54:03 +05:30
Edward Smyth	b531022bac	BLIS cpuid: distinguish submodels within a microarchitecture Incorporate a means of detecting submodels of a microarchitecture, so that different optimizations e.g. block sizes or kernel choices can be used. The details are as follows: - Different models are currently only enabled for zen3 and zen4 architectures (for server parts). - There is a single enumeration (model_t) for all models for all architectures, but function bli_check_valid_model_id() should check the provided model_id against the suitable range within the enumeration for the provided arch_id. - To enable the model_id to be used within the cntx setup functions, checking of a user specified value of BLIS_ARCH_TYPE against the enabled configurations is delayed to a separate function, bli_arch_check_id(). - Default selection based on hardware can be overridden using the BLIS_MODEL_TYPE environment variable. Valid values are: Genoa, Bergamo, Genoa-X, Milan, Milan-X Values are case-insensitive and -X can also be specified as _X or X - Specifying an incorrect value for BLIS_MODEL_TYPE is not an error, but will result in the default option for that architecture being selected. This is different to specifying an incorrect value of BLIS_ARCH_TYPE, which is an error. - The environment variable BLIS_MODEL_TYPE can be renamed using the --rename-blis-model-type argument to configure (or cmake equivalent), in a similar way to renaming BLIS_ARCH_TYPE with --rename-blis-arch-type. - Configure option --disable-blis-arch-type will disable both BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables. - Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes, currently only for AMD cpus. Functions are provided to query these from other parts of the code, namely: uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size() AMD-Internal: [CPUPL-3033] Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702	2023-04-20 10:47:44 -04:00
Meghana Vankadari	f788618f27	Setting AVX-512 specific blocksizes as default for L3 SUP for zen4 config Details: - Overriding of blocksizes with avx-2 specific ones(6x8) is done for gemmt/syrk because near-to-square shaped kernel performs better than skewed/rectangular shaped kernel. - Overriding is done for S,D and Z datatypes. AMD-Internal: [CPUPL-3060] Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53	2023-04-20 08:52:34 -04:00
Meghana Vankadari	42d05a5aa0	DGEMM: Added decision logic to choose between sup vs native for zen4 architecture Details: - Added a new function for choosing between SUP and native implementation for a given size. - This function pointer is stored in cntx for zen4 config. - Divided total combinations of sizes into 3 categories: - one dimension is small - Two dimensions are small - All dimensions are small - Added different threshold conditions for each of the categories. AMD-Internal: [CPUPL-2755] Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf	2023-04-17 13:08:34 -04:00
Harihara Sudhan S	15bd0f9646	Added AVX512 based double and float AXPYV - Added AVX512 based double and float AXPYV which will be used in Zen4 context. - Added n <= 0 check and alpha == 0 check to the BLAS layer of SAXPY. - Modified BLAS framework of float AXPYV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2793] Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1	2023-04-14 01:06:57 -04:00
Harihara Sudhan S	6b8f4744a4	Added AVX512 based double and float DOTV - Added AVX512 based double and float DOTV which will be used in Zen4 context. - Added n <= 0 check to the BLAS layer of SDOTV. - Modified BLAS framework of float DOTV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2800] Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702	2023-04-12 12:36:52 +05:30
Harihara Sudhan S	be7fb342c1	Added AVX512 based double and float SCALV - Added AVX512 based double and float SCALV which will be used in Zen4 context. - Added incx <= 0 check and alpha == 1 check to the BLAS layer of SSCAL. - Modified BLAS framework of float SCAL to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2766],[CPUPL-2765] Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc	2023-04-11 14:41:56 +05:30
Arnav Sharma	c14ce55bcb	Added SGEMM SUP blocksize override to zen4 context - Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for SGEMM. This can be updated once GEMMT specific optimization are added for AVX-512. - Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for SGEMM operation. AMD-Internal: [CPUPL-3060] Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a	2023-04-10 08:16:45 -04:00
vignbala	775ce1f13c	Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM - Implemented 12x4m column preferential SUP kernels(main and fringe cases). The main kernel dimension is 12x4, and the associated fringe kernel dimensions are : 12x3m, 12x2m, 12x1m 8x4, 8x3, 8x2, 8x1 4x4, 4x3, 4x2, 4x1 2x4, 2x3, 2x2, 2x1. - Included in-register transposition support for C, thus extending the storage scheme supports to CCC, CCR, RCC and RCR inside the milli-kernel. - Integrated conditional packing of A onto the SUP front end for dcomplex datatype. This redirects RRC and CRC storage schemes onto the preceding set of SUP kernels through storage scheme transformation(RCC and CCC respectively). - Updated the zen4 context file with the new set of SUP kernels, to get enabled appropriately. Furthermore, the context file was updated with the AVX-2 dotxv signatures for dcomplex datatype. This redirects the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines. - Added C prefetching onto L2-cache, and an unroll factor of 4 for the k loop in all the kernels. - Work in progress to include conjugate support and input spectrum extension for the AVX-512 SUP kernels. The current thresholds in zen4 context is the same as that of the zen3 thresholds for ZGEMM SUP. AMD-Internal: [CPUPL-3122] Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e	2023-04-06 04:49:15 -04:00

1 2 3 4 5 ...

592 Commits