amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 10:05:38 +00:00

Author	SHA1	Message	Date
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Shubham Sharma	ffa8f584be	Added ZTRSM AVX512 native path kernels - Added 4x12 ZGEMM row-preferred kernel. - Added 4x12 ZTRSM row-preferred lower and upper kernels using AVX512 ISA. - These kernels are used for ZTRSM only, zgemm still uses 12x4 kernel. - Kernels support row/col/gen storage. - Kernels support A prefetch, B prefetch, A_next prefetch, B_next prefetch and c prefetch. - B prefetch, B_next prefetch and C prefetch are enabled by default. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a	2023-11-03 09:42:24 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
Shubham Sharma	9a2a4151ac	Added improved ZTRSM AVX2 kernels - Added 2x6 ZGEMM row-preferred kernel. - Kernel supports prefetch_a, prefetch_b, prefetch_a_next and prefetch_b_next. - Multiple Ways to prefetch c are supported. - prefetch_a and prefetch_c are enabled by default. - K loop is divided into multiple subloops for better c prefetch. - Added 2x6 ZTRSM row-preferred lower and upper kernels using AVX2 ISA. - These kernels are used for ZTRSM only, zgemm still uses 3x4 kernel. - Kernels support row/col/gen storage. - Updated the zen3 and zen4 config to enable use of these kernels for TRSM in zen3 and zen4 path. - Updated CMakeLists.txt with ZGEMM kernels for windows build. AMD-Internal: [CPUPL-3781] Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73	2023-10-13 07:43:33 -04:00
Meghana Vankadari	3a71550bc3	Enabling SUP blocksizes & kernels for generic config Details: - pack and compute extension APIs derive blocksizes(MR, NR...) from SUP cntx. - SUP blocksizes are not set for generic/skx configs. As a result pack and compute APIs cause floating point exceptions. - To fix these issues, we have enabled non-zero SUP blocksizes for generic config and zen4 SUP blocksizes for skx config. - However, these changes will not enable SUP path for skx/generic config as thresholds are set to zero. - To enable SUP path for skx config, more work is needed like non-zero thresholds and modifications to build system. Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9	2023-10-12 05:27:10 -04:00
Edward Smyth	85f2bf6c4a	Fix for x86_64 builds Configuration x86_64 includes all Intel and AMD sub-configurations. Fixes to enable this to work correctly again are: - In config_registry use amdzen rather than amd64 in x86_64 family. - Copy settings from config/amdzen/bli_family_amdzen.h to config/x86_64/bli_family_x86_64.h - Modify configure to set enable_aocl_zen=yes for x86_64, but not for amd64_legacy. - Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are enabled. Note: sub-configurations knl and bulldozer use instructions that are not supported on most x86_64 processors. AMD-Internal: [CPUPL-3838] Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04	2023-10-09 07:24:21 -04:00
Edward Smyth	24e4d58f92	Tidy zen bli_cntx_init and bli_family files Tidy formatting of config/zen/bli_cntx_init_zen.c and config/zen/bli_family_.c files to make them more consistent with each other and improve readability. AMD-Internal: [CPUPL-3519] Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b	2023-10-04 05:14:39 -04:00
orequest	09e34fd2bd	Added optimised CGEMM function pointers in zen4 cntx 1. Two CGEMM function pointers are added for different storage schemes 1. bli_cgemmsup_rv_zen_asm_3x8m 2. bli_cgemmsup_rv_zen_asm_3x8n 2. In previous commit: (Level-3 triangular routines now use different block sizes and kernels Commit Id: `79e174ff0a`) 1. bli_cntx_set_l3_sup_tri_kers cntx function was created 2. Function holds optimised function pointers for GEMMT/SYRK APIs 3. It avoids overriding default block sizes which improves the performance 4. This function did not include optimised CGEMM function pointers leading to regression as reference kernels were invoked 3. With this commit, 2 optimized CGEMM function pointers are added in bli_cntx_set_l3_sup_tri_kers 1. This fixes the regression as optimized CGEMM functions are invoked AMD-Internal: [CPUPL-3831] [CPUPL-3830] Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e	2023-09-11 06:38:31 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Meghana Vankadari	79e174ff0a	Level-3 triangular routines now use different block sizes and kernels. Details: - Eliminated the need for override function in SUP for GEMMT/SYRK. - New set of block sizes, kernels and kernel preferences are added to cntx data structure for level-3 triangular routines. - Added supporting functions to set and get the above parameters from cntx. - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels. In case they are not set, use the default block sizes/kernels of Level-3 SUP. AMD-Internal: [CPUPL-3649] Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0	2023-07-26 01:26:11 -04:00
Edward Smyth	6911d2dd21	zen config make_defs.mk improvements Improvements to zen make_defs.mk files: * Add -znver4 flag for GCC 13 and later. * Add AVX512 flags or -znver4 as appropriate for upstream LLVM in config/zen4/make_defs.mk to enable BLIS to be built with LLVM rather than AOCC. * zen make_defs.mk files were inheriting settings from the previous one (zen->zen2->zen3->zen4), when they should be independent of each other. Correct by including config/zen/amd_config.mk in all zen make_defs.mk files to reinitialize the compiler flags. * Update zen2 and zen3 make_defs.mk for recent AOCC compiler releases, rather than rely on LLVM settings. * Remove -mfpmath=sse flag in config/zen4/make_defs.mk as this is already specified in amd_config.mk (and should be the default setting anyway). * Tidy files to simplify nested if structures and be more consistent with one another. AMD-Internal: [CPUPL-3399] Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac	2023-05-22 07:51:41 -04:00
Edward Smyth	7e50ba669b	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. Also correct file format of windows/tests/inputs.yaml, which was missed in commit `0f0277e104` AMD-Internal: [CPUPL-2870] Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549	2023-04-21 10:02:48 -04:00
Harihara Sudhan S	ada88e3695	Mismatch in fuse factor and kernel fuse - In Zen 4 context, there was a mismatch between the fuse factor initialized in the block size parameter and fuse factor of the corresponding kernel initialized. AMD-Internal: [SWLCSG-2051] Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d	2023-04-21 06:30:11 -04:00
Arnav Sharma	4aace5f524	Smart Threading for SGEMM SUP for Zen4 Architecture - Added Smart Threading logic for AVX-512 based SGEMM SUP. - Calculating ic and jc for optimal work distribution to the allocated threads based on logic similar to Zen3. - Zen4 Architecture specific Native-to-SUP check has been added to redirect a few Native inputs to the SUP path based on the fact that in a multi-threaded environment some Native cases perform better as SUP. - For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have been increased from 512 and 200 to 682 and 512, respectively. - Further optimizations to the work distribution logic will be added subsequently. AMD-Internal: [CPUPL-3248] Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca	2023-04-21 12:54:03 +05:30
Edward Smyth	b531022bac	BLIS cpuid: distinguish submodels within a microarchitecture Incorporate a means of detecting submodels of a microarchitecture, so that different optimizations e.g. block sizes or kernel choices can be used. The details are as follows: - Different models are currently only enabled for zen3 and zen4 architectures (for server parts). - There is a single enumeration (model_t) for all models for all architectures, but function bli_check_valid_model_id() should check the provided model_id against the suitable range within the enumeration for the provided arch_id. - To enable the model_id to be used within the cntx setup functions, checking of a user specified value of BLIS_ARCH_TYPE against the enabled configurations is delayed to a separate function, bli_arch_check_id(). - Default selection based on hardware can be overridden using the BLIS_MODEL_TYPE environment variable. Valid values are: Genoa, Bergamo, Genoa-X, Milan, Milan-X Values are case-insensitive and -X can also be specified as _X or X - Specifying an incorrect value for BLIS_MODEL_TYPE is not an error, but will result in the default option for that architecture being selected. This is different to specifying an incorrect value of BLIS_ARCH_TYPE, which is an error. - The environment variable BLIS_MODEL_TYPE can be renamed using the --rename-blis-model-type argument to configure (or cmake equivalent), in a similar way to renaming BLIS_ARCH_TYPE with --rename-blis-arch-type. - Configure option --disable-blis-arch-type will disable both BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables. - Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes, currently only for AMD cpus. Functions are provided to query these from other parts of the code, namely: uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size() AMD-Internal: [CPUPL-3033] Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702	2023-04-20 10:47:44 -04:00
Meghana Vankadari	f788618f27	Setting AVX-512 specific blocksizes as default for L3 SUP for zen4 config Details: - Overriding of blocksizes with avx-2 specific ones(6x8) is done for gemmt/syrk because near-to-square shaped kernel performs better than skewed/rectangular shaped kernel. - Overriding is done for S,D and Z datatypes. AMD-Internal: [CPUPL-3060] Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53	2023-04-20 08:52:34 -04:00
Meghana Vankadari	42d05a5aa0	DGEMM: Added decision logic to choose between sup vs native for zen4 architecture Details: - Added a new function for choosing between SUP and native implementation for a given size. - This function pointer is stored in cntx for zen4 config. - Divided total combinations of sizes into 3 categories: - one dimension is small - Two dimensions are small - All dimensions are small - Added different threshold conditions for each of the categories. AMD-Internal: [CPUPL-2755] Change-Id: Iae4bf96bb7c9bf9f68fd909fb757d7fe13bc6caf	2023-04-17 13:08:34 -04:00
Harihara Sudhan S	15bd0f9646	Added AVX512 based double and float AXPYV - Added AVX512 based double and float AXPYV which will be used in Zen4 context. - Added n <= 0 check and alpha == 0 check to the BLAS layer of SAXPY. - Modified BLAS framework of float AXPYV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2793] Change-Id: Ie6a0976c2cfcf81ae5125f5f9aad14477d4ebbd1	2023-04-14 01:06:57 -04:00
Harihara Sudhan S	6b8f4744a4	Added AVX512 based double and float DOTV - Added AVX512 based double and float DOTV which will be used in Zen4 context. - Added n <= 0 check to the BLAS layer of SDOTV. - Modified BLAS framework of float DOTV to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2800] Change-Id: I550fbcbb17d6d887b9ecbea23237dc806b208702	2023-04-12 12:36:52 +05:30
Harihara Sudhan S	be7fb342c1	Added AVX512 based double and float SCALV - Added AVX512 based double and float SCALV which will be used in Zen4 context. - Added incx <= 0 check and alpha == 1 check to the BLAS layer of SSCAL. - Modified BLAS framework of float SCAL to remove flag check and pick kernels based on architecture ID. - AVX512 kernel is disabled for other Zen configurations using BLIS_KERNELS_ZEN4 macro. AMD-Internal: [CPUPL-2766],[CPUPL-2765] Change-Id: I4cdd93c9adbfbf8f7632730b8606ddcf70edd1dc	2023-04-11 14:41:56 +05:30
Arnav Sharma	c14ce55bcb	Added SGEMM SUP blocksize override to zen4 context - Reverted the SUP blocksizes and kernels to use AVX2 SUP kernels for SGEMM. This can be updated once GEMMT specific optimization are added for AVX-512. - Updated 'bli_zen4_override_gemm_blkszs()' in zen4 context to override blocksize and kernels for SGEMM SUP to enable AVX-512 kernels for SGEMM operation. AMD-Internal: [CPUPL-3060] Change-Id: Ic9b3037363b6e5b59e5035c81651c97ce95d6d9a	2023-04-10 08:16:45 -04:00
vignbala	775ce1f13c	Implemented AVX-512 based 12x4 m-variant SUP kernels for ZGEMM - Implemented 12x4m column preferential SUP kernels(main and fringe cases). The main kernel dimension is 12x4, and the associated fringe kernel dimensions are : 12x3m, 12x2m, 12x1m 8x4, 8x3, 8x2, 8x1 4x4, 4x3, 4x2, 4x1 2x4, 2x3, 2x2, 2x1. - Included in-register transposition support for C, thus extending the storage scheme supports to CCC, CCR, RCC and RCR inside the milli-kernel. - Integrated conditional packing of A onto the SUP front end for dcomplex datatype. This redirects RRC and CRC storage schemes onto the preceding set of SUP kernels through storage scheme transformation(RCC and CCC respectively). - Updated the zen4 context file with the new set of SUP kernels, to get enabled appropriately. Furthermore, the context file was updated with the AVX-2 dotxv signatures for dcomplex datatype. This redirects the fringe cases of type 1x? to the pre-existing AVX-2 GEMV routines. - Added C prefetching onto L2-cache, and an unroll factor of 4 for the k loop in all the kernels. - Work in progress to include conjugate support and input spectrum extension for the AVX-512 SUP kernels. The current thresholds in zen4 context is the same as that of the zen3 thresholds for ZGEMM SUP. AMD-Internal: [CPUPL-3122] Change-Id: If40bc4409c6eb188765329508cf1f24c0eb12d1e	2023-04-06 04:49:15 -04:00
mkadavil	27a9e2a0ff	u8s8s32 fringe kernel optimizations. -The n fringe micro kernels uses only a few zmm registers for computing the output (eg: 6x16 uses 6 zmm registers for output as opposed to 24 used in 6x64). This results in lot of wasted registers that if utilized can help increase the MR dimension and thus improve the reuse of registers loaded with B. Based on this concept, the existing n fringe kernels are modified (6x16 -> 12x16, 6x32 -> 9x32). It is to be noted that the maximum number of registers are not used, since it results in cache inefficient code due to the increase in MR and thus more broadcasts required from unpacked A matrix. -Compiler flag updates for AOCC build to generate loops with 64 byte alignment. This has been observed to improve performance slightly when k dimension is small. AMD-Internal: [CPUPL-3173] Change-Id: I199ce75ef71d994ffe0067dac1ed804dce1742ca	2023-04-03 05:35:18 -05:00
Meghana Vankadari	34a17e78ec	Added a missing override function declaration for amdzen config AMD-Internal: [CPUPL-3096] Change-Id: I5ce1a0f8489d09a7934af89efb2aede564c24c7e	2023-03-29 02:32:49 -04:00
Mangala V	245fdf072c	AVX-512 based col-preferred kernels for ZGEMM in native path - Kernel block size is 12x4 - Updated the zen4 config to enable these kernels in zen4 path. - Tuned MC,KC,NC for better performance for m/n/k size > 500 - Updated CMakeLists.txt with ZGEMM kernels for windows build. Kernel supports: 1. Preload and prebroadcast of A and B 2. Prefetch of C Matrix 3. K loop is subdivided into multiple loops to maintain distance between C prefetches. 4. Special case when alpha/beta imag component is zero 5. Row/Col/General stride of Matrix C AMD-Internal: [CPUPL-2998] Change-Id: I62e3c352d475b1add3f43270805fbcee00e2e440	2023-03-28 23:05:06 -04:00
Harsh Dave	f5dc3db648	Added AVX512 8xk packing kernel AVX512 optimised kernel for Double datatype supports row and column major matrix Packing kernel is column major implementation If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store AMD-Internal: [CPUPL-2966] Change-Id: I8e43f1e2b562c382f44278cd47b3d1e84a4d24c9	2023-03-27 23:18:32 -05:00
Mangala V	62d63eb1ba	AVX-512 based 4xk and 12xk packing kernel for dcomplex AVX512 packing kernel supports: 1. Dcomplex datatype 2. Row and column major matrix AVX512 packing kernel does not support: 1. General stride matrix 2. Fringe cases (only multiples of 4 or 12 are supported) 3. Conjugate is not supported scal2m will be used for above unsupported functionality AVX512 packing kernel is column preferred kernel If matrix is row major, we need to transpose block before storing it. If matrix is column major, we directly store it AMD-Internal: [CPUPL-3088] Change-Id: I3fcd94248a3a6527c807cccc1b3408db9fe2a737	2023-03-28 00:19:08 +05:30
Meghana Vankadari	31a4203c32	Added AVX-512 based col-preferred kernels for DGEMM with optional pack framework - Main kernel is of size 24x8 and the associated fringe kernels added are - 24x7m, 24x6m, 24x5m, 24x4m, 24x3m, 24x2m, 24x1m - 24x8, 24x7, 24x6, 24x5, 24x4, 24x3, 24x2, 24x1 - 16x8, 16x7, 16x6, 16x5, 16x4, 16x3, 16x2, 16x1 - 8x8, 8x7, 8x6, 8x5, 8x4, 8x3, 8x2, 8x1 - For fringe kernels, 24x? kernel handles 16 < m_remainder < 24 16x? kernel handles 8 < m_remainder <= 16 8x? kernel handles 0 < m_remainder <= 8 - Added a function 'bli_zen4_override_gemm_blkszs' to override blocksizes and kernels to be used for SUP for supported storage schemes. - Updated the zen4 config to enable these kernels in zen4 path. - Thresholds are yet to be derived. - Updated CMakeLists.txt with DGEMM SUP kernels for windows build. Kernel-specific details: - K-loop is unrolled by 8 times to facilitate prefetch of B. - For every load of one column of A, the corresponding column in next panel of A is prefetched with T1 hint. - One column of C is prefetched with T0 hint per iteration of LOOP2. - TAIL_NITER is derived to be 3. - For every unroll of k-loop, one row of B is prefetched with T0 hint. - C-prefetching for row-storage is yet to be added. - B-prefetching for col-storage is yet to be added. - Support for C transpose is yet to be added. AMD-Internal: [CPUPL-2755], [CPUPL-2409] Change-Id: Ie240c893469032dc2271cbfe00cceccfe6c4ea48	2023-03-24 06:40:36 +00:00
Shubham	323e31649f	Added AVX512 8x24 DTRSM native path kernels - Added DGEMM and DTRSM row preferred micro kernels. - DTRSM left lower and left upper micro kernels are added. - DGEMM kernel is optimized for both row stored C and col stored C. AMD-Internal: [CPUPL-2745] Change-Id: Iecd2c1b0b0972e17e7b31e4b117e49c90def5180	2023-03-23 00:43:15 -04:00
mkadavil	3d74b62e60	Lpgemm threading and micro-kernel optimizations. -Certain sections of the f32 avx512 micro-kernel were observed to slow down when more post-ops are added. Analysis of the binary pointed to false dependencies in instructions being introduced in the presence of the extra post-ops. Addition of vzeroupper at the beginning of ir loop in f32 micro-kernel fixes this issue. -F32 gemm (lpgemm) thread factorization tuning for zen4/zen3 added. -Alpha scaling (multiply instruction) by default was resulting in performance regression when k dimension is small and alpha=1 in s32 micro-kernels. Alpha scaling is now only done when alpha != 1. -s16 micro-kernel performance was observed to be regressing when compiled with gcc for zen3 and older architecture supporting avx2. This issue is not observed when compiling using gcc with avx512 support enabled. The root cause was identified to be the -fgcse optimization flag in O2 when applied with avx2 support. This flag is now disabled for zen3 and older zen configs. AMD-Internal: [CPUPL-3067] Change-Id: I5aef9013432c037eb2edf28fdc89470a2eddad1c	2023-03-16 11:44:51 +05:30
Harihara Sudhan S	299bed3fa8	Vectorized SCAL2V for double complex - Added SCAL2V kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero or incx <= 0 or incy <= 0. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. - Added the new SCAL2V file to the CMake list. AMD-Internal: [CPUPL-2773] Change-Id: I2debbfab31d41347786c3a1bae5723d092c202e9	2023-02-28 06:59:23 -05:00
Shubham	1faee9f89e	Fixed 32xk AVX512 double precision pack kernel - Currently the pointer received as function argument is used for packing which causes only a partial copy of input buffer to output buffer due to strange optimizations by compiler. - To fix this, instead of using a normal pointer for output buffer, we define a "restrict" local pointer variable. - "restrict" keyword tells the compiler that the pointer is the only way to access the object pointed by the pointer. - By defining "restrict" local pointer pointing to output buffer, the mysterious problem of incomplete copy has been solved. Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646	2023-02-20 21:18:54 -05:00
Arnav Sharma	46965dfc57	Developed 6x64 SGEMM row-preferred kernels - Added kernels for all rv and rd variants. - Main kernel is of size 6x64, and the associated fringe kernels added are - 4x64, 2x64, 1x64 - 6x32, 4x32, 2x32, 1x32 - 6x16, 4x16, 2x16, 1x16 - Updated the zen4 config to enable these kernels in zen4 path. - Added C-prefetching to 6x? row-stored main kernels. - C-prefetching for column storage yet to be added. - K-loop unrolling for fringe kernels yet to be added. AMD-Internal: [CPUPL-3002] Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4	2023-02-19 23:56:58 -05:00
Harihara Sudhan S	535b49f150	Vectorized COPYV for double complex - Added ZCOPYV kernel that uses AVX2 and SSE instructions for vectorization. - The routine returns early when the vector dimension is zero. - The kernel takes one among the two available paths based on conjugation requirement of X vector. - VZEROUPPER is added before transitioning from AVX2 to SSE. - Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts. AMD-Internal: [CPUPL-2773] Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0	2023-02-16 09:32:10 -05:00
Harihara Sudhan S	8d7593a940	AVX2 vectorized SCALV for double complex - Added ZSCALV that uses AVX2 and SSE instructions for vectorization. - Return early when the vector dimension is zero. When alpha is 1 there is no need to perform computation hence return early. - When alpha is zero expert interface of ZSETV is invoked. In this case, all the elements of the input vector are set to 0. - Invocation of expert interface means that NULL pointer can be passed to the function in place of context. Expert interface of ZSETV will query the context and get the appropriate function pointer. - Added BLAS interface for ZSCALV. The architecture ID is used to decide the function that is to be invoked. - Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV BLAS macro interface only for single complex type and single complex, float mixed type AMD-Internal: [CPUPL-2773] Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050	2023-02-05 09:36:12 -05:00
mkadavil	d2713d3dc0	GCC compiler optimization flag (kernel) update for zen3 and zen4 config. -Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics code) when compiled using gcc. The presence of -fschedule-insns + -fschedule-insns2 + -ftree-pre in O2 compiler optimization flags results in the code being optimized to reduce data stalls, and results in the usage of stack to store intermediate C register output. Disabling -ftree-pre in gcc fixes the issue, even in the presence of the other two flags. AMD-Internal: [CPUPL-2971] Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463	2023-02-02 23:58:14 -05:00
Kiran Varaganti	201db7883c	Integrated 32x6 DGEMM kernel for zen4 and its related changes are added. Details: - Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path. Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and implementing these optimizations. - In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from memory, since the elements of Bpack are stored contiguously, the first broadcast fetches the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast and interleave broadcast operation with FMAs to hide any memory latencies. - Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..) when executing trsm. we call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}. Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced dgemm kernel for TRSM with 16x14 dgemm kernel. - New packm kernels - 16xk, 24xk and 32xk are added. - New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is enabled for zen4 config (bli_dpackm_32xk_zen4_ref() ) - Copyright year updated for modified files. - cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h - [SWLCSG-1374], [CPUPL-2918] Change-Id: I576282382504b72072a6db068eabd164c8943627	2023-01-19 23:11:36 +05:30
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Vignesh Balasubramanian	dbd0c069d4	Implemented 3x4 RD kernels for ZGEMM in SUP path -Implemented (r)ow preferential (d)ot product milli-kernels (m and n variants) for dcomplex datatype along SUP path. -These computational kernels extend the support for handling RRC and CRC storage schemes along the SUP path. In case of BLAS api call, it corresponds to the input cases with transa equal to T and transb equal to N. -In case of the B matrix being packed(conditionally), the inputs are redirected to the existing (r)ow preferential (v)ector load optimized kernels due to better performance. -Added macro for vhsubpd assembly instruction, to support the arithmetic for complex datatype in its interleaved storage. AMD-Internal: [CPUPL-2593] Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8	2022-12-22 18:31:46 +05:30
Harihara Sudhan S	42d631bced	Copyright modification - Added copyright information to modified/newly created files missing them Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71	2022-10-14 12:43:35 +05:30
Edward Smyth	991f2ec0e8	BLIS: Errors in Netlib LAPACK Tests Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in the netlib LAPACK tests, specifically in: ./xlintstd < ../dtest.in > dtest.out in the TESTING/LIN directory. Given time constraints, i.e. the need to finalize code for AOCL 4.0 release, disable calls to AVX512 kernel (i.e. always use the AVX2 kernel) for now, and aim to correct bli_damaxv_zen_int_avx512 for AOCL 4.1. AMD-Internal: [CPUPL-2590] Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572	2022-10-03 10:55:49 +00:00
Shubham Sharma	2d72d4c862	DTRSM Native path AVX 512 kernels are developed - Added DTRSM AVX512 kernels for lower and upper variants in the native path. - Changes in framework are made to accommodate these kernels. AMD-Internal: [CPUPL-2588] Change-Id: I1f74273ef2389018343c0645870290373ce25efe	2022-10-01 15:58:24 +00:00
mkadavil	f4702debb9	Zen4 compilation flag updates to support low precision gemm. - BFloat16 flags added to zen4 make_defs in order to enable compilation of low precision gemm by using zen4 config. - Avoid -ftree-partial-pre optimization flag with gcc due to non optimal code generation for intrinsics based kernels in low precision gemm. - Enable only Zen3 specific low precision gemm kernels (s16) compilation when aocl_gemm addon is compiled on Zen3 machines. AMD-Internal: [CPUPL-1545] Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f	2022-09-29 08:19:40 -04:00
Dipal M. Zambare	17694b6ca5	Enabled AVX512 compiler flags for the reference kernels. - Updated zen4 configuration to enable AVX512 flags for the reference kernels - Reference and vector kernels will use the same compiler flags AMD-Internal: [CPUPL-2533] Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8	2022-09-23 11:39:33 +05:30
Eleni Vlachopoulou	a5891f7ead	Adding AVX2 support for DNRM2 - For the cases where AVX2 is available, an optimized function is called, based on Blue's algorithm. The fallback method based on sumsqv is used otherwise. - Scaling is used to avoid overflow and underflow. - Works correctly for negative increments. AMD-Internal: [CPUPL-2551] Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c	2022-09-20 06:05:01 -04:00
Dipal M Zambare	6b71fea1f4	GCC 8 support for zen4 (and amdzen) configuration - Added check for GCC version 8, - Added AVX512 compiler flags needed for zen4 build using GCC 8. AMD-Internal: [CPUPL-2494] Change-Id: I7fd72e4b197fdd754633f674a8b87f01da8dd320	2022-09-06 16:58:02 +05:30
Dipal M Zambare	d3b503bbf2	Code cleanup and warnings fixes - Removed all compiler warnings as reported by GCC 11 and AOCC 3.2 - Removed unused files - Removed commented and disabled code (#if 0, #if 1) from some files AMD-Internal: [CPUPL-2460] Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a	2022-08-29 15:15:40 +05:30
Vignesh Balasubramanian	cf31fcd020	Fine tuned threshold and aocl dynamic for zgemm for skinny matrices. -Updated optimal threads in zgemm sup path for skinny matrices. -Fine tuned the threshold values for small and sup paths to improve overall zgemm. -Zgemm small is selected for inputs with transb as N. -Redirection of input among small, sup and native path was fine tuned. AMD-Internal : [CPUPL-1900] Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f	2022-08-19 01:19:14 -04:00
Arnav Sharma	a226e54421	AVX512 based SGEMM Optimizations - Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 Native SGEMM kernel. AMD-Internal: [CPUPL-2385] Change-Id: I1feae5ac79e960c6b26df24756d460243820b797	2022-08-12 02:33:39 -04:00
Dipal M Zambare	5d617429f4	Enabled znver4 support for GCC version >= 12 - Updated zen4 configuration to add -march=znver4 flag in the compiler options if the gcc version is above or equal to 12 AMD-Internal: [CPUPL-1937] Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f	2022-07-22 12:46:15 +05:30
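
The fringe-kernel ranges listed in commit 31a4203c32 (24x? for 16 < m_remainder < 24, 16x? for 8 < m_remainder <= 16, 8x? for 0 < m_remainder <= 8) can be read as a simple dispatch on the leftover row count. The following is only a minimal sketch of that rule, not code from the BLIS sources: the function name `fringe_mr_for` is hypothetical, and taking m_remainder as m modulo the 24-row main kernel height is an assumption.

```c
#include <assert.h>

/* Hypothetical helper (not a BLIS function): given the total row count m,
   return the row dimension of the fringe kernel that would handle the rows
   left over after full 24-row tiles, or 0 if nothing is left over.
   Assumption: m_remainder = m % 24, matching the 24x8 main kernel. */
int fringe_mr_for(int m)
{
    assert(m >= 0);
    int m_remainder = m % 24;

    if (m_remainder == 0)  return 0;   /* no fringe rows                 */
    if (m_remainder <= 8)  return 8;   /* 0 < m_remainder <= 8   -> 8x?  */
    if (m_remainder <= 16) return 16;  /* 8 < m_remainder <= 16  -> 16x? */
    return 24;                         /* 16 < m_remainder < 24  -> 24x? */
}
```

For example, m = 41 leaves 17 rows after one full tile, so under these assumptions the 24x? fringe kernel would be chosen.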

552 Commits