Commit Graph

2733 Commits

Author SHA1 Message Date
Edward Smyth
d69473c8f7 Update zen4 test in bli_cpuid.c
Remove test on is_arch in bli_cpuid_is_zen4() so it will report true for
all Zen4 models, based on family and AVX512 feature tests.

AMD-Internal: [CPUPL-2474]
Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72
2022-09-01 05:12:24 -04:00
eashdash
32a9e735f1 BF16 Output downscaling functionality
- The output of BF16 instructions is accumulated at the higher
precision of FP32, which needs to be converted to the lower precision
of bf16 after the GEMM operations. This is required in AI workloads
where both input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.

Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
2022-08-30 13:46:09 -04:00
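The FP32-to-bf16 conversion described above can be sketched in plain C (a minimal illustration, not the BLIS microkernel; the helper name is made up). bf16 keeps the upper 16 bits of the IEEE-754 single-precision encoding, so the downscale amounts to a round-to-nearest-even truncation of the FP32 accumulator:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert one FP32 accumulator value to bf16
 * using round-to-nearest-even. NaN handling is omitted for brevity. */
static uint16_t fp32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));      /* reinterpret bits without UB */
    uint32_t lsb = (u >> 16) & 1u;  /* tie-break toward even */
    u += 0x7FFFu + lsb;             /* round the discarded 16 bits */
    return (uint16_t)(u >> 16);     /* keep sign, exponent, top mantissa */
}
```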
Harihara Sudhan S
5faab43e66 Downscaling as part of u8s8s16os16
- int16 C matrix intermediate values are converted to int32,
  then the int32 values are converted to fp32. Scaling is done
  on these fp32 values.
- The resultant value is downscaled to int8 and stored in a
  separate buffer.

AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
2022-08-30 13:41:36 -04:00
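The widen-scale-saturate sequence described above can be sketched per element (a hedged stand-in for the vectorized kernel; the function name and round-half-away rounding are assumptions, not the BLIS implementation):

```c
#include <stdint.h>

/* Hypothetical per-element downscale: int16 accumulator -> int32 ->
 * fp32, scale, then saturate down to the int8 output range. */
static int8_t downscale_s16_to_s8(int16_t acc, float scale)
{
    int32_t wide   = (int32_t)acc;          /* int16 -> int32 */
    float   scaled = (float)wide * scale;   /* int32 -> fp32, then scale */
    /* round to nearest (half away from zero) without libm */
    int32_t r = (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    if (r > INT8_MAX) r = INT8_MAX;         /* saturate to int8 */
    if (r < INT8_MIN) r = INT8_MIN;
    return (int8_t)r;
}
```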
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
satish kumar nuggu
d62f12a18a Fixed bug in DZGEMM
Details:
1. In zen4, the dgemm and sgemm native kernels are column-preferring,
while the cgemm and zgemm kernels are row-preferring; zen3 and older
architectures use row-preferring kernels for all datatypes. The induced
transpose was carried out based on a kernel preference check in both the
crc and ccr cases, but no kernel preference check is required to apply
the induced transpose.
2. In the ccr case, the B matrix is real. We must induce a transposition
and perform C += A*B (crc), where A (formerly B) is real.
3. In the crc case, the A matrix is real. No induced transpose is
required.

AMD-Internal: [CPUPL-2440] [CPUPL-2449]
Change-Id: I44c53a20c8def7ddbb84797ba20260acec2086a2
2022-08-30 09:34:05 -04:00
satish kumar nuggu
99dc9066fb Fix in ZTRSM Small MT
Details:
1. Changes are made in the ztrsm small MT path to avoid accuracy issues
reported in libflame tests.

AMD-Internal: [CPUPL-2476]
Change-Id: Ic279106343fb1744e89ff4c920023adbe1d0158a
2022-08-30 18:02:19 +05:30
Dipal M Zambare
5c42afada8 Revert "CBLAS/BLAS interface decoupling for level 3 APIs"
This reverts commit d925ebeb06.

Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3
2022-08-30 14:50:38 +05:30
eashdash
e674fae758 Post-Ops for bf16bf16f32
Functionality - Post-ops are a set of operations performed element-wise
on the output matrix after the GEMM operation. Support for them is
added by fusing post-ops with the GEMM operations.

- Post-ops Bias, ReLU and Parametric ReLU are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add a check for bf16 ISA support

Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
2022-08-30 08:14:14 +00:00
mkadavil
a7d1cc7369 Multi-Threading support for BFloat16 gemm.
-OpenMP-based multi-threading support added for BFloat16 gemm.
Both the gemm and reorder APIs are parallelized.
-Multi-threading support for the u8s8s16 reorder API.
-Typecast issues fixed for bfloat16 gemm kernels.

AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
2022-08-30 02:54:19 -04:00
Dipal M Zambare
7e42b3d2e0 Revert "CBLAS/BLAS interface decoupling for level 2 APIs"
This reverts commit 192f5313a1.

Change-Id: I876cad90902970ebc61550f109eb0ce32539ea1c
2022-08-30 11:53:46 +05:30
Dipal M Zambare
6cff8b030e Revert "CBLAS/BLAS interface decoupling for level 1 APIs"
This reverts commit 95169ca806.

Change-Id: Ic441aca616be6f27c7f1ba64e4480edcc6b17632
2022-08-30 11:34:34 +05:30
Nallani Bhaskar
ea79efa915 Fixed out of bound memory access in sgemmsup zen rv kernels
Details:

1. In the sgemmsup_zen_rv_?x2 kernels the "vmovups" instruction
   is used to load the B matrix in the k loop and k last loop,
   which loads 128 bits into an xmm register rather than the
   expected 64 bits.

2. Changed the vmovups instruction to vmovsd instructions,
   which load only 64 bits into the xmm register.

3. Avoided C memory access by the vfma instruction when multiplying
   with non-zero beta at corner cases, which required 128-bit access
   and leads to out-of-bound reads. Replaced with vmovq first to
   get the 64-bit data, then performed vfma on the xmm register in
   rv_6x8m and rv_6x4m.

   AMD-Internal: [CPUPL-2472]

Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
2022-08-30 01:13:14 -04:00
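The corner-case fix in point 3 can be sketched portably in C (an illustration of the idea only; names are invented and the scalar code stands in for the vmovq + vfma sequence). The key is that exactly 64 bits of C are read and written, never a full 128-bit chunk:

```c
#include <string.h>

/* Hypothetical 2-element beta update: read exactly two C floats
 * (64 bits, the vmovq-style load), fuse multiply-add, and write
 * exactly 64 bits back -- no out-of-bound access at matrix edges. */
static void c_update_2elem(float *c, const float ab[2], float beta)
{
    float tmp[2];
    memcpy(tmp, c, 2 * sizeof(float));   /* 64-bit load only        */
    tmp[0] = tmp[0] * beta + ab[0];      /* scalar stand-in for vfma */
    tmp[1] = tmp[1] * beta + ab[1];
    memcpy(c, tmp, 2 * sizeof(float));   /* 64-bit store only       */
}
```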
Dipal M Zambare
40c71dd2e1 Revert "CBLAS/BLAS interface decoupling for swap api"
This reverts commit 2beaa6a0e6.

Reverting it as it is planned for the next release.

Change-Id: Ib9271acd0b5b4cfd10c8f8b7bbb6ef93a3d594ea
2022-08-30 10:10:06 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
Edward Smyth
9b28371f45 BLIS: BLAS3 quick return functionality
Bugfix for http://gerrit-git.amd.com/q/I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c changes in frame/compat/bla_gemm.c

AMD-Internal: [CPUPL-2373]
Change-Id: Ia58b76f060eda204bba68be6730105f4d8db7537
2022-08-29 08:17:50 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
jagar
2beaa6a0e6 CBLAS/BLAS interface decoupling for swap api
-   In BLIS the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API 'cblas_dgemm'
    internally invokes the BLAS API 'dgemm_'.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn't allow it.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
2022-08-29 16:26:01 +05:30
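The decoupling pattern described above can be sketched with a hypothetical scalar routine (invented names, not the real BLIS symbols): both entry points call a shared implementation directly, so the CBLAS wrapper no longer routes through the BLAS symbol and the two interfaces can come from different libraries:

```c
#include <stddef.h>

/* New abstraction level: the actual implementation. */
static void dscal_impl(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* BLAS-style wrapper (Fortran convention: arguments by pointer). */
void dscal_(const int *n, const double *alpha, double *x)
{
    dscal_impl((size_t)*n, *alpha, x);
}

/* CBLAS-style wrapper: invokes the implementation, not dscal_. */
void cblas_dscal(int n, double alpha, double *x)
{
    dscal_impl((size_t)n, alpha, x);
}
```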
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
jagar
95169ca806 CBLAS/BLAS interface decoupling for level 1 APIs
-   In BLIS the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API 'cblas_dgemm'
    internally invokes the BLAS API 'dgemm_'.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn't allow it and
    may result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
2022-08-29 14:28:20 +05:30
Harihara Sudhan S
326d8a557f Performance regression in u8s8s16os16
- Performance of u8s8s16os16 came down by 40% after the
  introduction of post-ops
- Analysis revealed that the target compiler assumed a false
  dependency and was generating sub-optimal code due to the
  post-ops structure
- Inserted vzeroupper to hint to the compiler that no ISA change
  will occur

AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
2022-08-29 03:20:02 -04:00
jagar
192f5313a1 CBLAS/BLAS interface decoupling for level 2 APIs
-   In BLIS the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API 'cblas_dgemm'
    internally invokes the BLAS API 'dgemm_'.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn't allow it and
    may result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8380b6468683028035f2aece48916939e0fede8a
2022-08-29 09:47:19 +05:30
Chandrashekara K R
d925ebeb06 CBLAS/BLAS interface decoupling for level 3 APIs
->In BLIS the CBLAS interface is implemented as a wrapper around
  the BLAS interface. For example, the CBLAS API 'cblas_dgemm'
  internally invokes the BLAS API 'dgemm_'.
->If the end user wants to use different libraries for CBLAS
  and BLAS, the current implementation of BLIS doesn't allow it and
  may result in recursion.
->This change separates the CBLAS and BLAS implementations by adding
  an additional level of abstraction. The implementation of the
  API is moved to a new function which is invoked directly from
  the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]

Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
2022-08-26 05:54:29 -04:00
satish kumar nuggu
2114a43df8 Fixes to avoid Out of Bound Memory Access in TRSM small algorithm
Details:
1. Fixed the issues corresponding to out-of-bound memory access during
load and store.
2. In the intrinsic code:
   i. AVX2 registers can hold 4 double elements.
   ii. In the remainder case, the number of elements is less than the
vector register width. Though the required number of elements is less
than 4, we were reading and writing in chunks of 4 elements due to
vectorization. This might cause out-of-bound memory access.
3. Redesigned the code to prevent out-of-bound access by loading and
storing the exact number of elements required.

AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
2022-08-26 01:06:10 -04:00
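The redesigned remainder handling can be illustrated with scalar stand-ins for masked vector loads and stores (hypothetical helpers, not the BLIS intrinsic code): exactly m elements are touched, with the unused register lanes zero-filled:

```c
/* Hypothetical partial load: fill a 4-wide "register" reading only
 * the m (< 4) elements that actually exist in the buffer. */
static void load_partial_4(const double *src, int m, double reg[4])
{
    for (int i = 0; i < 4; i++)
        reg[i] = (i < m) ? src[i] : 0.0;  /* never read past src[m-1] */
}

/* Hypothetical partial store: write back only m elements. */
static void store_partial_4(double *dst, int m, const double reg[4])
{
    for (int i = 0; i < m; i++)
        dst[i] = reg[i];                  /* never write past dst[m-1] */
}
```

In the real AVX2 kernels this corresponds to masked moves (e.g. vmaskmovpd) or element-wise loads in place of full 256-bit accesses.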
Harihara Sudhan S
5ca632e0f0 Added API to check for BF16 ISA support
- Checking for AVX512 bfloat16 instruction support in the
  architecture using CPUID

AMD-Internal: [CPUPL-2446]
Change-Id: I088a8aa46b037af837b2e58a96b59eae70c1dbf0
2022-08-25 11:00:35 -04:00
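The feature test described above boils down to decoding CPUID feature bits. A minimal sketch of the bit checks (decoding only; actually issuing CPUID is platform-specific and omitted here): AVX512F is reported in leaf 7 sub-leaf 0 EBX bit 16, and AVX512_BF16 in leaf 7 sub-leaf 1 EAX bit 5:

```c
#include <stdint.h>

/* Decode AVX512F from CPUID.(EAX=7,ECX=0):EBX bit 16. */
static int has_avx512f(uint32_t leaf7_0_ebx)
{
    return (leaf7_0_ebx >> 16) & 1u;
}

/* Decode AVX512_BF16 from CPUID.(EAX=7,ECX=1):EAX bit 5. */
static int has_avx512_bf16(uint32_t leaf7_1_eax)
{
    return (leaf7_1_eax >> 5) & 1u;
}
```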
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
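The parametric ReLU post-op described above is, per element, a one-liner (illustrative helper, not the fused register-level kernel): leaky ReLU where the leakage coefficient alpha is tunable, with alpha = 0 reducing to plain ReLU:

```c
/* Per-element parametric ReLU: pass positives through, scale
 * negatives by the tunable leakage coefficient alpha. */
static float prelu(float x, float alpha)
{
    return x > 0.0f ? x : alpha * x;
}
```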
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and perform BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
satish kumar nuggu
219c41ded9 ZTRSM Improvements
Details:

1. Optimized ztrsm for small sizes up to 500 in multi-thread scenarios.
2. Enabled multithreaded execution of the bli_trsm_small implementation
for the double complex data type.
3. Added decision logic to choose between the native and multi-threaded
small paths for sizes up to 500 and thread counts up to 8.

AMD-Internal: [CPUPL-2340]
Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1
2022-08-20 05:08:47 -04:00
satish kumar nuggu
0b81f53074 Fixed bug in DZGEMM
1. In zen4, the dgemm and sgemm native kernels are column-preferring,
while the cgemm and zgemm native kernels are row-preferring; zen3 and
older architectures use row-preferring kernels for all datatypes, hence
the induced transpose was carried out based on a kernel preference
check alone. Added a condition check: the output matrix storage format
needs to be checked along with the kernel preference to avoid the
induced transpose on zen4.

2. Added functions bli_cntx_l3_vir_ukr_dislikes_storage_of_md,
bli_cntx_l3_vir_ukr_prefers_storage_of_md for checking output matrix
storage format and micro kernel preference of mixed datatypes.

AMD-Internal: [CPUPL-2347]
Change-Id: Ib77676f4e2152f7876ad7dc91de716547f5ba3a5
2022-08-19 12:37:01 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
Arnav Sharma
035ed98b51 Temporarily disabling optimized ZHER
- Disabling optimized ZHER pending verification with netlib BLAS test.

AMD-Internal: [CPUPL-2416]
Change-Id: I74c4d16e1c99ddeb1df91130a8e14feafd0952d0
2022-08-19 12:16:46 -04:00
Edward Smyth
6861fcae91 BLIS: Improve architecture selection at runtime
Make BLIS_ARCH_TYPE=0 an error, so that incorrect meaningful names
produce an error rather than selecting the "skx" code path.
BLIS_ARCH_TYPE=1 is now "generic", so that it should remain constant
as new code paths are added. Thus all other code path enum values have
increased by 2.

Also added new options to BLIS configure program to allow:
1. BLIS_ARCH_TYPE functionality to be disabled, e.g.:

./configure --disable-blis-arch-type amdzen

2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a
   specified value, e.g.:

./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen

On Windows, these can be enabled with e.g.:

cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON

or

cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE

This implements changes 2 and 3 in the Jira ticket below.

AMD-Internal: [CPUPL-2235]
Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4
2022-08-19 10:59:35 -04:00
Sireesha Sanga
22af681a11 Runtime Thread Control Feature Update
Details:
1.  The Runtime Thread Control Feature is enhanced to create a provision
    for the application to allocate a different number of threads to
    BLIS than the number of threads the application uses for itself.

2.  In the previous implementation, if application sets BLIS_NUM_THREADS
    with a valid value, BLIS internally calls omp_set_num_threads() API with
    same value. Due to this, application could not differentiate between
    the number of threads used in BLIS library and the application.

3.  With the current solution, if Application wants to allocate
    different number of threads for BLIS API and application, Application
    can choose either BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API for BLIS,
    and OpenMP APIs or environment variables for itself, respectively.

4.  If BLIS_NUM_THREADS is set with a valid value, same value
    will be used in the subsequent parallel regions unless
    bli_thread_set_num_threads() API is used by the Application
    to modify the desired number of threads during BLIS API execution.

5.  Once BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API is used by the application,
    BLIS module would always give precedence to these values. BLIS API would
    not consider the values set using OpenMP API omp_set_num_threads(nt) API
    or OMP_NUM_THREADS environment variable.

6.  If BLIS_NUM_THREADS is not set, then if Application is multithreaded and
    issued omp_set_num_threads(nt) with desired number of threads,
    omp_get_max_threads() API will fetch the number of threads set earlier.

7.  If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called
    by the application, but only OMP_NUM_THREADS is set,
    omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS.

8.  If both environment variables are not set, or if they are set with
    invalid values, and omp_set_num_threads(nt) is not issued by
    application, omp_get_max_threads() API will return the number of the
    cores in the current context.

9.  BLIS will initialize rntm->num_threads with the same value.
    However, if omp_set_nested is false, BLIS APIs called from parallel
    threads will run sequentially. But if nested parallelism is
    enabled, then each application thread will launch MT BLIS.

10. Order of precedence used for number of threads:
      0. value set using bli_thread_set_num_threads(nt) by the application
      1. valid value set for BLIS_NUM_THREADS environment variable
      2. omp_set_num_threads(nt) issued by the application
      3. valid value set for OMP_NUM_THREADS environment variable
      4. Number of cores

11. If nt is not a valid value for omp_set_num_threads(nt) API,
    number of threads would be set to 1.
    omp_get_max_threads() API will return 1.

12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled.

AMD-Internal: [CPUPL-2342]
Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6
2022-08-19 05:43:01 -04:00
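The precedence order in point 10 can be sketched as a simple selection function (a hypothetical helper for illustration, not the BLIS API; each source is passed as 0 when unset or invalid):

```c
/* Pick the thread count per the documented precedence:
 *   0. bli_thread_set_num_threads(nt) set by the application
 *   1. BLIS_NUM_THREADS environment variable
 *   2. omp_set_num_threads(nt) issued by the application
 *   3. OMP_NUM_THREADS environment variable
 *   4. number of cores in the current context */
static int choose_num_threads(int set_api_nt,  /* source 0 */
                              int blis_env_nt, /* source 1 */
                              int omp_api_nt,  /* source 2 */
                              int omp_env_nt,  /* source 3 */
                              int num_cores)   /* source 4 */
{
    if (set_api_nt  > 0) return set_api_nt;
    if (blis_env_nt > 0) return blis_env_nt;
    if (omp_api_nt  > 0) return omp_api_nt;
    if (omp_env_nt  > 0) return omp_env_nt;
    return num_cores;
}
```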
Vignesh Balasubramanian
cf31fcd020 Fine tuned threshold and aocl dynamic for zgemm for skinny matrices.
-Updated optimal threads in zgemm sup path for skinny matrices.
-Fine tuned the threshold values for small and sup paths
 to improve overall zgemm.
-Zgemm small is selected for inputs with transb as N.
-Redirection of input among small, sup and native path
 was fine tuned.

AMD-Internal : [CPUPL-1900]

Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f
2022-08-19 01:19:14 -04:00
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
Edward Smyth
7f322da01d BLIS: BLAS3 quick return functionality
Implement netlib BLAS style quick return functionality for when no
work is required. Similar functionality was already in HERK and HER2K
routines.

AMD copyrights updated.

AMD-Internal: [CPUPL-2373]
Change-Id: I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c
2022-08-18 04:09:56 -04:00
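The netlib-style quick-return condition mentioned above can be sketched for GEMM (a minimal predicate mirroring the reference BLAS check; the helper name is invented): no work is required when a dimension is zero, or when the alpha*op(A)*op(B) term contributes nothing (alpha == 0 or k == 0) while beta == 1 leaves C unchanged:

```c
/* Returns nonzero when C := alpha*op(A)*op(B) + beta*C is a no-op,
 * following the reference BLAS dgemm quick-return test. */
static int dgemm_quick_return(int m, int n, int k,
                              double alpha, double beta)
{
    return (m == 0 || n == 0 ||
            ((alpha == 0.0 || k == 0) && beta == 1.0));
}
```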
mkadavil
171fb7358d SGEMM Optimization
-sup GEMM has 2 variants: var2m (block-panel) and var1n (panel-block).
We added decision logic to choose between var1n and var2m for single
thread SGEMM. var1n is the favorable option when "n" is very large
compared to "m".
-Also fixed a bug related to fetching the "MR" and "NR" values in
bli_gemmsup_int(). We replaced "bli_cntx_get_blksz_def_dt()" (used for
native) with "bli_cntx_get_l3_sup_blksz_def_dt()".

AMD-Internal: [CPUPL-2406]
Change-Id: If36529015b1c5f8f87eb40c05ebcf433c471d4d5
2022-08-18 01:39:42 -04:00
Harsh Dave
46e7727ea8 DGEMM Improvements
- In case of DGEMM, when m, n and the leading dimensions are large,
  packing of the A and B matrices is required for optimal performance.

- Modified the decision logic to choose between sup vs native:
  now, apart from the matrix dimensions, we also incorporate the
  leading dimensions into this decision.

AMD-Internal: [CPUPL-2366]
Change-Id: I255db5f7049d783e22d7c912edf8bbf023e32ed8
2022-08-18 01:14:40 -04:00
Nallani Bhaskar
39196d163e Enable packing of A & B dynamically in dgemmsup.
Details:
- When the work distributed to each thread is larger than the caches,
  it is advisable to perform packing of B for sup dgemm.

- Work distribution per thread is calculated based on the values
  of jc_nt and ic_nt.

- For RRC and CRC cases we want to avoid rd kernels which are not
  efficient in performance compared to rv kernels. Therefore we
  perform packing of A as well so that rv kernels are invoked for
  these cases.

- These changes result in improved DGEMM performance.

- Dynamic packing is done using the API "bli_rntm_set_pack_b( 1, rntm )"

Change-Id: I8344520b4a2591e57518bb54183a15957f60f94b
2022-08-17 12:36:15 -04:00
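The decision described above can be sketched as a small heuristic (all names and the exact coupling of the conditions are assumptions for illustration, not the BLIS logic): pack B when the per-thread share of the work, derived from the jc_nt/ic_nt factorization, exceeds the cache, and additionally pack A in the RRC/CRC storage cases so the faster rv kernels are used instead of rd:

```c
/* Hypothetical packing decision. per_thread_bytes is the work share
 * per thread (from jc_nt and ic_nt); cache_bytes is the cache budget. */
static void decide_packing(long per_thread_bytes, long cache_bytes,
                           int is_rrc_or_crc,
                           int *pack_a, int *pack_b)
{
    *pack_b = per_thread_bytes > cache_bytes;   /* pack B when work > cache */
    *pack_a = is_rrc_or_crc && *pack_b;         /* pack A too for RRC/CRC   */
}
```

In BLIS the choice is then communicated to the framework via `bli_rntm_set_pack_b( 1, rntm )` (and its pack-A counterpart), as the commit notes.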
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
satish kumar nuggu
88e44c64e3 Fixed Memory Leaks in TRSM
1. Fixed the memory leaks in corner cases caused by extra loads in all
datatypes (s,d,c,z).
2. In remainder cases, instead of loading the required number of
elements, extra elements were loaded, which led to memory leaks. Fixed
by restricting the number of loads to the required number of elements.

AMD-Internal: [CPUPL-2280]

Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
2022-08-16 05:08:40 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
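The macro-instantiation pattern described in the first point can be sketched as follows (hypothetical names and a naive reference loop, purely to show the mechanism): one macro generates the whole family of kernel definitions, so a single header replaces the per-kernel headers:

```c
/* One macro stamps out a micro-kernel-shaped function per type
 * combination; each instantiation is a single line. */
#define GEN_LPGEMM_KERN(ctype, suffix)                               \
    void lpgemm_rowmajor_##suffix(long m, long n, long k,            \
                                  const ctype *a, const ctype *b,    \
                                  ctype *c)                          \
    {                                                                \
        for (long i = 0; i < m; i++)                                 \
            for (long j = 0; j < n; j++) {                           \
                ctype sum = 0;                                       \
                for (long p = 0; p < k; p++)                         \
                    sum += a[i * k + p] * b[p * n + j];              \
                c[i * n + j] = sum;                                  \
            }                                                        \
    }

/* One line per kernel instead of one header file per kernel. */
GEN_LPGEMM_KERN(float,  f32f32f32)
GEN_LPGEMM_KERN(double, f64f64f64)
```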
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to an error in the C output buffer address computation in
   kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory was being accessed. This caused a seg fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Arnav Sharma
a226e54421 AVX512 based SGEMM Optimizations
- Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 Native SGEMM kernel.

AMD-Internal: [CPUPL-2385]
Change-Id: I1feae5ac79e960c6b26df24756d460243820b797
2022-08-12 02:33:39 -04:00
Dipal M. Zambare
c85bbfdb50 Updated BLIS version string format
- Updated version string to match the recommended format
  "AOCL-BLIS 3.2.1 Build 20220727".
- Fixed issues with include paths which were preventing compile-time
  version string definitions from being passed via build commands.
- Removed version string determination based on the git tag
  using 'git describe'; the version string will always be taken from
  the version file.

AMD-Internal: [CPUPL-2324]
Change-Id: Idc7edf1211f66d348ec3b5b43f2507c2b810f088
2022-08-12 05:53:35 +00:00
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00
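The gemmt idea behind these kernels can be captured with a scalar reference (an illustrative sketch with an invented name, not the optimized 6x8 kernels): for a diagonal block, only the lower triangle of C := beta*C + alpha*A*B is computed and stored directly, avoiding the temporary buffer, the copy, and the wasted upper-triangle flops:

```c
/* Reference lower-triangular gemmt on an n x n diagonal block:
 * A is n x k, B is k x n, C is n x n (row-major). Only elements
 * with j <= i are computed and written. */
static void gemmt_lower_ref(int n, int k, double alpha,
                            const double *a, const double *b,
                            double beta, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j <= i; j++) {       /* lower triangle only */
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = beta * c[i * n + j] + alpha * sum;
        }
}
```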
Mangala.V
8504ef013d Optimisation of DTRSM and ZTRSM
1. Replaced the extract instruction with a cast when accessing the
   first 128 bits, as the cast instruction costs no cycles while
   extract takes a few cycles
2. Added prefetch of the A buffer when computing the gemm operation
3. Added prefetch of the C11 buffer before the TRSM operation, with an
   offset of 7 to cs_c

With the above changes, performance improvements are observed in the
single-threaded case

Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be
2022-08-11 01:39:40 -04:00
Edward Smyth
737e08cd7a BLIS: Improve architecture selection at runtime
Enable meaningful names as options for BLIS_ARCH_TYPE environment
variable. For example,
BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6
will select the same code path (in this release). The meaningful
names are not case sensitive.

This implements change 1 in the Jira ticket below.

Following review comments:
1. Use names from arch_t enum in function bli_env_get_var_arch_type()
   rather than directly using numbers.
2. AMD copyrights updated.

AMD-Internal: [CPUPL-2235]
Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad
2022-08-10 08:26:49 -04:00
Harihara Sudhan S
d1eaf65a26 Post-Ops for u8s8s16os16
Functionality - Post-ops are operations performed on every element
of the output matrix after the GEMM operation is completed.

- Post-ops ReLU and bias added to all the compute kernels
  of u8s8s16os16
- Post-ops are done on the value loaded into the register
  to avoid reloading of C matrix elements
- Minor bug fixes in the OpenMP thread decorator of lpgemm
- Added test cases to the lpgemm bench input file

AMD-Internal: [CPUPL-2171]

Change-Id: If49f763fdfac19749f6665c172348691165d8631
2022-08-09 14:52:41 +05:30
Harihara Sudhan S
60de0a1856 Multithreading and support for unpacked B matrix in u8s8s16os16
Functionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call, packing of the B matrix is done as
part of the five-loop algorithm. The state of the B matrix (packed
or unpacked) is given as a user input.

- Packing of the B matrix is done as part of the five-loop
  compute.
- A temporary buffer for packed B is allocated in the five-loop
  algorithm
- Multithreading for the computation kernel
- Configuration constants for u8s8s16os16 are part of the
  lpgemm config

AMD-Internal: [CPUPL-2171]

Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
2022-08-05 19:28:37 +05:30