amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-24 02:14:33 +00:00

Author	SHA1	Message	Date
Shubham Sharma.	2b02024404	Fixed bug in bli_zdotv_zen4_asm_avx512 kernel. - Data-type of n and conj is dim_t, which will be int32_t for LP64 case. - When loading 64-bit registers using "mov" instructions, mov(rax, var(n)), the "n" should be 64-bit, otherwise incorrect values get loaded. Fix: We typecast these variables to int64_t before loading into registers. Thanks to mangala.v@amd.com for finding this bug. Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5	2024-09-24 02:33:44 -04:00
Shubham Sharma	1833ee70cd	Fixed bug in bli_dgemm_avx512 8x24 native kernel. - Data-type of m, n, k, ldc is dim_t, which will be int32_t for LP64 case. - When loading 64-bit registers using "mov" instructions, mov(rax, var(m)), the "m" should be 64-bit, otherwise incorrect values get loaded. Fix: We typecast these variables to int64_t before loading into registers. AMD-Internal: [CPUPL-5819] Change-Id: I16043ac168a79ff9358c0c1768989a81e3c6b0e0	2024-09-23 04:54:04 -04:00
Deepak Negi	16653ed208	Added support for column major B matrix in BF16S4F32F32 reorder API. -Added new pack kernels that packs/reorders B matrix from column-major input format. This also supports the transB scenario if input B matrix is row major. Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f	2024-09-20 01:11:21 +05:30
varshav2	605517964b	Add Transpose Kernel for A matrix in F32F32f32Of32 - Implemented the AVX512 packA kernel for col major inputs in F32 API - Removed the work arounds for n = 1, mtag_a = PACK case, where the execution was being directed to GEMM instead of GEMV. Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f	2024-09-19 06:37:11 -04:00
Edward Smyth	a07e041b1f	SCALV alpha=zero BLAS compliance SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but also within many other APIs to handle special cases. In general it is preferred to use SETV when alpha=0, but BLAS and CBLAS continue to multiply all vector elements by alpha. This has different behaviour for propagating NaNs or Infs. Changes in this commit: - Standardize early returns from SCALV reference and optimized kernels. - User supplied N<0 is handled at the top level API layer. Use negative values of N in kernel calls to signify that SETV should _not_ be used when alpha=0. This should only be required in SCALV. - Include serial threshold in zdscal (as in dscal) to reduce overhead for small problem sizes. - Code tidying to make different variants more consistent. - More standardization of tests in SCALV gtestsuite programs. - Remove scalv_extreme_cases.cpp as it is now redundant. AMD-Internal: [CPUPL-4415] Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae	2024-09-16 07:10:28 -04:00
Vignesh Balasubramanian	ed682a429d	Fixed compiler warnings due to prefetch in AVX2 and AVX512 kernels - Added explicit typecast to the pointers that are passed to the _mm_prefetch( ... ) intrinsic, to avoid compiler warnings. AMD-Internal: [CPUPL-4415] Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea	2024-09-12 11:37:06 +05:30
Meghana Vankadari	5120f98e12	Developed all WoQ kernels for bf16s4f32o<f32|bf16> Description: 1. Written 6x64 main and other fringe kernels for WoQ where scaling s4 weights into bf16 is performed in the kernel itself to reduce bandwidth. 2. These kernels perform better than bf16 weights when m is small and n is large. 3. Established a threshold to do quantization support at packing of B (KCXNC) level or WoQ kernel level. Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289	2024-09-10 12:00:10 +00:00
varshav2	298a165718	Add TransA and TransB support for F32F32F32oF32 - Added support for TransA and transB in f32f32of32 APIs - Modified the GEMV case(m == 1) to support PACKB feature - Redirecting the operations to GEMM instead of GEMV in case of n == 1 conditions, with storage scheme r/transA and c/transB to avoid the packing errors which would lead to failures in computation. Change-Id: I0eb8c31485af4e33c53fd36b5e5788d75d3a67a9	2024-09-09 05:19:49 +05:30
Meghana Vankadari	687abe4c96	Bug fix in WOQ kernel for m=4 case. - Updated pre_op_off computation for nr0 < NR cases. - Fixed warnings in bench file. Change-Id: Iae30fa84b6b47ebd94ab05d2139056aee24546d7	2024-09-05 05:00:30 +00:00
Meghana Vankadari	2e1cc2f14a	Added bf16s4f32 kernels to handle m=4 cases Details: - In WOQ, if m = 4, special case kernels are added where s4->bf16 conversion happens inside the compute kernel and packing is avoided. For all other cases, B matrix is dequantized and packed at KC loop level and native bf16 kernels are re-used at compute level. - Fixes in bench to avoid accuracy failures when datatype of output is bf16. Change-Id: Ie8db42da536891693d5e82a5336b66514a50ccb2	2024-09-04 07:36:57 -04:00
Deepak Negi	6dcf500703	Element wise operations API for float(f32) input matrix in LPGEMM. This API supports applying element wise operations (eg: post-ops) on a float(f32) input matrix to get an output matrix of the same (float(f32)). Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24 AMD Internal: [SWLCSG-2947] Change-Id: I387a544f0d33d2231f5f6a92e212f17b1103dd24	2024-08-27 03:28:52 -04:00
Chandrashekara K R	2ff0125f11	CMake: Enabled ADDON(aocl_gemm) feature for Windows. 1. Updated datatype from __int64_t to int64_t. Since __int64_t was not defined for Windows 2. Updated CMake build system to build lpgemm on windows Change-Id: I5fc5ed93ecc54e4a9931b7b40b790d37c7ead4b8	2024-08-23 07:05:28 -04:00
Vignesh Balasubramanian	93631410a3	Bugfix : Fixed memory accesses in AVX512 SGEMMSUP RD kernels - Bug: Among the list of AVX512 SGEMMSUP RD kernels, the ones handling m_fringe = 3 had incorrect usage of ZMM on a vector-load instruction that strictly needed YMMs. - Further updated the existing micro-kernel test cases to simulate these issues and validate the fix. AMD-Internal: [CPUPL-5353] Change-Id: Id86e60ce36bb9f8433a1a203cfe0b8c6347df2c1	2024-08-19 17:18:31 +05:30
Vignesh Balasubramanian	fd47d0aebb	Fixing compiler warnings when exporting internal BLIS kernels - Added the attribute to export symbols, in the header file that contains the L1 kernel declarations. This attribute was previously added as part of the kernel definitions. AMD-Internal: [CPUPL-4415] Change-Id: I375246f47d53c220f885644f9b75c7d7991ae710	2024-08-12 12:30:51 +05:30
Meghana Vankadari	5514c7a75f	Added LPGEMV(n=1) kernels for s8s8s32os32|s8 and s8s8s16os16|s8 APIs - When n=1, reorder of B matrix is avoided to efficiently process data. A dot-product based kernel is implemented to perform gemv when n==1. AMD-Internal: [SWLCSG-2354] Change-Id: I6b73dfddd9a15e7b914d031646a1d913a7ab4761	2024-08-09 06:17:52 -04:00
Edward Smyth	7fff7b4026	Code cleanup: Miscellaneous fixes - Delete unused cmake files. - Add guards around call to bli_cpuid_is_avx2fma3_supported in frame/3/bli_l3_sup.c, currently assumes that non-x86 platforms will not use bli_gemmtsup. - Correct variable in frame/base/bli_arch.c on non-x86 builds. - Add guards around omp pragma to avoid possible gcc compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c. - Add missing registers in clobber list in kernels/zen4/1/bli_dotv_zen_int_avx512.c. - Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV. - Correct calls to cblas_{c,z}swap in gtestsuite. - Correct test name in ddotxf gtestsuite program. AMD-Internal: [CPUPL-4415] Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5	2024-08-06 06:56:01 -04:00
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Nallani Bhaskar	e712673ab7	Performance fixes for gcc compiler in fringe kernels Description: 1. GCC avoids loading b into registers in the m fringe kernels of the int8 kernels. Instead, gcc generates fma with memory as an operand for the B input. 2. This is causing performance regression for larger n where each fma needs to load the input from memory again and again. 3. This is observed with gcc but not with clang. 4. Inserted dummy shuffle instructions for b data to further explicitly tell the compiler that b needs to be in registers. AMD-Internal: SWLCSG-2948 Change-Id: Ibbf186fe6569e6265e2c2bb4ec3141ef323ea3e6	2024-08-05 14:31:22 -04:00
Edward Smyth	591a3a7395	Code cleanup: file formats and permissions - Remove execute file permission from source and make files. - dos2unix conversion. - Add missing eol at end of files. Also update .gitignore to not exclude build directory but to exclude any build_* created by cmake builds. AMD-Internal: [CPUPL-4415] Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124	2024-08-05 11:52:33 -04:00
mkadavil	9f5fec7713	Matrix MUL op support in element wise operations API for bfloat16. -Matrix MUL op support added in main as well as fringe bfloat16 element wise operations kernels. -Benchmarking/testing framework for the same is added. -Fixed issues in setting up post-ops node index. AMD Internal: [SWLCSG-2947, SWLCSG-2953] Change-Id: Iba7561a6a60df41211efbf06fab1b4900207bcf8	2024-08-05 08:29:42 +05:30
Deepak Negi	80bf6249f0	Matrix MUL post-operation support for float(bf16|f32) LPGEMM APIs. This post-operation computes C = (beta*C + alpha*A*B) * D, where D is a matrix with the same dimensions and data type as the C matrix. AMD-Internal: [SWLCSG-2953] Change-Id: Id4df2ca76a8f696cb16edbd02c25f621f9a828fd	2024-08-05 08:25:32 -04:00
Nallani Bhaskar	4c2f436cce	Performance fixes for gcc compiler in fringe kernels Description: 1. GCC avoids loading b into registers in the m fringe kernels of the int8 kernels. Instead, gcc generates fma with memory as an operand for the B input. 2. This is causing performance regression for larger n where each fma needs to load the input from memory again and again. 3. This is observed with gcc but not with clang. 4. Inserted dummy shuffle instructions for b data to further explicitly tell the compiler that b needs to be in registers. 5. Moved packb_s4_to_bf16 under JIT macro to resolve compilation issue with gcc version < 11.2 AMD-Internal: SWLCSG-2948 Change-Id: I5bd1bad7ad129e0dde91ed78d49a4ede3bff456a	2024-08-05 08:13:06 -04:00
mkadavil	f040ba617f	Element wise operations API for bfloat16 input matrix in LPGEMM. -This API supports applying element wise operations (eg: post-ops) on a bfloat16 input matrix to get an output matrix of the same(bfloat16) or upscaled data type (float). -Benchmarking/testing framework for the same is added. AMD Internal: SWLCSG-2947 Change-Id: I43f1c269be1a1997d4912d8a3a97be5e5f3442d2	2024-08-05 07:17:08 -04:00
Arnav Sharma	0a5c057475	DGEMV Optimizations for Tiny Sizes - Added reference kernel for dgemv that handles computation for tiny sizes (m < 8 && n < 8). - The reference kernel, bli_dgemv_zen_ref( ... ), supports both row/column storage schemes as well as transpose and no transpose cases. - Added additional unit-tests for functional verification. AMD-Internal: [CPUPL-5098] Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada	2024-08-05 12:19:42 +05:30
Vignesh Balasubramanian	9843bd0317	Tuning the decision logic to choose SUP vs Native for ZGEMM - Added an additional decision logic to choose between SUP and Native paths for zen4 and zen5 micro-architectures, based on the input dimensions. This logic has been added to the architecture-specific thresholds functions, that are registered in the context. - The decision logic will overrule the discrete thresholds present in the zen4 and zen5 contexts. AMD-Internal: [CPUPL-5547] Change-Id: I475f19b110064b3b9eef2e03bbdc21f4dd826c03	2024-08-03 19:08:07 +05:30
Shubham Sharma	0d95fcf20c	Revert "DGEMM Native AVX512 updates" This reverts commit `f378fc57b5`. Reason for revert: Causing Failure AMD-Internal: [CPUPL-5262] Change-Id: I15860eabf2461fae3d0f7cedd436d4db2df5b82f	2024-08-02 07:32:28 -04:00
Moripalli Chitra	8b486e8d14	Added new decision logic to choose between 6x8 dgemm kernel vs 24x8 kernel. The decision is based on the values of "m, n and k". Change-Id: I307ff002797ccef5bd61106b808cecb069b91fd6	2024-08-02 14:18:58 +05:30
Ruchika Ashtankar	92fbd04238	DGEMM SUP Optimizations for Turin - Introduced a new 24x8 column preferred DGEMM sup kernel for zen5. - A prefetch logic is modified compared to zen4 24x8 sup kernels. - Earlier, next panel of A is prefetched into L2 cache, which is now modified to prefetching the second next column of the current panel of A into L1 cache. - B and C prefetches are enabled and unchanged. - Tuned MC, KC and NC block sizes for new kernel. AMD-Internal: [CPUPL-5262] Change-Id: If933537e50f43f5560e0fe18a716aa1e36ced64d	2024-08-02 04:00:51 -04:00
Ruchika Ashtankar	5760e06100	Threshold tuning for DGEMM SUP for zen5 - New Decision threshold constants are added to decide between double precision sup vs native dgemm code-path for zen5 processors. - The decision is based on the values of m, n and k. AMD-Internal: [CPUPL-5262] Change-Id: I87b8ff9eb603d6fda0875e000f7ab83b22d22040	2024-08-02 11:34:32 +05:30
Vignesh Balasubramanian	4ec2bad744	Updating reduction step of AVX512 DNRM2 API - Updated the final reduction of partial sums to use scalar accumulation entirely, instead of using the _mm512_reduce_add_pd( ... ) intrinsic. This will in turn change the associativity and the rounding-off pattern in the reduction step. - Defined a union data-type to do the same, by having a 512-bit register and a double-precision array as its members. - Updated the declaration and usage of the register variable according to the union definition, for uniformity. AMD-Internal: [CPUPL-5472] Change-Id: I997464a6ec47e4054dca48a000fbd4ac0cfcc679	2024-08-01 17:09:17 +05:30
Shubham Sharma.	f378fc57b5	DGEMM Native AVX512 updates - In the initial patch - for m, n non-multiple of MR and NR respectively we are calling bli_dgemm_ker_var2. Now we have implemented macro-kernel for these fringe cases as well. - Replaced RBP register with R11 in the macro-kernel. - Retuned MC, KC and NC with these new changes. This will result in better performance for matrix sizes like m=4000 or greater when running on single thread. AMD-Internal: [CPUPL-5262] Change-Id: I66c111ceb7feee776703339680d57e8d6d5c809a	2024-07-31 12:23:34 -04:00
Vignesh Balasubramanian	f23b8e636b	AVX2 and AVX512 optimizations for DAXPYV - Removed some of the unrolling factors that affected the performance of AVX2 DAXPYV kernel. In addition to improving the current performance on sizes compatible to single-threaded runs, this will now perform better for tiny sizes as well since the overhead to reach the computation is less. - Updated the vector partitioning logic, by using bli_thread_range_sub( ... ), which ensures that there is no false sharing among multiple threads. - Updated the AOCL-DYNAMIC logic for the API, to include thresholds for zen4 and zen5 micro-architectures. AMD-Internal: [CPUPL-5514] Change-Id: Iee9edddac685334213cd6694421ab3df3547e930	2024-07-31 09:24:36 -04:00
Hari Govind S	e2e95a09b0	Fixing missing registers in end_asm for copyv APIs - Added the missing registers in end_asm for scopy, dcopy and zcopy APIs. - Removed unnecessary registers from end_asm for scopy and dcopy APIs. - Corrected mistakes in the comments. Change-Id: I5ebe2ff9cb2c72ca7c71a67419281f73462f9498	2024-07-30 15:09:52 +05:30
Meghana Vankadari	d5b4d3aa5e	Fixing control flow in aocl_gemm_bf16s4f32of32|bf16 - Fixed framework of bf16s4f32of32 API to correct pointer updates. - Modified pre_op structure to exclude pre-op-offset. Now offset is passed as a separate parameter to the scale-pack functions. - Fixed work-distribution among threads in MT scenario. - Added Blocksizes and kernel-pointers and verified functionality for the new API. AMD-Internal: [SWLCSG-2943] Change-Id: I58fece240d62c798c880a2b2b7fa64e560cc753d	2024-07-29 05:12:09 -04:00
Nallani Bhaskar	c6dd7c1b4b	Added new API in aocl_gemm to support A bf16 data type and B s4 data type Description: 1. Added a new API aocl_gemm_bf16s4f32of32 to support WoQ (Weight-only-Quantization) in LLMs 2. The API supports only reordered B matrix of data size signed 4 bits (S4). 3. Subtracting zero point and multiplying with scale on B matrix is performed in packing B. 4. zero point and scale data should be passed by user through pre-ops data structure. 5. The API is still in experimental state and NOT tested. AMD-Internal: SWLCSG-2943 Change-Id: I10b159b64c2e2aaf39da5462685618ba8cc800ee	2024-07-25 11:59:03 +00:00
Meghana Vankadari	49949f488f	Implemented on-the-go pack kernel for s4->bf16 Details: - To enable Weight-only-Quantization(WOQ) workflow, new LPGEMM APIs are added where datatypes are A: bf16, B: int4, C: f32/bf16. To support this, B matrix will be reordered with type still being int4. New pack kernels that packs the reordered B matrix after converting the data from int4 to bf16 and applying zero-point and scale are added. AMD-Internal: [SWLCSG-2943] Change-Id: Iabe23dab607913c0114b97cb2b91248babeaac03	2024-07-25 04:13:05 +05:30
mkadavil	7114376519	New kernels for int4 B matrix reordering following BF16 kernel schema. -To enable Weight-only-Quantization (WOQ) workflow, new LPGEMM APIs are required where data types are A:bf16, B:int4 and C:f32/bf16. It is expected that the BF16 kernels will be reused within this API and subsequently the B matrix needs to be reordered following the BF16 kernel schema, but with the reordered matrix type still being int4. To address this, new BF16 reorder kernels enabling the same are added. AMD-Internal: [SWLCSG-2943] Change-Id: Ib770ecbf90a3d906deafece94b1a96e0b9412738	2024-07-25 01:10:13 -04:00
Hari Govind S	eacad443e3	Optimization for DCOPY and SCOPY API - Replaced "vmovupd" with "vmovups" for "bli_scopyv_zen4_asm_avx512" kernel. - Optimization of loop unrolling for "bli_dcopyv_zen4_asm_avx512" and "bli_scopyv_zen4_asm_avx512" kernels. - Replaced existing load balancing algorithm for dcopy API with "bli_thread_range_sub" algorithm. - Included AOCL-dynamic values for optimal number of threads for zen5 architecture. AMD-Internal: [CPUPL-5238] Change-Id: Ic82bdfad9478c8f75dc5a3dcfed0df85fbcae957	2024-07-24 08:23:07 -04:00
Arnav Sharma	9583ee2e23	DGEMV Optimizations for NO_TRANSPOSE cases - Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases. - Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16. - Added a wrapper for DAXPYF kernels for redirection to kernels with a smaller fuse factor than 32. - Also added UKR tests for the new fused kernels. AMD-Internal: [CPUPL-5098] Change-Id: I0b102b67c6c068873393bac0494284f379c253f2	2024-07-24 15:59:36 +05:30
mkadavil	42e539b878	Quantization (scale + zero point) updates/fixes for BF16 LPGEMM api. -_mm512_cvtpbh_ps intrinsic is not supported in older versions of gcc (<gcc 12.2) and subsequently throws a compilation error. This is fixed by replacing this intrinsic with a macro that achieves the bf16 to f32 conversion via shift operations. -Bug fixes in the vector scale factor load in fringe kernels. AMD-Internal: [SWLCSG-2945] Change-Id: I8eac4c4b34b043e7a8116dc465723d8f85b28018	2024-07-23 04:39:14 +05:30
Shubham Sharma	16c56e0101	Added 24x8 triangular kernels for DGEMMT SUP - In order to reuse 24x8 AVX512 DGEMM SUP kernels, 24x8 triangular AVX512 DGEMMT SUP kernels are added. - Since the LCM of MR(24) and NR(8) is 24, therefore the diagonal pattern repeats every 24x24 block of C. To cover this 24x24 block, 3 kernels are needed for one variant of DGEMMT. A total of 6 kernels are needed to cover both upper and lower variants. - In order to maximize code reuse, the 24x8 kernels are broken into two parts, 8x8 diagonal GEMM and 16x8 full GEMM. The 8x8 diagonal GEMM is computed by 8x8 diagonal kernel, and 16x8 full GEMM part is computed by 24x8 DGEMM SUP kernel. - Changes are made in framework to enable the use of these kernels. AMD-Internal: [CPUPL-5338] Change-Id: I8e7007031e906f786b0c4fe12377ee439075207a	2024-07-22 12:02:30 -04:00
Vignesh Balasubramanian	b48e864e82	AVX512 optimizations for DAXPBYV API - Implemented AVX512 computational kernel for DAXPBYV with optimal unrolling. Further implemented the other missing kernels that would be required to decompose the computation in special cases, namely the AVX512 DADDV and DSCAL2V kernels. - Updated the zen4 and zen5 contexts to ensure any query to acquire the kernel pointer for DAXPBYV returns the address of the new kernel. - Added micro-kernel unit tests to GTestsuite to check for functionality and out-of-bounds reads and writes. AMD-Internal: [CPUPL-5406][CPUPL-5421] Change-Id: I127ab21174ddd9e6de2c30a320e62a8b042cbde6	2024-07-22 11:32:19 +05:30
Shubham Sharma	75df1ef218	Removed -fno-tree-loop-vectorize from kernel flags - This change is made in the CMake build system only. - Removed -fno-tree-loop-vectorize from global kernel flags, instead added it to lpgemm specific kernels only. - If this flag is not used, then gcc tries to auto vectorize the code, which results in usage of vector registers. If the auto vectorized function is using intrinsics, then the total number of vector registers used by intrinsic and auto vectorized code becomes more than the registers available in the machine, which causes reads and writes to the stack, which is causing regression in lpgemm. - If this flag is enabled globally, then the files which do not use any intrinsic code do not get auto vectorized. - To get optimal performance for both blis and lpgemm, this flag is enabled for lpgemm kernels only. Change-Id: I14e5c18cd53b058bfc9d764a8eaf825b4d0a81c4	2024-07-19 00:49:52 -04:00
Vignesh Balasubramanian	cec9fdcc6e	Framework enhancements for ?AXPBYV APIs - Implemented a new front-end for the BLAS/CBLAS calls to ?AXPBYV(BLAS-extension API), that is intended to be compiled only on Zen micro-architectures(as per the existing build system). - This new front-end makes the framework lightweight for BLAS/CBLAS calls to ?AXPBYV, by directly querying the architecture ID and deploying the associated computational kernel. - Further updated the rerouting to other L1 kernels based on alpha and beta value. This was initially present in the Typed-API interface. It has been moved inside the respective kernels, and only necessary rerouting is done to specific L1 kernels to avoid redundant checks. AMD-Internal: [CPUPL-5406] Change-Id: I4af943d477a25dcdab4ee6009ad3dfa6a5c2b37e	2024-07-18 10:06:31 -04:00
mkadavil	d37c91dffa	Quantization (scale + zero point) support for BF16 LPGEMM api. -Quantization of f32 to bf16 (bf16 = (f32 * scale_factor) + zero_point) instead of just type conversion in aocl_gemm_bf16bf16f32obf16. -Support for multiple scale/sum/matrix_add/bias post-ops in a single LPGEMM api call. -Post-ops mask related fixes in lpgemv kernels. -Additional scale post-ops sanity checks. AMD-Internal: [SWLCSG-2945] Change-Id: I3b35cc413c176bb50bfdbd6acd4839a5ba7e94bb	2024-07-18 05:32:51 -04:00
Hari Govind S	38824244d5	Implementation of AXPYF Kernels for DTRSV - Implemented two new axpyf kernels for fused factors 8 and 12 by manually unrolling the loops. Used to achieve better performance in var2 case. AMD-Internal: [CPUPL-5184] Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5	2024-07-16 06:23:01 -04:00
Arnav Sharma	4aa66f108e	Added CSCALV AVX512 Kernel - Added CSCALV kernel utilizing the AVX512 ISA. - Added function pointers for the same to zen4 and zen5 contexts. - Updated the BLAS interface to invoke respective CSCALV kernels based on the architecture. - Added UKR tests for bli_cscalv_zen_int_avx512( ... ). AMD-Internal: [CPUPL-5299] Change-Id: I189d87a1ec1a6e30c16e05582dcb57a8510a27f3	2024-07-15 07:17:43 -04:00
Shubham Sharma.	a7744361e4	DGEMM optimizations for Turin Classic - Introduced new 8x24 macro kernels. - 4 new kernels are added for beta 0, beta 1, beta -1 and beta N. - IR and JR loop moved to ASM region. - Kernels support row major storage scheme. - Prefetch of current micro panel of C is enabled. - Kernel supports negative offsets for A and B matrices. - Moved alpha scaling from DGEMM kernel to B pack kernel. - Tuned blocksizes for new kernel. - Added support for alpha scaling in 24xk pack kernel. - Reverted back to old b_next computation in gemm_ker_var2. - BugFix in 8x24 DGEMM kernel for beta 1: comparison for jmp conditions was done using integer instructions, which caused beta 1 path to never be taken. Fixed this by changing the comparison to double. AMD-Internal: [CPUPL-5262] Change-Id: Ieec207eea2a164603c8a8ea88e0b1d3095c29a3f	2024-07-09 07:53:27 -04:00
vignbala	236d092656	AVX512 optimizations for ZGEMM to handle k = 1 cases - Implemented bli_zgemm_16x4_avx512_k1_nn( ... ) AVX512 kernel to be used as part of BLAS/CBLAS calls to ZGEMM. The kernel is built for handling the GEMM computation with inputs having k = 1, with the transpose values being N(for column-major) and T(for row-major). - Updated the zgemm_blis_impl( ... ) layer to query the architecture ID and invoke the AVX2 or AVX512 kernel accordingly. - Added API level tests for accuracy and code-coverage, as well as micro-kernel tests for verifying functionality and out-of-bounds memory accesses. AMD-Internal: [CPUPL-5249] Change-Id: Id1f8bebff3e0da83c7febe86299564fd658b2e84	2024-07-09 07:07:24 -04:00
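
Commit 42e539b878 above mentions replacing the _mm512_cvtpbh_ps intrinsic with a macro that converts bf16 to f32 via shift operations. The sketch below is a scalar illustration of that idea only, not the BLIS macro itself: it relies on bf16 being the upper 16 bits of an IEEE-754 binary32 value, and the helper names bf16_to_f32 and f32_to_bf16_trunc are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* bf16 holds the sign, the 8 exponent bits and the top 7 mantissa bits of
   a binary32 float, so widening is a left shift by 16 into a 32-bit word.
   Hypothetical helper name, for illustration only. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* copy the bit pattern, no aliasing issues */
    return f;
}

/* Narrowing by truncation (drops the low 16 mantissa bits without rounding).
   Shown only to make the bit layout explicit; it is an assumption of this
   sketch, not the conversion used by the library. */
static uint16_t f32_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}
```

A vectorized version would presumably apply the same widen-and-shift step to each lane, but the exact instructions used in the commit are not shown in this log.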

795 Commits