amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-25 02:44:31 +00:00

Author	SHA1	Message	Date
Nallani Bhaskar	75f72b7f6e	Added aocl dynamic feature for dtrsm for small sizes Details: 1. Added aocl-dynamic for dtrsm native path When (m,n)<512 better performance observed for nthreads=4 2. Updated trsm_small threshold such that when (m+n)<320 trsm_small is doing better than native irrespective of number of threads Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487	2021-06-18 08:46:47 -04:00
Chandrashekara KR	d7377f967c	Merge "AOCL-Windows: Update BLIS build system" into amd-staging-milan-3.1	2021-06-17 08:49:55 -04:00
Kiran Varaganti	d26089c665	Multi-threaded BLIS - OpenMP Apart from "BLIS_NUM_THREADS" or OMP_NUM_THREADS, number of threads can also be set by the application by calling omp_set_num_threads(int ); In the function "bli_thread_init_rntm_from_env()" when environment variables are not set, number of threads is inferred by calling the API - omp_get_max_threads(). Now by default if OMP_NUM_THREADS or BLIS_NUM_THREADS are not set - it will run with omp_get_max_threads() threads. This feature is only enabled when BLIS is configured with openmp parallelization. Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6	2021-06-17 05:17:37 -04:00
Meghana Vankadari	d5ff5e5f50	Added dynamic threading support for SYRK SUP code path Details: - when AOCL dynamic is enabled, the decision to choose ST Vs MT to solve SYRK is taken based on dimensions of matrices. - Decisions to choose optimum number of threads will be updated in the subsequent commits. - Only local copy of rntm is modified by AOCL Dynamic feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Added an early-exit condition in bli_nthreads_optimum when nt =1 or nt=-1. This ensures that AOCL dynamic feature is not used when threading is set using BLIS_IC_NT or BLIS_JC_NT. Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0	2021-06-16 02:08:11 -04:00
Nallani Bhaskar	e328bdc549	Added prefetch in left cases of dtrsm small Details: 1. Added prefetching next micro-panel of A and B in dgemm block, which are helping in reducing load latency and improved performance. 2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core dgemm into macros and made it more modular 3. Packing and diagonal packing in main dgemm loops are modularized. Fringe cases are yet to modularize. 4. Updated dtrsm small thresholds for single and multi thread cases 5. Updated div/scale based on disable/enable of trsm pre-inversion 6. Code clean up Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df	2021-06-15 23:15:22 +05:30
Chandrashekara K R	f94e3ad237	AOCL-Windows: Update BLIS build system 1. Added support in cmake scripts for linking libomp for blis multithreading build. 2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file. 3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's. 4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS. 5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file. AMD Internal : [CPUPL-1630] Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f	2021-06-15 16:49:08 +05:30
Kiran Varaganti	c2abbcab96	Fix dgemm_ Multi-thread running as Single Thread Details: When parallelization is enabled in BLIS through environment variables BLIS_?C_NT or BLIS_?R_NT - dgemm_ is running as Single thread. This is fixed. Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1 irrespective of BLIS_IC_NT or BLIS_JC_NT values, as a result in dgemm_ interface it assumes single thread and calls small_gemm which ends up running sequentially. Fix: added a new function bli_thread_is_parallel() in bli_thread.c it returns 1 if parallelization is enabled either through BLIS_?C_NT values or BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or sequential one. Add fix for zgemm_ also. Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573	2021-06-15 12:14:11 +05:30
Dipal M Zambare	2f344f5df1	DTL logs corrections -- Fixed issues in printing the values of side, uploa and diaga parameters for hemm, hemv, her, her2, her2k, herk, symm, symv, syr, syr2, syr2k, syrk, trmm, trmv, trsm, trsv. -- For above API's logging was called with MKSTR() for side, uploa and diaga parameters. MKSTR is needed only for macro arguments but not for function's arguments. -- Added space between function name and data type where it was missing. Bench expects logs in this format. AMD-Internal: [CPUPL-1585] Change-Id: Ib6ab66890e68cfa52860f869d6a1c34e78036a2d	2021-06-04 15:24:13 +05:30
Nagarapu Phanikumar	7ea32e6d0b	Merge " Unifying BLIS Windows and Linux codebase" into amd-staging-milan-3.1	2021-06-03 06:03:26 -04:00
nphaniku	2bdee3cd6c	Unifying BLIS Windows and Linux codebase 1. Removed dependency on bli_config.h inclusion in blis.h 2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags. 3. CMAKE changes to incorporate new changes as per 3.1 code base. 4. Removed zen2 folder from Windows directory. AMD Internal : [CPUPL-1532] Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47	2021-06-03 15:28:10 +05:30
mkurumel	9afbb11b4f	DTL Logging bug in GEMV Details : - Fixed Incorrect Macro used in dgemv and cgemv Trace logging exit. AMD-Internal: [CPUPL-1403] Change-Id: Icac502d8d4adad112754d9c764a30d3db56a743f	2021-06-02 21:21:00 +05:30
mkurumel	99e3bce065	SGEMV : single Precision axpyf kernel optimization for SGEMV Details : - Implemented saxpyf kernel with fuse factor=6 for sgemv. AMD-Internal: [CPUPL-1403] Change-Id: I72fd30c08a789603267cf58910138549d45d231a	2021-06-02 07:55:48 -04:00
Nageshwar Singh	2e1a5bc1dd	Optimized double complex axpyf kernel for zgemv Details: - Implemented zaxpyf kernel with fuse factor=4 for zgemv. - Modified BLAS interface call for zgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: I2231285fe3060982d4434466346a040b7ab803fc	2021-06-01 18:03:29 +05:30
Kiran Varaganti	ff84d37930	Merge "SUP GEMM - Enable only block panel (var2m)" into amd-staging-milan-3.1	2021-05-31 06:46:04 -04:00
Meghana Vankadari	887ecb46e0	Added threshold logic for SYRK Details: - Added decision logic to choose between SUP and native implementations of SYRK for zen2 architectures. - For architectures other than zen2 it will be redirected to gemm threshold function. Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d	2021-05-31 05:34:38 -04:00
Kiran Varaganti	aa9f5b8b37	SUP GEMM - Enable only block panel (var2m) Completely disabling supvar1n (Panel Block) gemm to simplify things supvar1n perform better only when m >> and n=k=small (<10). This simplification will improve performance for m = n shape dgemm. Change-Id: I523fcb211e8ab92718ea7367f9707a38275e24b1	2021-05-30 21:22:44 +05:30
Madan mohan Manokar	6d6f746190	3m1 turning OFF since 3m1 is turned off in bla_gemm.c, setting FALSE for 3m1 in bli_l3_ind_oper_st AMD-Internal: [CPUPL-1592] Change-Id: I80dfe7c993f9edfbf752b7351cfdaa22a9e60035	2021-05-26 10:06:54 +05:30
Meghana Vankadari	4446395047	Redirecting dgemv to axpyf based implementation for smaller sizes. AMD-Internal: [CPUPL-1403] Change-Id: I0ff2763c41c5ae598c58bc250adc317d7f8a4994	2021-05-25 01:39:12 -04:00
satish kumar nuggu	82087773a0	Optimized single threaded dtrsm small for right cases Details: 1. Added optimized dtrsm kernels for all 8 right side cases Below are few notable optimizations which improved performance a. Loading, transposing (for transa cases), packing and reusing of a01 block required for GEMM operation. The block size increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM from one end of A to other end of triangular A b. Packing of 6 diagonal elements in one location helped to utilize cache line efficiently AMD-Internal: [CPUPL-1563] Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3	2021-05-25 01:09:50 -04:00
Meghana Vankadari	8c9a7c21b4	Optimized axpyf kernel for scomplex datatype Details: - Implemented axpyf kernel with fuse factor=4 for scomplex datatype. - Modified BLAS interface call for cgemv to reduce framework overhead. - Directed gemv to dotv in the case where dimension of y vector is 1. - when alpha = 0, gemv becomes scalv of Y with beta. Added code to return early after scaling Y vector with beta. AMD-Internal: [CPUPL-1402] Change-Id: Ibaab078008d76953332ba4da3515993578c0e586	2021-05-24 14:40:17 +05:30
Dipal M Zambare	5f53d14971	Added bench utility for dotv and scalv APIs. - Added bench utility for dotv and scalv API's - Corrected logging for scalv to handle complex types - Corrected logging to remove transpose field from dotv logs AOCL-Internal: [CPUPL-1577] Change-Id: Ieb29e773309de1520c7fa5b79b97c943d894ba07	2021-05-21 10:00:32 +05:30
Nallani Bhaskar	b2e68b9812	Merge "Added optimized single threaded dtrsm small for left cases" into amd-staging-milan-3.1	2021-05-19 00:47:56 -04:00
Nallani Bhaskar	3a2e4c3db8	Added optimized single threaded dtrsm small for left cases Details: 1. Added optimized dtrsm kernels for all 8 left side cases Below are few notable optimizations which improved performance a. Loading, transposing (for transa cases), packing and reusing of a10 block required for GEMM operation. The block size increases from 0 to 8X(m-8) in steps of 8x8 while solving TRSM from one end of A to other end of triangular A b. Performing in-register transpose whenever required c. Packing of 8 diagonal elements in one location helped to utilize cache line efficiently 2. Enabled calling dtrsm small for smaller sizes at cblas level itself to avoid framework overhead, which is significant for very small sizes 3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun and manideep.kurumella@amd.com for implementing lut kernels using intrinsics. 4. Removed all older implementations of strsm which are not developed as per the guidelines, can be referred from older releases if required. Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6	2021-05-18 16:16:00 +05:30
Dipal M Zambare	9e27065c2b	Fixed blastest failure for amd64 configuration - When building for amd64 configuration, small matrix support for dgemm is not enabled (yet). Functions supporting small matrix implementation are called even when small matrix support is disabled. Update code to prevent this. AMD-Internal: [CPUPL-1575] Change-Id: I3a1692e965679cfde44938b1d26951145c790aa0	2021-05-18 03:24:06 -04:00
Meghana Vankadari	dc2d6ee763	Moved dynamic threading function from GEMMT to GEMM Details: - Current tuning for choosing optimal number of threads is done for GEMM. - Dynamic thread calculation function was placed in gemmt code flow instead of gemm by mistake. Fixing it with this commit. AMD-Internal: [CPUPL-1376] Change-Id: Iccb42a7a617b9b4cdb4c4af9be21aa82aaaabbcc	2021-05-07 12:10:53 +05:30
Meghana Vankadari	33ddf2e448	Fixed blastest failure for haswell configuration Details: - Placed optimized version of BLAS DGEMM, ZGEMM definitions under BLIS_CONFIG_EPYC as they use gemm small which are defined only for zen family configurations. - Added code to query and set cntx in gemv and trsv framework before cntx is referred for any function pointers to avoid querying from NULL pointer. AMD-Internal: [CPUPL-1562] Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9	2021-05-07 01:49:54 -04:00
Meghana Vankadari	eea347b02e	Added dynamic threading support for GEMM SUP code path Details: - Introduced new feature called AOCL_DYNAMIC. - When this macro is defined, Optimum number of threads to solve DGEMM is estimated based on the dimensions (M,N,K). - Range of optimum number of threads will be [1, num_threads], where "num_threads" is number of threads set by the application. - Num_threads is derived from either environment variable "OMP_NUM_THREADS or BLIS_NUM_THREADS' or bli_set_num_threads() API. - Only local copy of rntm is modified by AOCL_DYNAMIC feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Optimum number of threads calculation is done only for SUP. - Since 'native' code path handles larger problem sizes, we use max number of threads recommended by the application. AMD-Internal: [CPUPL-1376] Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3	2021-05-07 09:52:51 +05:30
Madan mohan Manokar	c1fa9abe32	zgemm native path tuning 1. NC and MC values are tuned for both single-instance and multi-instance run. 2. zen2 and zen3 configs updated. 3. SUP path disabled for zgemm, since tuned native path performed better. To be re-enabled after setting right threshold for SUP selection. AMD-Internal: [CPUPL-1442] Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239	2021-05-06 01:15:35 -04:00
Meghana Vankadari	1303732e83	Added sup functionality for SYRK Details: - Added bli_syrksup function that internally uses gemmt implementation. - Modified OAPI of syrk to call SUP before proceeding to the conventional implementation. - Copied gemmsup threshold function for syrk temporarily. Thresholds are yet to be derived for syrk. Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d	2021-04-29 12:35:30 +05:30
lcpu	7401effc03	BLIS:merge: Fixed merge conflicts that arose while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to fall through to the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Madan mohan Manokar	f6088ac1cf	Enabling 3m_sqp and 3m1 methods 1. Re-enabling 3m methods for zgemm. 2. Vectorization of pack_sum routines re-enabled with bug fix. 3. 8mx6n kernel added. AMD-Internal: [CPUPL-1352] Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810	2021-04-22 02:47:31 -04:00
Madan mohan Manokar	997133dc11	sup zgemm improvement 1. In zgemm, mkernel outperforms nkernel for both m > n, and n > m. 2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on analysis done. Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e AMD-Internal: [CPUPL-1352]	2021-04-09 01:45:04 -04:00
Kiran Varaganti	a7b7fcae59	Merge "Fix test_dotv.c when complex-return=intel" into amd-staging-milan-3.1	2021-04-06 10:01:45 -04:00
Kiran Varaganti	657f58b82b	Fix test_dotv.c when complex-return=intel When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototype changes. Now "return parameter" will become the first argument of these functions and these functions return void. This fix addresses the change in the function declarations when --complex-return=intel is enabled. Missing Trace and Log statements for this configuration are now added. [CPUPL-1376] Change-Id: Ib420989da71839211c16088bf431a2ad775a3978	2021-04-06 15:38:12 +05:30
Madan Mohan Manokar	2aad3fbe55	Merge "disabled zgemm induced and gemm sqp temporarily." into amd-staging-milan-3.1	2021-04-05 05:51:12 -04:00
Madan mohan Manokar	7112b73d0d	disabled zgemm induced and gemm sqp temporarily. 1. mx1, mx4 kernel addition and framework modification. 2. 8mx6n kernel addition. 3. NULL check added in dgemm_sqp malloc. 4. mem tracing added. 5. Restricted 3m_sqp to limited matrix sizes. 6. Induced methods disabled temporarily for debug. AMD-Internal: [CPUPL-1352] Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2	2021-04-04 20:43:03 +05:30
Dipal M Zambare	5562a27823	Added check for zero dimensions and early return in ?gemv and ?scal API. If one of the passed dimensions is zero, these API's will perform early return and avoid crashes in case any other pointer inputs are null. AMD-Internal: [SWLCSG-602] Change-Id: Ibe8902beef286410707a2a88e94b933b49975c85	2021-03-26 21:08:22 +05:30
Field G. Van Zee	4493cf516e	Redefined BLIS_NUM_ARCHS to update automatically. Details: - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum value in the arch_t enum. This means that it no longer needs to get updated manually whenever new subconfigurations are added to BLIS. Also removed the explicit initial index assignment of 0 from the first enum value, which was unnecessary due to how the C language standard mandates indexing of enum values. Thanks to Devin Matthews for originally submitting this as a PR in #446. - Updated docs/ConfigurationHowTo.md to reflect the aforementioned change.	2021-03-15 13:12:49 -05:00
Field G. Van Zee	a4b73de84c	Disabled _self() and _equal() in bli_pthread API. Details: - Disabled the _self() and _equal() extensions to the bli_pthread API introduced in d479654. These functions were disabled after I realized that they aren't actually needed yet. Thanks to Devin Matthews for helping me reason through the appropriate consumer code that will appear in BLIS (eventually) in a future commit. (Also, I could never get the Windows branch to link properly in clang builds in AppVeyor. See the comment I left in the code, and #485, for more info.)	2021-03-12 19:47:39 -06:00
Field G. Van Zee	f9d604679d	Added _self() and _equal() to bli_pthread API. Details: - Expanded the bli_pthread API to include equivalents to pthread_self() and pthread_equal(). Implemented these two functions for all three cpp branches present within bli_pthread.c: systemless, Windows, and Linux/BSD.	2021-03-12 19:47:39 -06:00
Field G. Van Zee	fa9b3c8f6b	Shuffled code in Windows branch of bli_pthreads.c. Details: - Reordered the definitions in the cpp branch in bli_pthreads.c that defines the bli_pthreads API in terms of Windows API calls. Also added missing comments that mark sections of the API, which brings the code into harmony with other cpp branches (as well as bli_pthread.h).	2021-03-11 15:13:51 -06:00
Field G. Van Zee	95d4f3934d	Moved cpp macro redef of strerror_r to bli_env.c. Details: - Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r (in terms of strerror_s) from bli_thread.h to bli_env.c. It was likely left behind in bli_thread.h in a previous commit, when code that now resides in bli_env.c was moved from bli_thread.c. (I couldn't find any other instance of strerror_r being used in BLIS, so I moved the #define directly to bli_env.c rather than place it in bli_env.h.) The code that uses strerror_r is currently disabled, though, so this commit should have no effect on BLIS.	2021-03-11 13:50:40 -06:00
Field G. Van Zee	8a3066c315	Relocated gemmsup_ref general stride handling. Details: - Moved the logic that checks for general stridedness in any of the matrix operands in a gemmsup problem. The logic previously resided near the top of bli_gemmsup_int(), which is the thread entry point for the parallel region of the current gemmsup implementation. The problem with this setup was that the code would attempt to reject problems with any general-strided operands by returning BLIS_FAILURE, and that return value was then being ignored by the l3_sup thread decorator, which unconditionally returns BLIS_SUCCESS. To solve this issue, rather than try to manage n return values, one from each of n threads, I simply moved the logic into bli_gemmsup_ref(). I didn't move it any higher (e.g. bli_gemmsup()) because I still want the logic to be part of the current gemmsup handler implementation. That is, perhaps someone else will create a different handler, and that author wants to handle general stride differently. (We don't want to force them into a particular way of handling general stride.) - Removed the general stride handling from bli_gemmtsup_int(), even though this function is inoperative for now. - This commit addresses issue #484. Thanks to RuQing Xu for reporting this issue.	2021-03-09 17:52:59 -06:00
nphaniku	e3cc577ec1	AOCL Windows: 3.1 BLIS changes 1. Incorporated code review comments . 2. Updated Copyright to 2021. AMD Internal : [CPUPL-1422] Change-Id: I722b0f71daae029a3dcc2cbd029524ea39ca78e6	2021-03-09 17:35:57 +05:30
nphaniku	d78defa0fc	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for adding new files to the build. 2. Added upper case support for a couple of API's. 3. bool is not supported in clang, so defined it. AMD Internal : [CPUPL-1422] Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61	2021-03-08 22:32:13 +05:30
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467 ) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
Kiran Varaganti	12d13629f9	Fix Debug Trace Log in dgemm_ and zgemm_ Replaced "MKSTR(ch)" in the DTL call "AOCL_DTL_LOG_GEMM_INPUTS(AOCL_DTL_LEVEL_TRACE_1, MKSTR(ch)...)" with "D" and "Z" for dgemm_ and zgemm_ respectively to prevent printing wrong data-type. [CPUPL-1449] Change-Id: Ic91537189352bdb164411799e127de990a5c9a08	2021-03-02 15:16:21 +05:30
RuQing Xu	b8dcc5bc75	Fixed typed API definition for gemmt (#476 ) Details: - Fixed incorrect definition and prototype of bli_?gemmt() in frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously defined identically to gemm, which was wrong because it did not take into account the uplo property of C. - Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md. Specifically, the document erroneously listed only a single transab parameter instead of transa and transb.	2021-03-01 16:58:24 -06:00
Meghana Vankadari	22d4689360	Implemented 16x3 based gemm kernel for the case where A has transpose Details: - This implementation does a transpose operation while packing 16xk of A buffer and passes it to 16x3-nn kernel. - The same implementation works for the case where B has transpose. AMD-Internal: [CPUPL-1376] Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17	2021-02-18 16:47:14 +05:30
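
The fix in c2abbcab96 describes a predicate that reports whether parallelism was requested at all, either through a total thread count or through the per-loop factors. The following is only a minimal sketch of that decision, not the actual `bli_thread_is_parallel()` code: the struct `toy_rntm_t`, its field names, and the function `toy_thread_is_parallel()` are hypothetical, and the sketch assumes every unset value is stored as -1, as the message describes for `num_threads`.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the BLIS runtime object (rntm).
   Field names are illustrative only; -1 is assumed to mean "not set". */
typedef struct
{
    int num_threads; /* total count, e.g. BLIS_NUM_THREADS or OMP_NUM_THREADS */
    int jc_nt;       /* per-loop factor, e.g. BLIS_JC_NT */
    int ic_nt;       /* per-loop factor, e.g. BLIS_IC_NT */
    int jr_nt;       /* per-loop factor, e.g. BLIS_JR_NT */
    int ir_nt;       /* per-loop factor, e.g. BLIS_IR_NT */
} toy_rntm_t;

/* Returns true when either the total thread count or any per-loop
   factor asks for more than one thread; false means run sequentially. */
static bool toy_thread_is_parallel( const toy_rntm_t* r )
{
    if ( r->num_threads > 1 ) return true;

    return r->jc_nt > 1 || r->ic_nt > 1 || r->jr_nt > 1 || r->ir_nt > 1;
}
```

With this shape, a caller that only sets per-loop factors (for example `jc_nt = 2`, `ic_nt = 4`, `num_threads = -1`) is still treated as parallel, which is the case the commit message says was previously misrouted to the sequential path.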

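
Commit 7401effc03 mentions an automatic reduction of the thread count when the requested number is prime and exceeds `BLIS_NT_MAX_PRIME` (default 11). The message does not say how the count is reduced, so the sketch below assumes the simplest choice, decrementing by one; the names `TOY_NT_MAX_PRIME`, `toy_is_prime()`, and `toy_adjust_num_threads()` are hypothetical and do not come from BLIS.

```c
#include <stdbool.h>

/* Hypothetical threshold mirroring the default of 11 quoted in the log. */
#define TOY_NT_MAX_PRIME 11

/* Trial-division primality test; adequate for small thread counts. */
static bool toy_is_prime( int n )
{
    if ( n < 2 ) return false;

    for ( int d = 2; d * d <= n; ++d )
        if ( n % d == 0 ) return false;

    return true;
}

/* Assumption: a prime count above the threshold is reduced by one so
   that it can be factored across loops; other counts pass through. */
static int toy_adjust_num_threads( int nt )
{
    if ( nt > TOY_NT_MAX_PRIME && toy_is_prime( nt ) ) return nt - 1;

    return nt;
}
```

Under these assumptions a request for 13 threads becomes 12, which factors as 2x6 or 3x4, while a request for 11 or 16 threads is left unchanged.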
1017 Commits
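
Commit 4493cf516e replaces a manually maintained count macro with a trailing enum value. The idiom can be sketched in C as follows; the names `toy_arch_t`, `TOY_ARCH_*`, `TOY_NUM_ARCHS`, and `toy_arch_name()` are hypothetical and are not the real members of `arch_t`.

```c
/* Hypothetical enum; the real arch_t has different members.
   C assigns 0 to the first enumerator and increments by one,
   so the final enumerator equals the number of entries before it. */
typedef enum
{
    TOY_ARCH_ZEN,
    TOY_ARCH_ZEN2,
    TOY_ARCH_GENERIC,
    TOY_NUM_ARCHS /* must stay last; grows automatically with new entries */
} toy_arch_t;

/* A table sized by the trailing enumerator stays in step with the enum. */
static const char* const toy_arch_names[ TOY_NUM_ARCHS ] =
{
    "zen", "zen2", "generic"
};

/* Bounds-checked lookup; out-of-range ids map to "unknown". */
static const char* toy_arch_name( toy_arch_t id )
{
    return ( (unsigned)id < (unsigned)TOY_NUM_ARCHS ) ? toy_arch_names[ id ]
                                                      : "unknown";
}
```

Adding a new enumerator before `TOY_NUM_ARCHS` changes the count without touching any separate macro, which is the maintenance benefit the commit message describes.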