amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Nageshwar Singh	dbd7b28373	Development of AVX2 axpyv kernels for c and z datatypes. Details - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv). - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family. AMD-Internal: [CPUPL-1231] Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7	2020-10-23 09:33:35 +05:30
Dipal M Zambare	e0e0760ed6	AOCLDTL: Corrected mapping between BLIS threads and trace files Replaced gettid() syscall with omp_get_thread_num() to create files for logging the data. This will ensure that there is one to one mapping between threads created in BLIS and ID's used to name trace and log files. AMD-Internal: [CPUPL-1236] Change-Id: I45b1721a7a9c855eeec43e7cbb5089f2a955ff72	2020-10-22 15:03:39 +05:30
Nageshwar Singh	b245ea9c65	cblas_?cabs1 test cblas header file bug Details: - Added BLIS_ENABLE_CBLAS around cblas header file. AMD-Internal: [CPUPL-1129] Change-Id: I3baacd26aa96c8eeb753d95210817ffe9b2a3f85	2020-10-13 01:43:55 -04:00
managalv	90f30e4c37	Optimised dotv kernel by SIMD approach and by removing framework overhead Details: - Kernel is called directly from API call to avoid framework overhead in case of complex float and complex double precisions. - Added SIMD code for complex float and complex double and unrolled for loop 5 times to improve performance AMD-Internal: [CPUPL-1057] Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0	2020-10-13 18:59:31 +05:30
Meghana Vankadari	029ed033f1	Added decision logic for zgemmt AMD-Internal: [CPUPL-1032] Change-Id: I8ba1c66b06cd91a864b16a249b263b3694ac1d5e	2020-10-09 12:11:53 +05:30
Meghana Vankadari	016885348c	Added CBLAS interface and test file for gemm_batch API AMD-Internal: [CPUPL-1184] Change-Id: Icc5c41429b0d92f1a66a955769cc0518ca4706ee	2020-10-07 06:55:08 -04:00
Meghana Vankadari	47744663d9	Enabling framework optimizations for zen family architectures. Details: - Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas framework optimizations for zen family configurations. - The macro needs to be defined in family.h files of respective arch configs. - Moved zen2-specific optimized kernels to zen folder, in order to be accessible to all zen family architectures. Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d	2020-10-07 13:10:50 +05:30
Meghana Vankadari	74c9d3f36e	Added decision logic for DGEMMT for zen2 configuration AMD-Internal: [CPUPL-1044] Change-Id: Ifc4b82dcfce5aa6770928010d430b0832c07cb41	2020-10-05 12:50:44 +05:30
Kiran Varaganti	aa56c36b82	Fixed Logs & code cleanup Fixed AOCL DTL logs printing incorrect alpha and beta values for single precision. Added missing info like data-types, lower or upper triangular and Side parameter in the case of TRSM. Code cleanup and formatting the files test_cabs1.c, test_axpbyv.c and test_gemm.c. In dumping trsm parameters replaced 'side' with bli_is_right(side). Change-Id: Ic81503ae696956eb074ec208f7109d1a394183d7	2020-10-04 22:17:31 +05:30
Meghana Vankadari	9a330f1754	Added debug trace and log support for gemmt and TRSM APIs Details: - Added debug trace support for DGEMMT and DTRSM APIs. - Added log support for gemmt, trsm APIs. - Modified gemm dump_sizes function to dump transpose parameters. AMD-Internal: [CPUPL-1210] Change-Id: Ice1effe27ec349203ce5def030a6b85b204bd91e	2020-10-02 12:31:47 +05:30
Meghana Vankadari	b5d0c81178	Added BLAS interface for gemm_batch API Details: - gemm_batch API computes a series of GEMM for groups of general matrices. - Each group contains matrices with same parameters. - This API is part BLAS extension APIs. AMD-Internal: [CPUPL-1184] Change-Id: Ic23772830eb1d157da4db45158a039b0826419fd	2020-09-24 23:52:31 -04:00
Nageshwar Singh	5243da5cec	BLIS: CBLAS Extensions. cblas_?cabs1 : Absolute value of a complex number cabs1 Details: - added cblas extension cblas_?cabs1. - Functionality : res=|Re(z)|+|Im(z)|, z is a complex number, and res is a value containing the absolute value of a complex number z. AMD-Internal: [CPUPL-1129] Change-Id: I4a3c265c89527c8fd3060c5d2ed38b1953ce6343	2020-09-23 19:01:02 +05:30
Meghana Vankadari	80828f6fda	Added few changes in test_axpbyv.c file Details: - Corrected "#if" directive in line 89 - Commented out "#define print" to disable printing the vectors Change-Id: I9ec3cbfb716540dd3e2264f5c3925d9e0c0c294a	2020-09-16 23:48:01 -04:00
Kiran Varaganti	5a8bd9f41c	Enable BLAS interface in BLIS Modified Makefile in test folder to enable calling BLAS interfaces for BLIS as well. This is possible by replacing -DBLIS with -DBLAS=\"aocl\" in the makefile. Also added linking to multi-threaded MKL library. Change-Id: Iccf2ec99b48bb35da985b69218bc680f678ff7c9	2020-09-16 23:26:44 +05:30
Meghana Vankadari	6c9bf36424	Added BLAS and CBLAS interfaces for axpby API Details: - The axpby routines perform a vector-vector operation defined as y = ax + by where a, b are scalars and x, y are vectors. - This API is part of BLAS-like extension APIs Change-Id: I17a53b03bba97de7ae1995a9f086084bd241bcdc AMD-Internal: [CPUPL-1118]	2020-09-16 09:49:31 +05:30
Meghana Vankadari	43d90e3110	Handling beta=0 case separately for gemv inside bli_dgemv_zen_ref_c function Details: - For GEMV whenever beta = 0, we should not scale vector 'y' with beta, instead overwrite the 'y' vector with zeroes before carrying out the operation. Change-Id: I159afba6c6ac3b72b74718fab7a4f4ec293012c5	2020-09-08 07:50:34 -04:00
dzambare	95bbdc12ab	Added -fomit-frame-pointer in kernel options Some of the SUP kernels now use rbp register. This register was also used by compiler to support automatic function call tracing, which was creating the conflict. Automatic call tracing feature is removed for now. If needed it can be enabled for non kernel code. Change-Id: Ib7ad00875f501ee2ad552cbb2ecdc245002d63b7 AMD-Internal: [CPUPL-1135]	2020-08-31 09:53:16 +05:30
Kiran Varaganti	339a9314c4	Fixed merges in testcpp/test.sh Change-Id: I3c388448a342153f2ee9c5a6a7ae102ebd1c0ea0	2020-08-14 10:52:49 +05:30
Kiran Varaganti	fb0b4b57c1	Fixed missing changes Corrections in bli_gemm_front.c, taken the corrections from both public repo and 2.2.1 branch [CPUPL-1067] Change-Id: I4887ece6aa20bdfb87d97e7acebbe04cb9feea02	2020-08-13 00:29:37 +05:30
bhaskarn	d186cfdf2e	CPUPL-1074: - Bug fix in sgemmsup 1x16 kernel for beta zero with C col storage: the rcx register increment was missing, because of which 4 values in the output were overwritten. Change-Id: Ia3028040dce3e615f1db5a331498d86faadcf916	2020-08-11 01:26:26 -04:00
Dipal M Zambare	7bbcae5a18	Fixed build issue in cpp testsuite. This issue was caused by incorrect merging of cpp and testcpp files. Change-Id: Idc40fbdaa55b6052a6a061d2d3e5cfae76b99916 AMD-Internal: [CPUPL-1067]	2020-08-10 12:17:09 +05:30
dzambare	3177db4888	Updated version number. Change-Id: Iba3659b04f2d85ec7dc008ceb84da73c7c66530a	2020-08-06 15:00:46 +05:30
dzambare	f30c3c7766	mend Change-Id: Iba3659b04f2d85ec7dc008ceb84da73c7c66530a	2020-08-06 14:29:24 +05:30
dzambare	267a959af1	Rebased amd-staging-milan-3.0 branch on master -- Rebased on top of master commit # `6e522e5823` -- Updated merged code to remove duplicated code added by auto-merging -- Updated merged code to rename bool_t type -- Updated merged code to rename bli_thread_obarrier -- Updated merged code to rename bli_thread_obroadcast Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c AMD-Internal: [CPUPL-1067]	2020-08-06 10:09:29 +05:30
phakumar	c7a914411f	BLIS library porting on to Windows: GEMMT changes porting on to Windows AMD Internal : [CPUPL-1061] Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a	2020-08-06 10:09:29 +05:30
Mangala V	5b8c2bc9e2	Revert "CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed" This reverts commit `725bf5aceb`. Reason for revert: <INSERT REASONING HERE> Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8	2020-08-06 10:09:29 +05:30
managalv	2a0928aad4	CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed Details: - Problem: If row major, first four elements of last column on output matrix C was not updated If col major, first four elements of last row on output matrix C was not updated - Solution: Updating elements after computation is done on right offset in bli_dgemmsup_rv_haswell_asm_5x8() Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae	2020-08-06 10:09:29 +05:30
Dipal M Zambare	4f69332879	Added testsuite for gemmt APIs. The testsuite covers all combinations of upper, lower, transpose and API formats. AMD Internal: [CPUPL-1021] Change-Id: I2a1d79eba1dcaf4217fd9c2c346bd6173b80a782	2020-08-06 10:09:29 +05:30
Meghana Vankadari	4b8a84b9bc	Added some optimizations for gemmt default path Details: - If there are any zero rows or columns along the edges of MCxNC block of C, shrink the dimensions to avoid "no-op" iterations. - For lower-triangle kernel variant, Added a flag to determine if a block that is strictly below triangle is reached. Once such block is reached, the flag is set and all the blocks that are below it are strictly below the diagonal and flag is used to make decision. - For upper-triangle kernel-variant, whenever a block that is strictly below the triangle is reached, break the for loop and go for next iteration of JR loop because all the blocks below it will also be strictly below diagonal and are filled with zeroes which requires no computation. Change-Id: I606b0f900509aab6ed7ff30cefee9d7207b7b010	2020-08-06 10:09:29 +05:30
Meghana Vankadari	9f28a28cbf	Added support to handle col-major storage of C in SUP kernel Details: - Unlike default path, storage scheme of C is not always row-major in SUP. - Whenever C is col-major, the temporary buffer 'ct' is also chosen to be col-major. - Since update routines only support row-major order, a transpose is induced for c and ct buffers before passing them to update routine. Change-Id: I3fea10860f39632df7540c9399786e7aa1cfba37	2020-08-06 10:09:28 +05:30
dzambare	536edc4088	Added support for zen3 configuration - User can now specify zen3 configuration, currently it reuses block sizes and kernels from zen2. - Auto configuration can detect and enable if zen3 config is needed - Added support for amd64 bundle which contains all zen platforms - Moved existing amd bundle to amd64 legacy. AMD-Internal: [CPUPL-500, CPUPL-1013] Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957	2020-08-06 10:09:28 +05:30
Dipal M Zambare	bee39108d1	Added support for logging gemm input values. Added BLIS specific extension to AOCL DTL, in this added support to print the input matrix sizes from BLIS library. AMD Internal: [CPUPL-806] Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561	2020-08-06 10:09:28 +05:30
dzambare	48b89f50e4	Checking for zero dimension is moved to bli_gemm_xx call. This will ensure early return in case full gemm processing is not needed. Based on dimension which is found to be zero following actions will be taken: If 'c' has zero dimension, no further processing is required. If alpha is zero or if 'a' or 'b' has zero dimension, we perform scalm operation instead of gemm. (c = alpha*a + beta*b) Change-Id: Icc031944fc4e80138adf991974547f2d57ab570b AMD-Internal: [CPUPL-904]	2020-08-06 10:09:28 +05:30
Field G. Van Zee	88ac67a8cf	Renamed bli_thread_obarrier(), _obroadcast(). Details: - Renamed two bli_thread_*() APIs: bli_thread_obarrier() -> bli_thread_barrier() bli_thread_obroadcast() -> bli_thread_broadcast() The 'o' was a leftover from when thrcomm_t objects tracked both "inner" and "outer" communicators. They have long since been simplified to only support the latter, and thus the 'o' is superfluous. Change-Id: If9ec9a2383dfb02e1cfc74918f87a1fabddbd55b	2020-08-06 10:09:28 +05:30
Field G. Van Zee	26cd966af7	Merged test/sup, test/supmt into test/sup. Details: - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able to compile and run both single-threaded and multithreaded experiments. This should help with maintenance going forward. - Created a test/sup/octave_st directory of scripts (based on the previous test/sup/octave scripts) as well as a test/sup/octave_mt directory (based on the previous test/supmt/octave scripts). The octave scripts are slightly different and not easily mergeable, and thus for now I'll maintain them separately. - Preserved the previous test/sup directory as test/sup/old/supst and the previous test/supmt directory as test/sup/old/supmt. Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed	2020-08-06 10:09:28 +05:30
Meghana Vankadari	01e1a41c95	Optimized DGEMV kernel and changed BLAS interface call Details: - Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of dgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for DGEMV to reduce framework overhead. - Currently these changes are applicable for zen2 configuration. They will be enabled for zen family processors in future. - Changed naming convention for new BLAS macros to indicate their use. - Added new optimized kernel for axpyf under zen2 folder. - Implemented basic GEMV kernel without using axpyv or axpyf. This kernel is chosen for small sizes. Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran	013d6a2cab	Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. -setv simd kernel is added for single and double precision elements Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-817]	2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran	0ecc147f03	Adding a simd kernel for copyv function Details: - Separate kernel for copyv function added to improve performance. - Modified cntx_init file in zen and zen2 configuration - Added test_copyv.c in test folder - Modified test/Makefile to include test_copyv.c Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41 Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-08-06 10:09:28 +05:30
Meghana	aee0376b16	Added opt kernels for SWAPV Details: -Added SIMD kernels for SWAPV for both single and double precisions. -Modified cntx_init file for zen and zen2 configurations to choose opt kernels for SWAPV. -Added test_swapv.c in test folder. -Modified test/Makefile to include test_swapv.c Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847]	2020-08-06 10:09:28 +05:30
Meghana	560bf86166	Made some critical changes to small_gemm kernels Details: - In case of GEMM, whenever beta is zero, we need to perform C = alpha * (A * B) instead of C = beta * C + alpha * (A * B) Added conditions to check the value of beta at different levels inside small_gemm kernels and decide whether to perform scaling C with beta or not. -Modified small_gemm kernels to use BLIS specific functions to retrieve different fields of objects. -Calling bli_gemm_check before entering bli_gemm_small to facilitate early return in case of invalid inputs. -For corner cases inside small_gemm kernels, a buffer called f_temp is used to load and store data to and from registers. The buffer is now populated with zeroes before use. -In bli_gemm_front, datatypes of status and return value from bli_gemm_small are not matching. Corrected the datatype of the variable 'status' inside bli_gemm_front to err_t. Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-689,SWLCSG-104]	2020-08-06 10:09:28 +05:30
Field G. Van Zee	cec95ea3a9	Skip building thrinfo_t tree when mt is disabled. Details: - Return early from bli_thrinfo_sup_grow() if the thrinfo_t object address is equal to either &BLIS_GEMM_SINGLE_THREADED or &BLIS_PACKM_SINGLE_THREADED. - Added preprocessor logic to bli_l3_sup_thread_decorator() in bli_l3_sup_decor_single.c that (by default) disables code that creates and frees the thrinfo_t tree and instead passes &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the sup implementation. - The net effect of the above changes is that a small amount of thrinfo_t overhead is avoided when running small/skinny dgemm problems when BLIS is compiled with multithreading disabled. Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0	2020-08-06 10:09:28 +05:30
Field G. Van Zee	2839b88f00	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. - Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. AMD-Internal: [CPUPL-713] Change-Id: I9536648e7befac4d2dc17805e44ef34470961662	2020-08-06 10:09:28 +05:30
Meghana Vankadari	a03a0f8f70	Made framework changes to initialize specific cache block sizes for TRSM. Details: -This commit addresses the performance optimization(single-thread and multi-thread) for DTRSM on zen2. -This new optimization employs different MC, KC & NC values for TRSM than what is being used in other Level-3 routines like DGEMM. -Changed TRSM framework code to choose these blocksizes for TRSM on zen family configurations. -Added a new field called "trsm_blkszs" to cntx structure in order to store TRSM specific block sizes. -Implemented routines to initialize, set and query the TRSM-specific block sizes. -Defined a new macro "AOCL_BLIS_ZEN" in configure script. This macro is automatically defined for zen family architectures. It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes. Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6 Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com> AMD-Internal: [CPUPL-656]	2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran	6b5c68b9ed	"Merge Selective Packing code from amd branch flame/blis" Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed	2020-08-06 10:09:28 +05:30
Kiran Varaganti	307ddc3110	Revert " Merge Selective Packing code from amd branch flame/blis" This reverts commit `e4a6af33f5`. Reason for revert: <Review not done> Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2	2020-08-06 10:09:28 +05:30
Nallani Bhaskar	447cdf5ea8	Added sup support for sgemm under zen and related frame work changes. Change-Id: Ia7e88b96d3a3617e8d24754f50db081ffe2e9955	2020-08-06 10:09:28 +05:30
Kiran Varaganti	d334452df4	Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration. This is same as release 1.3. This was added before to improve DGEMM multithreaded scalability on Naples when the number of threads is greater than 16. By mistake this got deleted in the many changes done for the 2.0 release; now we are adding this change back. Also code cleanup in bli_gemm_front.c. Change-Id: I6827b58d2dab1041fe182fef5a007b679ac4bb1f	2020-08-06 10:09:10 +05:30
kdevraje	8f2677ac14	Adding threshold condition to dgemm small matrix kernels. Defining the constants in zen2 configuration Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5	2020-08-06 10:08:04 +05:30
Kiran Varaganti	87d5166b28	Merged BLIS Release 1.3 Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1 Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81	2020-08-06 10:07:34 +05:30
sraut	e082fc2b1f	Fix on EPYC machine for multi instance performance issue. Issue: For the default values of mc, kc and nc in multi-instance mode, the performance across the cores dips drastically. Fix: After experimentation, found a different set of values (mc, kc and nc) which fits in the cache size, so that performance remains the same across all the cores. Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe	2020-08-06 10:07:08 +05:30
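
Several commits above (43d90e3110 for gemv, 560bf86166 for small gemm) change the beta = 0 path so that the output is overwritten rather than scaled. A minimal sketch of why that matters, assuming a plain row-major reference loop: multiplying a stale NaN or Inf in y by 0.0 still yields NaN, so the result must be stored without reading y. The function name ref_dgemv and its signature are hypothetical and are not the BLIS kernel.

```c
#include <math.h>    /* NAN and INFINITY, handy when exercising the beta == 0 path */
#include <stddef.h>

/* Hypothetical reference gemv: y := beta*y + alpha*A*x.
 * A is m x n, stored row-major with leading dimension n. */
static void ref_dgemv(size_t m, size_t n, double alpha, const double *a,
                      const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double dot = 0.0;
        for (size_t j = 0; j < n; ++j)
            dot += a[i * n + j] * x[j];

        if (beta == 0.0)
            y[i] = alpha * dot;               /* overwrite: y[i] is never read */
        else
            y[i] = beta * y[i] + alpha * dot; /* general case: scale, then accumulate */
    }
}
```

Calling ref_dgemv with beta = 0.0 on a y that initially holds NAN or INFINITY produces finite results, whereas the unconditional form beta * y[i] + alpha * dot would propagate NaN into the output.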

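Two of the extension APIs listed above are defined by a single formula each: cabs1 computes res = |Re(z)| + |Im(z)| (commit 5243da5cec) and axpby computes y = ax + by (commit 6c9bf36424). The following is a minimal scalar sketch of those two definitions only, assuming unit-stride vectors. The type example_dcomplex and the functions ref_zcabs1 and ref_daxpby are hypothetical names for illustration, not the BLAS or CBLAS signatures added by those commits.

```c
#include <stddef.h>

/* Hypothetical complex type for illustration (real part, imaginary part). */
typedef struct { double real; double imag; } example_dcomplex;

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* cabs1 as described in the log: |Re(z)| + |Im(z)|.
 * Note this is not the Euclidean modulus sqrt(Re^2 + Im^2). */
static double ref_zcabs1(example_dcomplex z)
{
    return abs_d(z.real) + abs_d(z.imag);
}

/* axpby as described in the log: y := a*x + b*y, unit stride assumed. */
static void ref_daxpby(size_t n, double a, const double *x, double b, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}
```

For z = 3 - 4i, ref_zcabs1 returns 7, while the Euclidean modulus would be 5.
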
2174 Commits