amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Nallani Bhaskar	3a2e4c3db8	Added optimized single threaded dtrsm small for left cases Details: 1. Added optimized dtrsm kernels for all 8 left side cases. Below are a few notable optimizations which improved performance: a. Loading, transposing (for transa cases), packing and reusing of the a10 block required for the GEMM operation. The block size increases from 0 to 8X(m-8) in steps of 8x8 while solving TRSM from one end of triangular A to the other. b. Performing in-register transpose whenever required. c. Packing of 8 diagonal elements in one location helped to utilize the cache line efficiently. 2. Enabled calling dtrsm small for smaller sizes at the cblas level itself to avoid framework overhead, which is significant for very small sizes. 3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun and manideep.kurumella@amd.com for implementing lut kernels using intrinsics. 4. Removed all older implementations of strsm which were not developed as per the guidelines; they can be referred to in older releases if required. Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6	2021-05-18 16:16:00 +05:30
Nageshwar Singh	a88cb82cec	Revert "Adding trans h support in bench_gemm.c" This reverts commit `791903b31c`. Change-Id: I24403cced67ea9e851adb58a8bf01a3e17bb4e85	2021-05-07 04:11:30 -04:00
Meghana Vankadari	dc2d6ee763	Moved dynamic threading function from GEMMT to GEMM Details: - Current tuning for choosing optimal number of threads is done for GEMM. - Dynamic thread calculation function was placed in gemmt code flow instead of gemm by mistake. Fixing it with this commit. AMD-Internal: [CPUPL-1376] Change-Id: Iccb42a7a617b9b4cdb4c4af9be21aa82aaaabbcc	2021-05-07 12:10:53 +05:30
Meghana Vankadari	33ddf2e448	Fixed blastest failure for haswell configuration Details: - Placed optimized version of BLAS DGEMM, ZGEMM definitions under BLIS_CONFIG_EPYC as they use gemm small which are defined only for zen family configurations. - Added code to query and set cntx in gemv and trsv framework before cntx is referred for any function pointers to avoid querying from NULL pointer. AMD-Internal: [CPUPL-1562] Change-Id: I977d028ec4ddb57dcdc70e443e7708f36c01cca9	2021-05-07 01:49:54 -04:00
Meghana Vankadari	eea347b02e	Added dynamic threading support for GEMM SUP code path Details: - Introduced new feature called AOCL_DYNAMIC. - When this macro is defined, the optimum number of threads to solve DGEMM is estimated based on the dimensions (M,N,K). - Range of optimum number of threads will be [1, num_threads], where "num_threads" is the number of threads set by the application. - num_threads is derived from either the environment variables "OMP_NUM_THREADS" or "BLIS_NUM_THREADS", or the bli_set_num_threads() API. - Only the local copy of rntm is modified by the AOCL_DYNAMIC feature. The global_rntm data structure remains unchanged in order to keep track of the original number of threads set by the application. - Optimum number of threads calculation is done only for SUP. - Since the 'native' code path handles larger problem sizes, we use the max number of threads recommended by the application. AMD-Internal: [CPUPL-1376] Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3	2021-05-07 09:52:51 +05:30
Kiran Varaganti	433f17b6cd	bench_gemmt Bug Fix Fix reading input parameters Interchange the reading of n and k, first n appears and then k appears in the logs. Added comments to explain the format of the input gemmt log. Change-Id: I44c6081d4449ba210728bc089c4215d5eef18834	2021-05-06 14:54:15 +05:30
managalv	c420bd63e2	Enabled optimised packed routines on zen3 Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338	2021-05-06 01:25:08 -04:00
Madan mohan Manokar	c1fa9abe32	zgemm native path tuning 1. NC and MC values are tuned for both single-instance and multi-instance run. 2. zen2 and zen3 configs updated. 3. SUP path disabled for zgemm, since tuned native path performed better. To be re-enabled after setting right threshold for SUP selection. AMD-Internal: [CPUPL-1442] Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239	2021-05-06 01:15:35 -04:00
Dipal Madhukar Zambare	821fa267c9	Merge "Updated makefiles to fix issues introduced in merge" into amd-staging-milan-3.1	2021-05-05 23:42:15 -04:00
Meghana Vankadari	dc71602895	Merge "Added sup functionality for SYRK" into amd-staging-milan-3.1	2021-05-05 06:26:03 -04:00
Dipal M Zambare	7454cca9e7	Updated makefiles to fix issues introduced in merge - Updated Makefile to include DTL files in library build - Updated Makefile to include cpp header file installation - Updated test/makefile to include extra API added by AMD team. AMD-Internal: [CPUPL-1559] Change-Id: I249c6935d5ff5fb645f9deec7e0218575484be13	2021-05-05 14:59:15 +05:30
Nallani Bhaskar	f917d826b5	Updated test application to work with row major cblas Details: 1. Fixed reading leading dimensions in test_gemm.c based on row/col major 2. Reduced redundant code and adjusted alignment Change-Id: I8ca8c81223386fc21c6cc7c1d8f8a2109c9f5343	2021-05-02 23:09:13 +05:30
Meghana Vankadari	1303732e83	Added sup functionality for SYRK Details: - Added bli_syrksup function that internally uses gemmt implementation. - Modified OAPI of syrk to call SUP before proceeding to the conventional implementation. - Copied gemmsup threshold function for syrk temporarily. Thresholds are yet to be derived for syrk. Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d	2021-04-29 12:35:30 +05:30
Nallani Bhaskar	b239a5aee7	Bug fix in sgemmsup 1x16n kernel Details: Address increment was missing in bli_sgemmsup_rv_zen_asm_1x16 kernel while storing output in column major order in beta zero case JIRA: CPUPL-1548 Change-Id: I36269cd28de6fbef2256451e399f90f0437b0ce1	2021-04-28 21:33:30 +05:30
lcpu	7401effc03	BLIS:merge: Fixed merge conflicts that arose while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Satish Kumar Nuggu	743732c939	Merge "Added CBLAS API interface and memory alignment check" into amd-staging-milan-3.1	2021-04-26 04:27:21 -04:00
satish kumar nuggu	c0d16f70e4	Added CBLAS API interface and memory alignment check Details: 1. Added CBLAS API in test_gemm.c and test_trsm.c 2. Creating matrices with memory aligned leading dimensions irrespective of dimension. 3. Shifting powers of 2 aligned memory to non-powers of 2 by adding extra cache line size at the end. AMD-Internal: [CPUPL-1500] [CPUPL-1450] Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a	2021-04-26 13:44:30 +05:30
Field G. Van Zee	6a4aa986ff	Fixed typo in Table of Contents.	2021-04-23 13:10:01 -05:00
Field G. Van Zee	f6424b5b82	Added dedicated Performance section to README.md. Details: - Spun off the Performance.md and PerformanceSmall.md links in the Documentation section into a new Performance section dedicated to those two links. (The previous entries remain redundantly listed within Documentation section.) Thanks to Robert van de Geijn for suggesting this change.	2021-04-23 13:08:06 -05:00
Madan mohan Manokar	f6088ac1cf	Enabling 3m_sqp and 3m1 methods 1. Re-enabling 3m methods for zgemm. 2. Vectorization of pack_sum routines re-enabled with bug fix. 3. 8mx6n kernel added. AMD-Internal: [CPUPL-1352] Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810	2021-04-22 02:47:31 -04:00
Meghana Vankadari	d68f427ced	Merge "Added bench app for gemmt - input is a log file generated from AOCL DTL" into amd-staging-milan-3.1	2021-04-22 00:20:16 -04:00
Devin Matthews	40ce5fd241	Merge pull request #493 from cassiersg/patch-1 Fix typo in FAQ.md	2021-04-21 09:54:25 -05:00
Gaëtan Cassiers	1f3461a5a5	Fix typo in FAQ.md	2021-04-21 16:49:05 +02:00
Mangala V	9e147912ee	Merge "ZGEMM SUP: Removed unused assembly instructions" into amd-staging-milan-3.1	2021-04-19 03:08:31 -04:00
managalv	6c3741cd3e	ZGEMM SUP: Removed unused assembly instructions Removed memory operations which were unused. Modified labels to be unique to a file. Row stride update is done at once to avoid multiple mul instructions. AMD Internal : [CPUPL-1419] Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103	2021-04-19 20:31:03 +05:30
Meghana Vankadari	713ca659b5	Added bench app for gemmt - input is a log file generated from AOCL DTL Change-Id: Ia3390b529244f529d9741c86a6f8dc35a589f714	2021-04-19 09:40:24 +05:30
Manideep Kurumella	3c2c8157c9	Merge "SGEMV performance improvement." into amd-staging-milan-3.1	2021-04-12 01:22:48 -04:00
mkurumel	f8525a888e	SGEMV performance improvement. 1. bli_sdotxf_zen_int_8: used the hadd_ps intrinsic instead of dp_ps to add partial dot outputs. AMD Internal : [CPUPL-1512] Change-Id: I6e8e71a9cf8c1f30a1710dd1c67f193a998beb03	2021-04-12 10:47:23 +05:30
Madan mohan Manokar	997133dc11	sup zgemm improvement 1. In zgemm, mkernel outperforms nkernel for both m > n, and n > m. 2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on analysis done. Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e AMD-Internal: [CPUPL-1352]	2021-04-09 01:45:04 -04:00
Field G. Van Zee	6280757be3	Minor updates to a64fx section of Performance.md.	2021-04-07 13:03:56 -05:00
RuQing Xu	1e6ed823c6	Additional A64fx Comments (#490 ) * Performance.md Update A64fx Comments - Reason for ARMPL's missing data; - Additional envs / flags for kernel selection; - Update BLIS SRC commit. * Include Another Fix in armsve-cfg-vendor A prototype was forgotten, causing that void* pointer was not fully returned.	2021-04-07 12:59:26 -05:00
Field G. Van Zee	2688f21a5b	Added Fujitsu A64fx (512-bit SVE) perf results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on the "Fugaku" Fujitsu A64fx supercomputer at the RIKEN Center for Computational Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan Nassyr for their work in developing and optimizing A64fx support in BLIS and RuQing for gathering the performance data that is reflected in these new graphs.	2021-04-06 19:02:37 -05:00
Field G. Van Zee	ba3ba8da83	Minor updates and fixes to test/3/octave scripts. Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m.	2021-04-06 18:39:58 -05:00
Kiran Varaganti	a7b7fcae59	Merge "Fix test_dotv.c when complex-return=intel" into amd-staging-milan-3.1	2021-04-06 10:01:45 -04:00
Kiran Varaganti	657f58b82b	Fix test_dotv.c when complex-return=intel When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototype changes. Now "return parameter" will become the first argument of these functions and these functions return void. This fix addresses the change in the function declarations when --complex-return=intel is enabled. Missing Trace and Log statements for this configuration are now added. [CPUPL-1376] Change-Id: Ib420989da71839211c16088bf431a2ad775a3978	2021-04-06 15:38:12 +05:30
Madan Mohan Manokar	2aad3fbe55	Merge "disabled zgemm induced and gemm sqp temporarily." into amd-staging-milan-3.1	2021-04-05 05:51:12 -04:00
Madan mohan Manokar	7112b73d0d	disabled zgemm induced and gemm sqp temporarily. 1. mx1, mx4 kernel addition and framework modification. 2. 8mx6n kernel addition. 3. NULL check added in dgemm_sqp malloc. 4. mem tracing added. 5. Restricted 3m_sqp to limited matrix sizes. 6. Induced methods disabled temporarily for debug. AMD-Internal: [CPUPL-1352] Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2	2021-04-04 20:43:03 +05:30
Devin Matthews	90508192f2	Update do_sde.sh (#489 ) Update to a newer version of SDE, and do a direct download as it seems you don't have to click-through the license anymore.	2021-03-30 21:16:44 -05:00
Nicholai Tukanov	22c6b5dc4c	Fixed bug in power10 microkernel I/O. (#488 ) Details: - Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did not store the microtile result correctly due to incorrect indices calculations. (The error was introduced when I reorganized the 'kernels/power10/3' directory.)	2021-03-30 19:07:42 -05:00
Dipal M Zambare	5562a27823	Added check for zero dimensions and early return in ?gemv and ?scal API. If one of the passed dimensions is zero, these APIs return early, avoiding crashes in case any other pointer inputs are null. AMD-Internal: [SWLCSG-602] Change-Id: Ibe8902beef286410707a2a88e94b933b49975c85	2021-03-26 21:08:22 +05:30
Field G. Van Zee	159ca6f01a	Made test/3/octave scripts robust to missing data. Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m.	2021-03-24 15:57:32 -05:00
Field G. Van Zee	545e6c2f6d	CHANGELOG update (0.8.1)	2021-03-22 17:42:33 -05:00
Field G. Van Zee	8535b3e11d	Version file update (0.8.1)	2021-03-22 17:42:33 -05:00
Field G. Van Zee	e56d9f2d94	ReleaseNotes.md update in advance of next version.	2021-03-22 17:40:50 -05:00
Field G. Van Zee	ca83f955d4	CREDITS file update.	2021-03-22 17:21:21 -05:00
Field G. Van Zee	57ef61f6cd	Merge branch 'master' of github.com:flame/blis	2021-03-19 13:05:43 -05:00
Field G. Van Zee	bf1b578ea3	Reduced KC on skx from 384 to 256. Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.	2021-03-19 13:03:17 -05:00
Nicholai Tukanov	e7a4a8edc9	Fix calculation of new pb size (#487 ) Details: - Added missing parentheses to the i8 and i4 instantiations of the GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.	2021-03-17 19:43:31 -05:00
Field G. Van Zee	4493cf516e	Redefined BLIS_NUM_ARCHS to update automatically. Details: - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum value in the arch_t enum. This means that it no longer needs to get updated manually whenever new subconfigurations are added to BLIS. Also removed the explicit initial index assignment of 0 from the first enum value, which was unnecessary due to how the C language standard mandates indexing of enum values. Thanks to Devin Matthews for originally submitting this as a PR in #446. - Updated docs/ConfigurationHowTo.md to reflect the aforementioned change.	2021-03-15 13:12:49 -05:00
Field G. Van Zee	a4b73de84c	Disabled _self() and _equal() in bli_pthread API. Details: - Disabled the _self() and _equal() extensions to the bli_pthread API introduced in d479654. These functions were disabled after I realized that they aren't actually needed yet. Thanks to Devin Matthews for helping me reason through the appropriate consumer code that will appear in BLIS (eventually) in a future commit. (Also, I could never get the Windows branch to link properly in clang builds in AppVeyor. See the comment I left in the code, and #485, for more info.)	2021-03-12 19:47:39 -06:00
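
Commit 4493cf516e above describes turning BLIS_NUM_ARCHS from a hand-maintained macro into the last value of the arch_t enum. The general C idiom might be sketched as follows. This is an illustration only: the `ex_` names, the enumerators, and the lookup table are hypothetical and are not taken from the BLIS sources.

```c
#include <assert.h>

/* Hypothetical subconfiguration IDs (illustrative; not BLIS's real arch_t). */
typedef enum
{
    EX_ARCH_SKX,      /* no "= 0" needed: the first enumerator is 0 in C */
    EX_ARCH_HASWELL,
    EX_ARCH_ZEN3,
    EX_ARCH_GENERIC,

    EX_NUM_ARCHS      /* sentinel: equals the number of IDs listed above */
} ex_arch_t;

/* The table is sized by the sentinel, so its length tracks the enum
   whenever a new ID is added before EX_NUM_ARCHS. */
static const char* const ex_arch_names[EX_NUM_ARCHS] =
{
    "skx", "haswell", "zen3", "generic"
};

/* Return the name of a subconfiguration, rejecting out-of-range IDs. */
const char* ex_arch_name(ex_arch_t id)
{
    assert((int)id >= 0 && id < EX_NUM_ARCHS);
    return ex_arch_names[id];
}
```

Because the sentinel is computed by the compiler, adding an enumerator before it changes the count without any separate macro to keep in sync.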

2450 Commits
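
Commit 657f58b82b above notes that building with --complex-return=intel changes the zdotu_ and cdotu_ prototypes: the result becomes the first argument and the functions return void. The difference between the two conventions might look roughly like the sketch below. All `ex_` names are hypothetical, and the argument lists are simplified to pass-by-value integers; the real Fortran-style BLAS symbols take their scalar arguments by pointer.

```c
#include <assert.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double real; double imag; } ex_dcomplex;

/* Shared unconjugated dot product: rho = sum_i x[i] * y[i]. */
static ex_dcomplex ex_dotu_core(int n, const ex_dcomplex* x, int incx,
                                const ex_dcomplex* y, int incy)
{
    ex_dcomplex rho = { 0.0, 0.0 };
    assert(n >= 0 && incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
    {
        const ex_dcomplex a = x[i * incx];
        const ex_dcomplex b = y[i * incy];
        /* (a.r + a.i*i)(b.r + b.i*i) = (a.r*b.r - a.i*b.i) + (a.r*b.i + a.i*b.r)i */
        rho.real += a.real * b.real - a.imag * b.imag;
        rho.imag += a.real * b.imag + a.imag * b.real;
    }
    return rho;
}

/* Convention 1: the complex result is the function's return value. */
ex_dcomplex ex_zdotu_gnu(int n, const ex_dcomplex* x, int incx,
                         const ex_dcomplex* y, int incy)
{
    return ex_dotu_core(n, x, incx, y, incy);
}

/* Convention 2 (as the commit describes for complex-return=intel):
   the result is written through the first argument; returns void. */
void ex_zdotu_intel(ex_dcomplex* rho, int n, const ex_dcomplex* x, int incx,
                    const ex_dcomplex* y, int incy)
{
    *rho = ex_dotu_core(n, x, incx, y, incy);
}
```

A caller compiled against the wrong prototype would pass its arguments in the wrong positions, which is why a test driver has to switch declarations to match the build option.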