amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-12 01:59:59 +00:00

Author	SHA1	Message	Date
Dipal M Zambare	5d617429f4	Enabled znver4 support for GCC version >= 12 - Updated zen4 configuration to add -march=znver4 flag in the compiler options if the gcc version is above or equal to 12 AMD-Internal: [CPUPL-1937] Change-Id: Ic11470b92f71e49ee193a3a5406cf6045d66bd2f	2022-07-22 12:46:15 +05:30
Dipal M Zambare	2ba2fb2b63	Add AVX2 path for TRSM+GEMM combination. - Enabled AVX2 TRSM + GEMM kernel path: when GEMM is called from TRSM context it will invoke AVX2 GEMM kernels instead of the default AVX-512 GEMM kernels. - The default context has the block sizes for AVX512 GEMM kernels, however, TRSM uses AVX2 GEMM kernels and they need different block sizes. - Added new API bli_zen4_override_trsm_blkszs(). It overrides default block sizes in context with block sizes needed for AVX2 GEMM kernels. - Added new API bli_zen4_restore_default_blkszs(). It restores the block sizes to their default values (as needed by default AVX512 GEMM kernels). - Updated bli_trsm_front() to override the block sizes in the context needed by TRSM + AVX2 GEMM kernels and restore them to the default values at the end of this function. It is done in bli_trsm_front() so that we override the context before creating different threads. AMD-Internal: [CPUPL-2225] Change-Id: Ie92d0fc40f94a32dfb865fe3771dc14ed7884c55	2022-06-29 10:16:24 +00:00
Harish	2e4ed37e97	Added missing endif for the AOCC 4.0 version check Change-Id: I1c77ae795c398aec685152b491b838a75e7ce318	2022-06-28 13:01:53 +05:30
Harish	77e8492cbd	Added znver4 flag for config builds with AOCC 4.0 compiler version Change-Id: I45f1031ed4c5ea2e3f594713f3821d6bbbecd4df	2022-06-28 02:47:48 -04:00
Kiran Varaganti	47c344bc2e	DGEMM Benchmark Optimizations Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 dgemm kernel Change-Id: I56b3df238b6d85a6f6861448c0c6f907c972146a	2022-06-27 09:32:39 -04:00
Dipal M Zambare	c87b9aab75	Added support for AVX512 for Windows and AMAXV - Completed zen4 configuration support on windows - Enabled AVX512 kernels for AMAXV - Added zen4 configuration in amdzen for windows - Moved all zen4 kernels inside kernels/zen4 folder AMD-Internal: [CPUPL-2108] Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba	2022-06-08 11:09:48 +05:30
Dipal M Zambare	8cc15107ed	Enabled AVX-512 kernels for Zen4 config - Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for GEMM float and double types. - Enabled reference kernel for TRSM native path AMD-Internal: [CPUPL-2108] Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f	2022-06-03 06:34:35 +00:00
Dipal M Zambare	6e2f536590	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-05-17 20:35:40 +05:30
Dipal M Zambare	9c6c76613c	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2022-05-17 18:12:49 +05:30
Harsh Dave	349dcc459a	Fixed scalapack xcsep failure due to cdotxv kernel. - Failure was observed in zen configuration as gcc flag safe-math-optimization was being used for reference kernel compilation. - Optimized kernels were being compiled without this gcc flag, resulting in a computation difference and the test case failure. AMD-Internal: [CPUPL-2121] Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57	2022-05-17 18:10:40 +05:30
Harsh Dave	f17d043e1c	Implemented optimal dotxv kernel Details: - Intrinsic implementation of zdotxv, cdotxv kernel - Unrolling in multiples of 8, remaining corner cases are handled serially for zdotxv kernel - Unrolling in multiples of 16, remaining corner cases are handled serially for cdotxv kernel - Added declaration in zen contexts AMD-Internal: [CPUPL-2050] Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211	2022-05-17 18:10:39 +05:30
Arnav Sharma	caa5b37005	Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics Details: - Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics - Updated definitions zen context AMD-Internal: [CPUPL-2059] Change-Id: Ic657e4b66172ae459173626222af2756a4125565	2022-05-17 18:10:39 +05:30
Arnav Sharma	393effbb0c	Optimized ZAXPY2V using AVX2 Intrinsics Details: - Intrinsic implementation of ZAXPY2V fused kernel for AVX2 - Updated definitions in zen contexts AMD-Internal: [CPUPL-2023] Change-Id: I8889ae08c826d26e66ae607c416c4282136937fa	2022-05-17 18:08:57 +05:30
Dipal M. Zambare	b90420627a	Revert "Enabled AVX-512 kernels for Zen4 config" This reverts commit `62c96a4190`. Was committed without review.	2022-04-21 06:46:00 +00:00
Dipal M. Zambare	0adb525f5b	Revert "Enabled AVX-512 kernels for Zen4 config" This reverts commit `f816cf059f`. Was committed without review.	2022-04-21 06:45:38 +00:00
Dipal M. Zambare	f816cf059f	Enabled AVX-512 kernels for Zen4 config Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for float and double types. AMD-Internal: [CPUPL-2108] Change-Id: Idfe3f64a037db019cbdf43318954db52ad241a51	2022-04-21 06:38:24 +00:00
Dipal M. Zambare	62c96a4190	Enabled AVX-512 kernels for Zen4 config Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for float and double types. AMD-Internal: [CPUPL-2108]	2022-04-21 06:28:29 +00:00
Dipal M Zambare	f63f78d783	Removed Arch specific code from BLIS framework. - Removed BLIS_CONFIG_EPYC macro - The code dependent on this macro is handled in one of the three ways -- It is updated to work across platforms. -- Added in architecture/feature specific runtime checks. -- Duplicated in AMD specific files. Build system is updated to pick AMD specific files when library is built for any of the zen architecture AMD-Internal: [CPUPL-1960] Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b	2022-01-18 11:51:08 +05:30
Harsh Dave	79c6aa5643	Implemented optimal S/DCOMPLEX dotxf kernel - Optimized dotxf implementation for double and single precision complex datatype by handling dot product computation in tile 2x6 and 4x6 handling 6 columns at a time, and rows in multiple of 2 and 4. - Dot product computation is arranged such a way that multiple rho vector register will hold the temporary result till the end of loop and finally does horizontal addition to get final dot product result. - Corner cases are handled serially. - Optimal and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1975] Change-Id: I7dd305e73adf54100d54661769c7d5aada9b0098	2022-01-06 02:22:52 -05:00
Harsh Dave	8b5b2707c1	Optimized daxpy2v implementation - Optimized axpy2v implementation for double datatype by handling rows in multiples of 4 and storing the final computed result at the end of computation, preventing unnecessary stores and improving performance. - Optimal use and reuse of vector registers for faster computation. AMD-Internal: [CPUPL-1973] Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9	2022-01-05 06:37:22 -05:00
Arnav Sharma	3190e547b0	Optimized AXPBYV Kernel using AVX2 Intrinsics Details: - Intrinsic implementation of axpbyv for AVX2 - Bench written for axpbyv - Added definitions in zen contexts AMD-Internal: [CPUPL-1963] Change-Id: I9bc21a6170f5c944eb6e9e9f0e994b9992f8b539	2022-01-05 04:19:11 -05:00
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
lcpu	30038af896	Reverted: To fix accuracy issues for complex datatypes Details: -- reverted cscalv,zscalv,ctrsm,ztrsm changes to address accuracy issues observed by libflame and scalapack application testing. -- AMD-Internal: [CPUPL-1906], [CPUPL-1914] Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204	2021-11-12 08:58:57 +05:30
nphaniku	4af525a313	AOCL Windows BLIS : Windows build for dynamic dispatch library Change-Id: Ie05eafbeacbd5589b514d9353517330515104939	2021-11-12 08:58:57 +05:30
mkurumel	366ab66134	DNRM2 : Disable dnrm2 Fast math implementation. Details : - Accuracy failures observed when fast math and ILP64 are enabled. - Disabling the feature with macro BLIS_ENABLE_FAST_MATH . AMD-Internal: [CPUPL-1907] Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16	2021-11-12 08:58:56 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
Nageshwar Singh	a263146a4c	Optimized scalv for complex data-types c and z (cscalv and zscalv) AMD-Internal: [CPUPL-1551] Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9	2021-11-12 08:58:53 +05:30
mkurumel	595f7b7edf	dnrm2 optimization with dot method 1. Added new kernel bli_dnorm2fv_unb_var1 to compute norm with dot operation. 2. Added vectorization to compute square of 32 double element block size from vector X. 3. Defined a new macro BLIS_ENABLE_DNRM2_FAST under config header to compute nrm2 using new kernel. 4. Dot kernel definitions and implementation have a possibility for accuracy issues. We can switch to traditional implementation by disabling the macro BLIS_ENABLE_DNRM2_FAST to compute L2-norm for vector X. AMD-Internal: [CPUPL-1757] Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11	2021-11-12 08:58:53 +05:30
Dipal M Zambare	41a86c2463	Enabled znver3 flag for gcc version above 11 -- Added -march=znver3 flag if the library is built for zen3 configuration with gcc compiler version 11 or above. -- Replaced hardcoded compiler names 'gcc' and 'clang' with variable $CC so that options are chosen as per the compiler specified at configure time (instead of compiler in path). AMD-Internal: [CPUPL-1823] Change-Id: I2659349c998201ebd4480735c544e48a5ed76bb4	2021-11-12 08:58:48 +05:30
Abhiram S	7787bc79b1	Level1 samaxv: AVX512 implementation Details: 1. Unrolled by a factor 5. This gave around 1GFLOPS gain 2. Changed CMP to subs and remove nan. CMP uses a lot of compare, which is higher in latency and more number of instructions. Replacing with subs and remove nan reduced it to 3 instructions and lighter ones. 3. Added remove nan function. 4. Added AVX512 definition in skx context. 5. Disabled code in AMAXV kernel depending on AVX512 flag exists or not Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35	2021-09-27 16:10:08 +05:30
Harihara Sudhan S	bdb5e32176	Level 1 Kernel: damaxv AVX512 Details: - Developed damaxv for AVX512 extension - Implemented removeNAN function that converts NAN values to negative values based on the location - Usage COMPARE256/COMPARE128 avoided in AVX512 implementation for better performance - Unrolled the loop by order of 4. Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326	2021-09-02 09:47:08 +05:30
Dipal M Zambare	bcd9591b3f	Added support for amdepyc fat binary -- Created new configuration amdepyc to include fat binary which includes zen, zen2, zen3 and generic architecture for fallback. -- Updated amdepyc family makefiles to include macros needed in amdepyc family binary. This file must include all macros, compiler options to be used for non architecture specific code. -- Added 'workaround' to exclude ZEN family specific code in some of the framework files. There are still a lot of places where ZEN family specific code is added in framework files. They will be addressed with proper design later. - Moved definition of BLIS_CONFIG_EPYC from header files to makefile so that it is enabled only for framework and kernels -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC wherever it was needed. -- Removed unused, obsolete macros, some of them may be needed for debugging which can be added in the individual workspaces. - BLIS_DEFAULT_MR_THREAD_MAX - BLIS_DEFAULT_NR_THREAD_MAX - BLIS_ENABLE_ZEN_BLOCK_SIZES - BLIS_SMALL_MATRIX_THRES_TRSM - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES - BLIS_ENABLE_SUP_MR_EXT - BLIS_ENABLE_SUP_NR_EXT -- Corrected implementation of existing amd64_legacy configuration. AMD-Internal: [CPUPL-1626, CPUPL-1628] Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c	2021-08-26 15:13:49 +05:30
Meghana Vankadari	6bad157754	Added a new field in cntx to store l3 threshold function pointers Details: - Adding threshold function pointers to cntx gives flexibility to choose different threshold functions for different configurations. - In case of fat binary where configuration is decided at run-time, adding threshold functions under a macro enables these functions for all the configs under a family. This can be avoided by adding function pointers to cntx which can be queried from cntx during run-time based on the config chosen. Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f	2021-08-16 00:10:01 -04:00
Nallani Bhaskar	e328bdc549	Added prefetch in left cases of dtrsm small Details: 1. Added prefetching next micro-panel of A and B in dgemm block, which are helping in reducing load latency and improved performance. 2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core dgemm into macros and made it more modular 3. Packing and diagonal packing in main dgemm loops are modularized. Fringe cases are yet to modularize. 4. Updated dtrsm small thresholds for single and multi thread cases 5. Updated div/scale based on disable/enable of trsm pre-inversion 6. Code clean up Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df	2021-06-15 23:15:22 +05:30
Dipal M Zambare	21130ebece	Added configure option for AOCL Dynamic feature. - AOCL Dynamic feature is added in BLIS which determines optimal number of threads for the current problem size. - This feature can be enabled/disabled by modifying the source code - This change adds support to enable/disable this feature during configuration time by adding a new option in configure script AOCL-Internal : [CPUPL-1565] Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe	2021-05-12 00:41:13 -04:00
Meghana Vankadari	eea347b02e	Added dynamic threading support for GEMM SUP code path Details: - Introduced new feature called AOCL_DYNAMIC. - When this macro is defined, optimum number of threads to solve DGEMM is estimated based on the dimensions (M,N,K). - Range of optimum number of threads will be [1, num_threads], where "num_threads" is number of threads set by the application. - num_threads is derived from either environment variable "OMP_NUM_THREADS" or "BLIS_NUM_THREADS", or the bli_set_num_threads() API. - Only local copy of rntm is modified by AOCL_DYNAMIC feature. global_rntm data structure remains unchanged in order to keep track of original number of threads set by application. - Optimum number of threads calculation is done only for SUP. - Since 'native' code path handles larger problem sizes, we use max number of threads recommended by the application. AMD-Internal: [CPUPL-1376] Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3	2021-05-07 09:52:51 +05:30
managalv	c420bd63e2	Enabled optimised packed routines on zen3 Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338	2021-05-06 01:25:08 -04:00
Madan mohan Manokar	c1fa9abe32	zgemm native path tuning 1. NC and MC values are tuned for both single-instance and multi-instance run. 2. zen2 and zen3 configs updated. 3. SUP path disabled for zgemm, since tuned native path performed better. To be re-enabled after setting right threshold for SUP selection. AMD-Internal: [CPUPL-1442] Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239	2021-05-06 01:15:35 -04:00
lcpu	7401effc03	BLIS:merge: Merge conflicts that arose have been fixed while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Field G. Van Zee	bf1b578ea3	Reduced KC on skx from 384 to 256. Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.	2021-03-19 13:03:17 -05:00
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
Field G. Van Zee	f5871c7e06	Added complex asm packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in `426ad67`. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based.	2021-02-28 17:03:57 -06:00
Field G. Van Zee	426ad679f5	Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk).	2021-02-27 18:39:56 -06:00
Meghana Vankadari	943b1362c7	Enabled vectorized pack kernels for zen2 configuration. Details: - These kernels are implemented by Field G. Van Zee as part of TRSM SUP implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a. AMD-Internal: [CPUPL-1376] Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2	2021-02-12 19:16:57 +05:30
Dipal M Zambare	38a8008cd8	Enabled znver3 flag for zen3 architecture znver3 flag will be enabled if compiler is AOCC Clang version 3.0 and configuration is zen3 Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39 AMD-Internal: [CPUPL-1353]	2020-12-04 12:28:22 +05:30
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls printf() to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Dipal M Zambare	c2f63fcc54	Update amd64 bundle configuration The configuration is updated to - Enable EPYC architecture optimizations - Macros to override block sizes. AMD-Internal : [CPUPL-1350] Change-Id: Id712f9abe6e81c9ece2baaab9d965b405e72977a	2020-12-01 14:37:13 +05:30
Kumar, Phani	477fc41fff	Cmake script changes and blis.h changes for amd-staging-milan-3.0 AMD Internal : [CPUPL-1083] Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580	2020-11-24 06:12:25 -05:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH is set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. - Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00

499 Commits