amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-07-01 03:37:27 +00:00

Author	SHA1	Message	Date
Edward Smyth	8de8dc2961	Merge commit '81e10346' into amd-main * commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (#560) Fix insufficient pool-growing logic in bli_pool.c. (#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_* Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803	2024-06-25 05:48:46 -04:00
Edward Smyth	2450a1813b	BLIS: Implement zen5 sub-configuration Implement full support for zen5 as a separate BLIS sub-configuration and code path within amdzen configuration family. AMD-Internal: [CPUPL-3518] Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09	2024-04-12 07:26:31 -04:00
Edward Smyth	f5505be9f3	Merge commit 'e366665c' into amd-main * commit 'e366665c': Fixed stale API calls to membrk API in gemmlike. Fixed bli_init.c compile-time error on OSX clang. Fixed configure breakage on OSX clang. Fixed one-time use property of bli_init() (#525). CREDITS file update. Added Graviton2 Neoverse N1 performance results. Remove unnecesary windows/zen2 directory. Add vzeroupper to Haswell microkernels. (#524) Fix Win64 AVX512 bug. Add comment about make checkblas on Windows CREDITS file update. Test installation in Travis CI Add symlink to blis.pc.in for out-of-tree builds Revert "Always run `make check`." Always run `make check`. Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. Update POWER10.md Rework POWER10 sandbox Skip clearing temp microtile in gemmlike sandbox. Fix asm warning Sandbox header edits trigger full library rebuild. Add vhsubpd/vhsubpd. Fixed bugs in cpackm kernels, gemmlike code. Armv8A Rename Regs for Safe Darwin Compile Armv8A Rename Regs for Clang Compile: FP32 Part Armv8A Rename Regs for Clang Compile: FP64 Part Asm Flag Mingling for Darwin_Aarch64 Added a new 'gemmlike' sandbox. Updated Fugaku (a64fx) performance results. Add explicit compiler check for Windows. Remove `rm-dupls` function in common.mk. Travis CI Revert Unnecessary Extras from `91d3636` Adjust TravisCI Travis Support Arm SVE Added 512b SVE-based a64fx subconfig + SVE kernels. Replace bli_dlamch with something less archaic (#498) Allow clang for ThunderX2 config AMD-Internal: [CPUPL-2698] Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4	2023-10-18 09:09:54 -04:00
Meghana Vankadari	3a71550bc3	Enabling SUP blocksizes & kernels for generic config Details: - pack and compute extension APIs derive blocksizes(MR, NR...) from SUP cntx. - SUP blocksizes are not set for generic/skx configs. As a result pack and compute APIs cause floating point exceptions. - To fix these issues, we have enabled non-zero SUP blocksizes for generic config and zen4 SUP blocksizes for skx config. - However, these changes will not enable SUP path for skx/generic config as thresholds are set to zero. - To enable SUP path for skx config, more work is needed like non-zero thresholds and modifications to build system. Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9	2023-10-12 05:27:10 -04:00
Edward Smyth	85f2bf6c4a	Fix for x86_64 builds Configuration x86_64 includes all Intel and AMD sub-configurations. Fixes to enable this to work correctly again are: - In config_registry use amdzen rather than amd64 in x86_64 family. - Copy settings from config/amdzen/bli_family_amdzen.h to config/x86_64/bli_family_x86_64.h - Modify configure to set enable_aocl_zen=yes for x86_64, but not for amd64_legacy. - Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are enabled. Note: sub-configurations knl and bulldozer use instructions that are not supported on most x86_64 processors. AMD-Internal: [CPUPL-3838] Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04	2023-10-09 07:24:21 -04:00
Dipal M Zambare	8cc15107ed	Enabled AVX-512 kernels for Zen4 config - Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for GEMM float and double types. - Enabled reference kernel for TRSM native path AMD-Internal: [CPUPL-2108] Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f	2022-06-03 06:34:35 +00:00
Dipal M. Zambare	b90420627a	Revert "Enabled AVX-512 kernels for Zen4 config" This reverts commit `62c96a4190`. Was committed without review.	2022-04-21 06:46:00 +00:00
Dipal M. Zambare	62c96a4190	Enabled AVX-512 kernels for Zen4 config Enabled AVX-512 skylake kernels in zen4 configuration. AVX-512 kernels are added for float and double types. AMD-Internal: [CPUPL-2108]	2022-04-21 06:28:29 +00:00
Dipal M Zambare	fd8a3aace9	Added support for zen4 architecture - Added configuration option for zen4 architecture - Added auto-detection of zen4 architecture - Added zen4 configuration for all checks related to AMD specific optimizations AMD-Internal: [CPUPL-1937] Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a	2021-11-23 10:29:15 +05:30
Dipal M Zambare	3364c0e4eb	Binary and dynamic dispatch configuration name change -- Reverted changes made to include lp/ilp info in binary name This reverts commit `c5e6f885f0`. -- Included BLAS int size in 'make showconfig' -- Renamed amdepyc configuration to amdzen Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6	2021-11-12 08:58:56 +05:30
RuQing Xu	a4066f278a	Register firestorm into arm64 Metaconfig	2021-10-07 02:26:05 +09:00
Devin Matthews	079fbd42ce	Merge branch 'master' into arm64-hi-bw	2021-10-04 17:21:48 -05:00
Dave Love	d0a0b4b841	Arm micro-architecture dispatch (#344 ) Details: - Reworked support for ARM hardware detection in bli_cpuid.c to parse the result of a CPUID-like instruction. - Added a64fx support to bli_gks.c. - #include arm64 and arm32 family headers from bli_arch_config.h. - Fix the ordering of the "armsve" and "a64fx" strings in the config_name string array in bli_arch.c. The ordering did not match the ordering of the corresponding arch_t values in bli_type_defs.h, as it should have all along. - Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 subconfigs. - Updated arm64 and arm32 families in config_registry. - Updated docs/HardwareSupport.md to reflect added ARM support. - Thanks to Dave Love, RuQing Xu, and Devin Matthews for their contributions in this PR (#344).	2021-10-04 13:03:04 -05:00
Dipal M Zambare	bcd9591b3f	Added support for amdepyc fat binary -- Created new configuration amdepyc to include fat binary which includes zen, zen2, zen3 and generic architecture for fallback. -- Updated amdepyc family makefiles to include macros needed in amdepyc family binary. This file must include all macros, compiler options to be used for non architecture specific code. -- Added 'workaround' to exclude ZEN family specific code in some of the framework files. There are still lot of places were ZEN family specific code is added in framework files. They will be addressed with proper design later. - Moved definition of BLIS_CONFIG_EPYC from header files to makefile so that it is enabled only for framework and kernels -- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC wherever it was needed. -- Removed un-used, obsolete macros, some of them may be needed for debugging which can be added in the individual workspaces. - BLIS_DEFAULT_MR_THREAD_MAX - BLIS_DEFAULT_NR_THREAD_MAX - BLIS_ENABLE_ZEN_BLOCK_SIZES - BLIS_SMALL_MATRIX_THRES_TRSM - BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES - BLIS_ENABLE_SUP_MR_EXT - BLIS_ENABLE_SUP_NR_EXT -- Corrected implementation of exiting amd64_legacy configuration. AMD-Internal: [CPUPL-1626, CPUPL-1628] Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c	2021-08-26 15:13:49 +05:30
RuQing Xu	e38ca28689	Added Apple Firestorm (A14/M1) Subconfig - Use the same bulk kernel as Cortex-A53 / ThunderX2; - Larger block size; - Use gemmsup kernels for double precision.	2021-08-13 03:21:19 +09:00
RuQing Xu	61584deddf	Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FLMA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.	2021-05-19 09:52:29 -05:00
lcpu	7401effc03	BLIS:merge: Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Nicholai Tukanov	2d8ec164e7	Add POWER10 support to BLIS (#450 )	2020-09-29 16:52:18 -05:00
Dipal M Zambare	25d23cdda2	Zen3 support, disabled IR, JR loop parallelization AMD-Internal: [CPUPL-1013] Change-Id: I859152d63d1a56519c508dfa19587f25123e08b4	2020-07-24 20:55:47 +05:30
dzambare	9c7814da1c	Added support for zen3 configuration - User can now specify zen3 configuration, currently it reuses block sizes and kernels from zen2. - Auto configuration can detect and enable if zen3 config is needed - Added support for amd64 bundle which contains all zen platforms - Moved exiting amd bundle to amd64 legacy. AMD-Internal: [CPUPL-500, CPUPL-1013] Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957	2020-07-22 18:24:26 +05:30
Jeff Hammond	dd54e792a7	fix link to docs the comment contains an incorrect link, which is trivially fixed here. @fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.	2020-05-21 11:40:00 +05:30
Jeff Hammond	60de939deb	fix link to docs the comment contains an incorrect link, which is trivially fixed here. @fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.	2020-01-01 21:30:38 -08:00
Meghana	fb75044ea2	Removed zen and zen2 configurations from amd64 family amd64 family supports all the architectures before zen. Assigned (BLIS_ARCH_GENERIC+1) to BLIS_NUM_ARCHS in order to avoid update for every new architecture. Change-Id: I8241e643f6dfd0ebe272e053ca8b6a9c1463d9dc	2019-12-03 16:48:34 +05:30
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355 ) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
kdevraje	cac127182d	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id `565fa3853b`. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42	2019-06-24 14:05:54 +05:30
Kiran Varaganti	a23f92594c	config_registry: New AMD zen2 architecture configuration added. frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS] frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...). frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..). frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20 Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866	2019-05-22 05:28:16 -04:00
Nicholai Tukanov	78bc0bc8b6	Power9 sub-configuration (#298 ) Formally registered power9 sub-configuration. Details: - Added and registered power9 sub-configuration into the build system. Thanks to Nicholai Tukanov and Devangi Parikh for these contributions. - Note: The sub-configuration does not yet have a corresponding architecture-specific kernel set registered, and so for now the sub-config is using the generic kernel set.	2019-02-14 13:29:02 -06:00
Field G. Van Zee	adf5c17f08	Formally registered thunderx2 subconfiguration. Details: - Added a separate subconfiguration for thunderx2, which now uses different optimization flags than cortexa57/cortexa53.	2019-01-18 15:14:45 -06:00
Field G. Van Zee	c534da62c0	Disabled ARM configuration families in registry. Details: - Disabled (commented out) the arm32 and arm64 configuration families in the config_registry file. Having a configuration family registered only makes sense if BLIS is currently outfitted with runtime hardware detection logic to choose the appropriate sub-configuration. That logic is currently missing for ARM architectures, and thus having the ARM configuration families in the configuration registry only serves to confuse people. Thanks to Devangi Parikh for suggesting this change.	2018-12-05 15:51:05 -06:00
Field G. Van Zee	3c52725693	Renamed/moved l3 zen ukernels to haswell kernel set. Details: - Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and then updated the file contents to use the 'haswell' infix. - Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to above function renames. - Moved/updated the corresponding prototypes in bli_kernels_zen.h to bli_kernels_haswell.h. - Updated config_registry according to above changes. - NOTE: This rename reflects the fact that haswell microkernels are specifically written to overcome the floating-point latency for FMA instructions on Intel Haswell-like architectures, which can issue two FMA instructions per cycle. These ukernels happen to work fine on AMD Zen-based architectures. However, Zen only issues one FMA per cycle, which, while halving its floating-point throughput, gives it extra flexibility in the design of its microkernels--namely, mr and nr can be smaller and still overcome the floating-point latency for those single-issue cores. A smaller value of mr and nr allows for a larger value of kc, which may be useful in some situations. In the future, we may write such Zen-specific microkernels to take advantage of this additional flexibility.	2018-10-17 14:56:22 -05:00
Ye Luo	6722ec2181	Fix bgclang compilation on BGQ (#270 ) * Fix bgq kernels * Support bgq with bgclang	2018-10-17 11:26:00 -05:00
Field G. Van Zee	fb81c7fc66	Defined cortexa53 sub-configuration. Details: - Added a new sub-configuration 'cortexa53', which is a mirror image of cortexa57 except that it will use slightly different compiler flags. Thanks to Mathieu Poumeyrol for making this suggestion after discovering that the compiler flags being used by cortexa57 were not working properly in certain OS X environments (the fix to which is currently pending in pull request #245).	2018-09-06 16:29:39 -05:00
Field G. Van Zee	60366a3fab	Updates to knl kernels and related code. Details: - Imported the 24x16 knl sgemm microkernel (and its corresonding spackm kernel) from TBLIS and enabled its use in the knl sub-config. Also Added sgemm microkernel prototype to bli_kernels_knl.h. - Updated dgemm and dpackm microkernels from TBLIS, which included an important change regarding the offsets array (changed from extern declaration to static declaration/definition). - Activated use of level-1v and -1f zen kernels in skx and knl sub-configs. - Removed some old macros no longer needed in bli_family_skx.h now that libmemkind support exists in configure. - Moved bli_avx512_macros.h to frame/include and adjusted #includes in skx and knl kernels accordingly. - Moved unused kernels in kernels/knl/3 to kernels/knl/3/other directory. - Fixed a minor bug in the 'make' output per compile when verboseness is not turned on. The rule-generating function 'make-kernel-rule' was previously passing in the name of the config, rather than the name of the kernel set returned by get-config-for-kset, which could give misleading information to the user when the kconfig_map mapped a kernel set to a sub-configuration that did not share the same name. (This didn't affect the CFLAGS that were actually used.) - Updated test/3m4m/Makefile, removing acml targets and renaming the remaining targets.	2018-04-16 18:46:21 -05:00
Field G. Van Zee	786d15c5ef	Added skx, knl to x86_64 configuration family. Details: - Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration family in the config_registry file. - Added logic to configure that avoids committing certain sub-configs to the configuration/kernel registries if those sub-configs cannot be handled properly by the chosen compiler. (This was modeled after similar logic in TBLIS's configure; thanks to Devin Matthews for pointing this out.) First, the compiler and its version are inspected and, based on the results, certain configurations are added to a "blacklist". Then, as the configuration registries are being created, configurations and/or kernels that match items in the blacklist are skipped over and not commited to the registries. Under certain circumstances, omitting a blacklisted configuration will indirectly invalidate other configurations due to the loss of availability of the original blacklisted configuration's kernel set. This additional indirect blacklist is also accounted for. - Added output to the beginning of configure that echos information about the chosen compiler as well as the configurations that are blacklisted and must be stripped from the registries. - Various other cleanups in configure, especially with respect to explicitly declaring local variables in functions. - Comment updates to config/zen/make_defs.mk regarding choice of -march flags based on compiler version.	2018-04-04 16:06:47 -05:00
Field G. Van Zee	290dd4a9fe	Allow arbitrarily deep configuration families. Details: - Updated configure so that configuration families specified in the config_registry are no longer constrained as being only one level deep. For example, previously the x86_64 family could not be defined concisely in terms of, say, intel64 and amd64 families, and instead had to be defined as containing "haswell, sandybridge, penryn, zen, etc." In other words, families were constrained to only having singleton configurations as their members. That constraint is now lifted. - Redefined x86_64 family in config_registry in terms of intel64 and amd64.	2018-03-14 13:15:37 -05:00
Field G. Van Zee	1a3031740f	Updates to ARM hardware detection support. Details: - Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c. Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit) sub-configurations are supported. However, the functions that detect features specific to a15 and a9 are identical, and since a15 is tested first, it will always be chosen for arm32 hardware (even if both sub-configurations were enabled at configure-time and the library is linked and run on an a9). Thus, more work needs to be done to distinguish these two. - Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either the x86_64 or ARM code will be compiled (or neither, if neither environment is detected). - In bli_arch_query_id(), call bli_cpuid_query_id() when the BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined. - Added arm64 and arm32 configuration families to config_registry. - Added a note to the arch_t typedef enum in bli_type_defs.h reminding the developer to update the string array in bli_arch.c whenever new enum values are added or existing values are reordered.	2018-03-13 16:04:40 -05:00
Field G. Van Zee	34862aed89	Use zen kernels in haswell sub-configuration. Details: - Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv, dotxv, and scalv, as well asl level-1f zen intrinsic kernels for axpyf and dotxf. This works because these kernels simply target AVX/AVX2, and therefore work without modification on haswell hardware. - Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen kernels are essentially identical to those used by haswell, except that now zen kernels are a bit more up-to-date. In the future, I may continue to maintain duplicates, or I may keep the kernels named after one architecture (zen or haswell) but used by both sub-configurations. - In config_registry, enable use of both haswell and zen kernels for the haswell sub-configuration. This is necessary in order to make zen kernels visible when registering kernels in bli_cntx_init_haswell.c. - Enable use of assembly-based complex gemm microkernels for zen, bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in bli_cntx_init_zen.c. This was actually intended for `1681333`.	2018-02-28 15:30:14 -06:00
Field G. Van Zee	16813335bd	Merge branch 'amd' into rt Details: - Merged contributions made by AMD via 'amd' branch (see summary below). Special thanks to AMD for their contributions to-date, especially with regard to intrinsic- and assembly-based kernels. - Added column storage output cases to microkernels in bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with the extra cost of transposing the microtile in registers, this is much faster than using the general storage case when the underlying matrix is column-stored. - Added s and d assembly-based zen gemmtrsm_u microkernel (including column storage optimization mentioned above). - Updated zen sub-configuration to reflect presence of new native kernels. - Temporarily reverted zen sub-configuration's level-3 cache blocksizes to smaller haswell values. - Temporarily disabled small matrix handling for zen configuration family in config/zen/bli_family_zen.h. - Updated zen CFLAGS according to changes in `1e4365b`. - Updated haswell microkernels such that: - only one vzeroupper instruction is called prior to returning - movapd/movupd are used in leiu of movaps/movups for double-real microkernels. (Note that single-real microkernels still use movaps/movups.) - Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is now included via frame/include/bli_arch_config.h. - Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation in testsuite/src/test_amaxv.c). - Added early return for alpha == 0 in bli_dotxv_ref.c. - Integrated changes from `f07b176`, including a fix for undefined behavior when executing the 1m method under certain conditions. - Updated config_registry; no longer need haswell kernels for zen sub-configuration. - Tweaked marginal and pass thresholds for dotxf. - Reformatted level-1v, -1f, and -3 amd kernels and inserted additional comments. - Updated LICENSE file to explicitly mention that parts are copyright UT-Austin and AMD. - Added AMD copyright to header templates in build/templates. Summary of previous changes from 'amd' branch. - Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and s and d assembly-based zen gemmtrsm_l microkernels (d6x8). - Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv, and scalv, with extra-unrolling variants for axpyv and scalv. - Added a small matrix handler to bli_gemm_front(), with the handler implemented in kernels/zen/3/bli_gemm_small_matrix.c. - Added additional logic to sumsqv that first attempts to compute the sum of the squares via dotv(). If there is a floating-point exception (FE_OVERFLOW), then the previous (numerically conservative) code is used; otherwise, the result of dotv() is square-rooted and stored as the result. This new implementation is only enabled when FE_OVERFLOW is #defined. If the macro is not #defined, then the previous implementation is used. - Added axpyv and dotv standalone test drivers to test directory. - Added zen support to old cpuid_x86.c driver in build/auto-detect/old. - Added thread-local and __attribute__-related macros to bli_macro_defs.h.	2018-02-21 17:43:32 -06:00
dnp	4423e33dc5	Adding SKX kernels and configuration.	2017-12-06 16:35:03 -06:00
iotamudelta	b7ca580618	[WIP] Add x86 and x86_64 processor families. (#154 ) * Add x86 and x86_64 processor families. * Use generic config as fallback for more families. After discussion with fgvanzee, a) it's "generic" and 2) use it for all the families as a fallback. Goal is that if a specific CPU is not yet supported by a family (say a new Intel microarchitecture on x86_64), it'll fall through to still work with the slower "generic" kernels	2017-11-18 13:56:05 -06:00
Field G. Van Zee	d5bf79e50b	Miscellaneous tweaks and fixes. Details: - Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of bli_blksz_init_easy() that should have been bli_blksz_init(). - Fixed a bug in code that is supposed to output the list of sub-directories in the 'config' directory when configure script is run with no arguments. - Expanded the output of "make showconfig" to include more info from config.mk. - Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for someone to add excavator and zen support. - Added a link to the ConfigurationHowTo wiki to config_registry. - Other minor tweaks to configure.	2017-11-13 14:24:29 -06:00
Field G. Van Zee	07c352188b	Added "generic" configuration. Details: - Added a "generic" configuration that leaves the default blocksizes and kernels unchanged. This replaces the older "reference" configuration. Updated auto-detect script and code accordingly. - Added support for generic configuration to arch_t (bli_type_defs.h), bli_gks_init() (bli_gks.c), and bli_arch_config.h - Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h). - Whitespace changes to configurations' make_defs.mk files.	2017-10-23 16:59:22 -05:00
Field G. Van Zee	453deb2906	Implemented runtime kernel management. Details: - Reworked the build system around a configuration registry file, named config_registry', that identifies valid configuration targets, their constituent sub-configurations, and the kernel sets that are needed by those sub-configurations. The build system now facilitates the building of a single library that can contains kernels and cache/register blocksizes for multiple configurations (microarchitectures). Reference kernels are also built on a per-configuration basis. - Updated the Makefile to use new variables set by configure via the config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP, in determining which sub-configurations (CONFIG_LIST) and kernel sets (KERNEL_LIST) are included in the library, and which make_defs.mk files' CFLAGS (KCONFIG_MAP) are used when compiling kernels. - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel functions into a standard format that includes the kernel set name (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each kernels sub-directory. These files exist to provide prototypes for the kernels present in those directories. - Reorganized reference kernels into a top-level 'ref_kernels' directory. This directory includes a new source file, bli_cntx_ref.c (compiled on a per-configuration basis), that defines the code needed to initialize a reference context and a context for induced methods for the microarchitecture in question. - Rewrote make_defs.mk files in each configuration so that the compiler variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration basis. - Modified bli_config.h.in template so that bli_config.h is generated with #defines for the config (family) name, the sub-configurations that are associated with the family, and the kernel sets needed by those sub-configurations. - Deprecated all kernel-related information in bli_kernel.h and transferred what remains to new header files named "bli_arch_<configname>.h", which are conditionally #included from a new header bli_arch.h. These files are still needed to set library-wide parameters such as custom malloc()/free() functions or SIMD alignment values. - Added bli_cntx_init_<configname>.c files to each configuration directory. The files contain a function, named the same as the file, that initializes a "native" context for a particular configuration (microarchitecture). The idea is that optimized kernels, if available, will be initialized into these contexts. Other fields will retain pointers to reference functions, which will be compiled on a per-configuration basis. These bli_cntx_init_() functions will be called during the initialization of the global kernel structure. They are thought of as initializing for "native" execution, but they also form the basis for contexts that use induced methods. These functions are prototyped, along with their _ref() and _ind() brethren, by prototype-generating macros in bli_arch.h. - Added a new typedef enum in bli_type_defs.h to define an arch_t, which identifies the various sub-configurations. - Redesigned the global kernel structure (gks) around a 2D array of cntx_t structures (pointers to cntx_t, actually). The first dimension is indexed over arch_t and the inner dimension is the ind_t (induced method) for each microarchitecture. When a microarchitecture (configuration) is "registered" at init-time, the inner array for that configuration in the 2D array is initialized (and allocated, if it hasn't been already). The cntx_t slot for BLIS_NAT is initialized immediately and those for other induced method types are initialized and cached on-demand, as needed. At cntx_t registration, we also store function pointers to cntx_init functions that will initialize (a) "reference" contexts and (b) contexts for use with induced methods. We don't cache the full contexts for reference contexts since they are rarely needed. The functions that initialize these two kinds of contexts are generated automatically for each targeted sub-configuration from cpp-templatized code at compile-time. Induced method contexts that need "stage" adjustments can still obtain them via functions in bli_cntx_ind_stage.c. - Added new functions and functionality to bli_cntx.c, such as for setting the level-1f, level-1v, and packm kernels, and for converting a native context into one for executing an induced method. - Moved the checking of register/cache blocksize consistency from being cpp macros in bli_kernel_macro_defs.h to being runtime checks defined in bli_check.c and called from bli_gks_register_cntx() at the time that the global kernel structure's internal context is initialized for a given microarchitecture/configuration. - Deprecated all of the old per-operation bli__cntx.c files and removed the previous operation-level cntx_t_init()/_finalize() invocations. Instead, we now query the gks for a suitable context, usually via bli_gks_query_cntx(). - Deprecated support for the 3m2 and 3m3 induced methods. (They required hackery that I was no longer willing to support.) - Consolidated the 1e and 1r packm kernels for any given register blocksize into a single kernel that will branch on the schema and support packing to both formats. - Added the cntx_t* argument to all packm kernel signatures. - Deprecated the local function pointer array in all bli_packm_cxk*.c files and instead obtain the packm kernel from the cntx_t. - Added bli_calloc_intl(), which serves as the calloc-equivalent to to bli_malloc_intl(). Useful when we wish to allocate and initialize to zero/NULL. - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h, bli_cntx.h into static functions.	2017-10-18 13:29:32 -05:00

44 Commits