* commit '81e10346':
Alloc at least 1 elem in pool_t block_ptrs. (#560)
Fix insufficient pool-growing logic in bli_pool.c. (#559)
Arm SVE C/ZGEMM Fix FMOV 0 Mistake
SH Kernel Unused Eigher
Arm SVE C/ZGEMM Support *beta==0
Arm SVE Config armsve Use ZGEMM/CGEMM
Arm SVE: Update Perf. Graph
Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
A64FX Config Use ZGEMM/CGEMM
Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
Arm SVE Add SGEMM 2Vx10 Unindexed
Arm SVE ZGEMM Support Gather Load / Scatt. St.
Arm SVE Add ZGEMM 2Vx10 Unindexed
Arm SVE Add ZGEMM 2Vx7 Unindexed
Arm SVE Add ZGEMM 2Vx8 Unindexed
Update Travis CI badge
Armv8 Trash New Bulk Kernels
Enable testing 1m in `make check`.
Config ArmSVE Unregister 12xk. Move 12xk to Old
Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
Register firestorm into arm64 Metaconfig
Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
Add test for Apple M1 (firestorm)
Firestorm CPUID Dispatcher
Armv8 GEMMSUP Edge Cases Require Signed Ints
Make error checking level a thread-local variable.
Fix data race in testsuite.
Update .appveyor.yml
Firestorm Block Size Fixes
Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
Move unused ARM SVE kernels to "old" directory.
Add an option to control whether or not to use @rpath.
Fix $ORIGIN usage on linux.
Arm micro-architecture dispatch (#344)
Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
Armv8 Fix 6x8 Row-Maj Ukr
Apply patch from @xrq-phys.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
bli_error: more cleanup on the error strings array
Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
Fix config_name in bli_arch.c
Arm Whole GEMMSUP Call Route is Asm/Int Optimized
Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
Header Typo
Arm: DGEMMSUP ??r(rv) Invoke Edge Size
Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
Arm: Implement GEMMSUP Fallback Method
Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
Added Apple Firestorm (A14/M1) Subconfig
Arm64 8x4 Kernel Use Less Regs
Armv8-A Supplimentary GEMMSUP Sizes for RD
Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
Armv8-A Adjust Types for PACKM Kernels
Armv8-A GEMMSUP-RD 6x8m
Armv8-A GEMMSUP-RD 6x8n
Armv8-A s/d Packing Kernels Fix Typo
Armv8-A Introduced s/d Packing Kernels
Armv8-A DGEMMSUP 6x8m Kernel
Armv8-A DGEMMSUP Adjustments
Armv8-A Add More DGEMMSUP
Armv8-A Add GEMMSUP 4x8n Kernel
Armv8-A Add Part of GEMMSUP 8x4m Kernel
Armv8A DGEMM 4x4 Kernel WIP. Slow
Armv8-A Add 8x4 Kernel WIP
AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.
AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
* commit 'e366665c':
Fixed stale API calls to membrk API in gemmlike.
Fixed bli_init.c compile-time error on OSX clang.
Fixed configure breakage on OSX clang.
Fixed one-time use property of bli_init() (#525).
CREDITS file update.
Added Graviton2 Neoverse N1 performance results.
Remove unnecesary windows/zen2 directory.
Add vzeroupper to Haswell microkernels. (#524)
Fix Win64 AVX512 bug.
Add comment about make checkblas on Windows
CREDITS file update.
Test installation in Travis CI
Add symlink to blis.pc.in for out-of-tree builds
Revert "Always run `make check`."
Always run `make check`.
Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced.
Update POWER10.md
Rework POWER10 sandbox
Skip clearing temp microtile in gemmlike sandbox.
Fix asm warning
Sandbox header edits trigger full library rebuild.
Add vhsubpd/vhsubpd.
Fixed bugs in cpackm kernels, gemmlike code.
Armv8A Rename Regs for Safe Darwin Compile
Armv8A Rename Regs for Clang Compile: FP32 Part
Armv8A Rename Regs for Clang Compile: FP64 Part
Asm Flag Mingling for Darwin_Aarch64
Added a new 'gemmlike' sandbox.
Updated Fugaku (a64fx) performance results.
Add explicit compiler check for Windows.
Remove `rm-dupls` function in common.mk.
Travis CI Revert Unnecessary Extras from 91d3636
Adjust TravisCI
Travis Support Arm SVE
Added 512b SVE-based a64fx subconfig + SVE kernels.
Replace bli_dlamch with something less archaic (#498)
Allow clang for ThunderX2 config
AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
Details:
- pack and compute extension APIs derive blocksizes(MR, NR...) from
SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
thresholds and modifications to build system.
Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
enabled.
Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.
AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
- Added configuration option for zen4 architecture
- Added auto-detection of zen4 architecture
- Added zen4 configuration for all checks related
to AMD specific optimizations
AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
-- Reverted changes made to include lp/ilp info in binary name
This reverts commit c5e6f885f0.
-- Included BLAS int size in 'make showconfig'
-- Renamed amdepyc configuration to amdzen
Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
Details:
- Reworked support for ARM hardware detection in bli_cpuid.c to parse
the result of a CPUID-like instruction.
- Added a64fx support to bli_gks.c.
- #include arm64 and arm32 family headers from bli_arch_config.h.
- Fix the ordering of the "armsve" and "a64fx" strings in the
config_name string array in bli_arch.c. The ordering did not match
the ordering of the corresponding arch_t values in bli_type_defs.h,
as it should have all along.
- Added clang support to make_defs.mk in arm64, cortexa53, cortexa57
subconfigs.
- Updated arm64 and arm32 families in config_registry.
- Updated docs/HardwareSupport.md to reflect added ARM support.
- Thanks to Dave Love, RuQing Xu, and Devin Matthews for their
contributions in this PR (#344).
-- Created new configuration amdepyc to include fat binary which
includes zen, zen2, zen3 and generic architecture for fallback.
-- Updated amdepyc family makefiles to include macros needed
in amdepyc family binary. This file must include all macros,
compiler options to be used for non architecture specific code.
-- Added 'workaround' to exclude ZEN family specific code in some of
the framework files. There are still lot of places were ZEN family
specific code is added in framework files. They will be addressed
with proper design later.
- Moved definition of BLIS_CONFIG_EPYC from header files to
makefile so that it is enabled only for framework and kernels
-- Removed redundant flag AOCL_BLIS_ZEN, used BLIS_CONFIG_EPYC
wherever it was needed.
-- Removed un-used, obsolete macros, some of them may be needed for
debugging which can be added in the individual workspaces.
- BLIS_DEFAULT_MR_THREAD_MAX
- BLIS_DEFAULT_NR_THREAD_MAX
- BLIS_ENABLE_ZEN_BLOCK_SIZES
- BLIS_SMALL_MATRIX_THRES_TRSM
- BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES
- BLIS_ENABLE_SUP_MR_EXT
- BLIS_ENABLE_SUP_NR_EXT
-- Corrected implementation of exiting amd64_legacy configuration.
AMD-Internal: [CPUPL-1626, CPUPL-1628]
Change-Id: I46b0ab3ea3ac7d9ff737fef66c462e85601ee29c
Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically
tuned block size by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FLMA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels which would
improve performance in a few cases. However, those dgemmsup kernels
generally underperform hence they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR #424.
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
- User can now specify zen3 configuration,
currently it reuses block sizes and kernels from zen2.
- Auto configuration can detect and enable if zen3 config is needed
- Added support for amd64 bundle which contains all zen platforms
- Moved exiting amd bundle to amd64 legacy.
AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
the comment contains an incorrect link, which is trivially fixed here.
@fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.
the comment contains an incorrect link, which is trivially fixed here.
@fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.
amd64 family supports all the architectures before zen.
Assigned (BLIS_ARCH_GENERIC+1) to BLIS_NUM_ARCHS in order to avoid update for every new architecture.
Change-Id: I8241e643f6dfd0ebe272e053ca8b6a9c1463d9dc
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.
Formally registered power9 sub-configuration.
Details:
- Added and registered power9 sub-configuration into the build system.
Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
architecture-specific kernel set registered, and so for now the
sub-config is using the generic kernel set.
Details:
- Disabled (commented out) the arm32 and arm64 configuration families
in the config_registry file. Having a configuration family registered
only makes sense if BLIS is currently outfitted with runtime hardware
detection logic to choose the appropriate sub-configuration. That
logic is currently missing for ARM architectures, and thus having the
ARM configuration families in the configuration registry only serves
to confuse people. Thanks to Devangi Parikh for suggesting this
change.
Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
specifically written to overcome the floating-point latency for FMA
instructions on Intel Haswell-like architectures, which can issue two
FMA instructions per cycle. These ukernels happen to work fine on AMD
Zen-based architectures. However, Zen only issues one FMA per cycle,
which, while halving its floating-point throughput, gives it extra
flexibility in the design of its microkernels--namely, mr and nr can
be smaller and still overcome the floating-point latency for those
single-issue cores. A smaller value of mr and nr allows for a larger
value of kc, which may be useful in some situations. In the future,
we may write such Zen-specific microkernels to take advantage of this
additional flexibility.
Details:
- Added a new sub-configuration 'cortexa53', which is a mirror image
of cortexa57 except that it will use slightly different compiler
flags. Thanks to Mathieu Poumeyrol for making this suggestion after
discovering that the compiler flags being used by cortexa57 were
not working properly in certain OS X environments (the fix to which
is currently pending in pull request #245).
Details:
- Imported the 24x16 knl sgemm microkernel (and its corresonding spackm
kernel) from TBLIS and enabled its use in the knl sub-config. Also
Added sgemm microkernel prototype to bli_kernels_knl.h.
- Updated dgemm and dpackm microkernels from TBLIS, which included an
important change regarding the offsets array (changed from extern
declaration to static declaration/definition).
- Activated use of level-1v and -1f zen kernels in skx and knl
sub-configs.
- Removed some old macros no longer needed in bli_family_skx.h now that
libmemkind support exists in configure.
- Moved bli_avx512_macros.h to frame/include and adjusted #includes in
skx and knl kernels accordingly.
- Moved unused kernels in kernels/knl/3 to kernels/knl/3/other
directory.
- Fixed a minor bug in the 'make' output per compile when verboseness
is not turned on. The rule-generating function 'make-kernel-rule' was
previously passing in the name of the config, rather than the name of
the kernel set returned by get-config-for-kset, which could give
misleading information to the user when the kconfig_map mapped a
kernel set to a sub-configuration that did not share the same name.
(This didn't affect the CFLAGS that were actually used.)
- Updated test/3m4m/Makefile, removing acml targets and renaming the
remaining targets.
Details:
- Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration
family in the config_registry file.
- Added logic to configure that avoids committing certain sub-configs to
the configuration/kernel registries if those sub-configs cannot be
handled properly by the chosen compiler. (This was modeled after
similar logic in TBLIS's configure; thanks to Devin Matthews for
pointing this out.) First, the compiler and its version are inspected
and, based on the results, certain configurations are added to a
"blacklist". Then, as the configuration registries are being created,
configurations and/or kernels that match items in the blacklist are
skipped over and not commited to the registries. Under certain
circumstances, omitting a blacklisted configuration will indirectly
invalidate other configurations due to the loss of availability of
the original blacklisted configuration's kernel set. This additional
indirect blacklist is also accounted for.
- Added output to the beginning of configure that echos information
about the chosen compiler as well as the configurations that are
blacklisted and must be stripped from the registries.
- Various other cleanups in configure, especially with respect to
explicitly declaring local variables in functions.
- Comment updates to config/zen/make_defs.mk regarding choice of -march
flags based on compiler version.
Details:
- Updated configure so that configuration families specified in the
config_registry are no longer constrained as being only one level
deep. For example, previously the x86_64 family could not be defined
concisely in terms of, say, intel64 and amd64 families, and instead
had to be defined as containing "haswell, sandybridge, penryn, zen,
etc." In other words, families were constrained to only having
singleton configurations as their members. That constraint is now
lifted.
- Redefined x86_64 family in config_registry in terms of intel64 and
amd64.
Details:
- Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c.
Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit)
sub-configurations are supported. However, the functions that detect
features specific to a15 and a9 are identical, and since a15 is tested
first, it will always be chosen for arm32 hardware (even if both
sub-configurations were enabled at configure-time and the library is
linked and run on an a9). Thus, more work needs to be done to
distinguish these two.
- Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either
the x86_64 or ARM code will be compiled (or neither, if neither
environment is detected).
- In bli_arch_query_id(), call bli_cpuid_query_id() when the
BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined.
- Added arm64 and arm32 configuration families to config_registry.
- Added a note to the arch_t typedef enum in bli_type_defs.h reminding
the developer to update the string array in bli_arch.c whenever new
enum values are added or existing values are reordered.
Details:
- Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv,
dotxv, and scalv, as well asl level-1f zen intrinsic kernels for axpyf
and dotxf. This works because these kernels simply target AVX/AVX2,
and therefore work without modification on haswell hardware.
- Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen
kernels are essentially identical to those used by haswell, except that
now zen kernels are a bit more up-to-date. In the future, I may
continue to maintain duplicates, or I may keep the kernels named after
one architecture (zen or haswell) but used by both sub-configurations.
- In config_registry, enable use of both haswell and zen kernels for the
haswell sub-configuration. This is necessary in order to make zen
kernels visible when registering kernels in bli_cntx_init_haswell.c.
- Enable use of assembly-based complex gemm microkernels for zen,
bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in
bli_cntx_init_zen.c. This was actually intended for 1681333.
Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
Special thanks to AMD for their contributions to-date, especially with
regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
the extra cost of transposing the microtile in registers, this is
much faster than using the general storage case when the underlying
matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
- only one vzeroupper instruction is called prior to returning
- movapd/movupd are used in leiu of movaps/movups for double-real
microkernels. (Note that single-real microkernels still use
movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
comments.
- Updated LICENSE file to explicitly mention that parts are copyright
UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.
Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
sum of the squares via dotv(). If there is a floating-point exception
(FE_OVERFLOW), then the previous (numerically conservative) code is
used; otherwise, the result of dotv() is square-rooted and stored as
the result. This new implementation is only enabled when FE_OVERFLOW
is #defined. If the macro is not #defined, then the previous
implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.
* Add x86 and x86_64 processor families.
* Use generic config as fallback for more families.
After discussion with fgvanzee, a) it's "generic" and 2) use it for all the families as a fallback. Goal is that if a specific CPU is not yet supported by a family (say a new Intel microarchitecture on x86_64), it'll fall through to still work with the slower "generic" kernels
Details:
- Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of
bli_blksz_init_easy() that should have been bli_blksz_init().
- Fixed a bug in code that is supposed to output the list of sub-directories
in the 'config' directory when configure script is run with no arguments.
- Expanded the output of "make showconfig" to include more info from config.mk.
- Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for
someone to add excavator and zen support.
- Added a link to the ConfigurationHowTo wiki to config_registry.
- Other minor tweaks to configure.
Details:
- Added a "generic" configuration that leaves the default blocksizes and
kernels unchanged. This replaces the older "reference" configuration.
Updated auto-detect script and code accordingly.
- Added support for generic configuration to arch_t (bli_type_defs.h),
bli_gks_init() (bli_gks.c), and bli_arch_config.h
- Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h).
- Whitespace changes to configurations' make_defs.mk files.
Details:
- Reworked the build system around a configuration registry file, named
config_registry', that identifies valid configuration targets, their
constituent sub-configurations, and the kernel sets that are needed by
those sub-configurations. The build system now facilitates the building
of a single library that can contains kernels and cache/register
blocksizes for multiple configurations (microarchitectures). Reference
kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
in determining which sub-configurations (CONFIG_LIST) and kernel sets
(KERNEL_LIST) are included in the library, and which make_defs.mk files'
CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
functions into a standard format that includes the kernel set name
(e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
kernels sub-directory. These files exist to provide prototypes for the
kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
This directory includes a new source file, bli_cntx_ref.c (compiled on
a per-configuration basis), that defines the code needed to initialize
a reference context and a context for induced methods for the
microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
#defines for the config (family) name, the sub-configurations that are
associated with the family, and the kernel sets needed by those
sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
what remains to new header files named "bli_arch_<configname>.h", which
are conditionally #included from a new header bli_arch.h. These files
are still needed to set library-wide parameters such as custom
malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
The files contain a function, named the same as the file, that initializes
a "native" context for a particular configuration (microarchitecture). The
idea is that optimized kernels, if available, will be initialized into
these contexts. Other fields will retain pointers to reference functions,
which will be compiled on a per-configuration basis. These bli_cntx_init_*()
functions will be called during the initialization of the global kernel
structure. They are thought of as initializing for "native" execution, but
they also form the basis for contexts that use induced methods. These
functions are prototyped, along with their _ref() and _ind() brethren, by
prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
structures (pointers to cntx_t, actually). The first dimension is indexed
over arch_t and the inner dimension is the ind_t (induced method) for
each microarchitecture. When a microarchitecture (configuration) is
"registered" at init-time, the inner array for that configuration in the
2D array is initialized (and allocated, if it hasn't been already). The
cntx_t slot for BLIS_NAT is initialized immediately and those for other
induced method types are initialized and cached on-demand, as needed. At
cntx_t registration, we also store function pointers to cntx_init functions
that will initialize (a) "reference" contexts and (b) contexts for use with
induced methods. We don't cache the full contexts for reference contexts
since they are rarely needed. The functions that initialize these two kinds
of contexts are generated automatically for each targeted sub-configuration
from cpp-templatized code at compile-time. Induced method contexts that
need "stage" adjustments can still obtain them via functions in
bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
the level-1f, level-1v, and packm kernels, and for converting a native
context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
macros in bli_kernel_macro_defs.h to being runtime checks defined in
bli_check.c and called from bli_gks_register_cntx() at the time that the
global kernel structure's internal context is initialized for a given
microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
the previous operation-level cntx_t_init()/_finalize() invocations.
Instead, we now query the gks for a suitable context, usually via
bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
into a single kernel that will branch on the schema and support packing
to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to to
bli_malloc_intl(). Useful when we wish to allocate and initialize to
zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
bli_cntx.h into static functions.