Commit Graph

2155 Commits

Author SHA1 Message Date
bhaskarn
d186cfdf2e CPUPL-1074:
- Bug fix in sgemmsup 1x16 Kernel for Beta Zero and with C col storage
       rcx register incrementing was missing because of this 4 values
       in output are overwritten

Change-Id: Ia3028040dce3e615f1db5a331498d86faadcf916
2020-08-11 01:26:26 -04:00
Dipal M Zambare
7bbcae5a18 Fixed build issue in cpp testsuite.
This issue was caused by incorrect merging of cpp and testcpp files.

Change-Id: Idc40fbdaa55b6052a6a061d2d3e5cfae76b99916
AMD-Internal: [CPUPL-1067]
2020-08-10 12:17:09 +05:30
dzambare
3177db4888 Updated version number.
Change-Id: Iba3659b04f2d85ec7dc008ceb84da73c7c66530a
2020-08-06 15:00:46 +05:30
dzambare
f30c3c7766 mend
Change-Id: Iba3659b04f2d85ec7dc008ceb84da73c7c66530a
2020-08-06 14:29:24 +05:30
dzambare
267a959af1 Rebased amd-staging-milan-3.0 branch on master
-- Rebased on top of master commit # 6e522e5823
  -- Updated merged code to remove duplicated code added by auto-merging
  -- Updated merged code to rename bool_t type
  -- Updated merged code to rename bli_thread_obarrier
  -- Updated merged code to rename bli_thread_obroadcast

Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c
AMD-Internal: [CPUPL-1067]
2020-08-06 10:09:29 +05:30
phakumar
c7a914411f BLIS library porting on to Windows:
GEMMT changes porting on to Windows

AMD Internal : [CPUPL-1061]

Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a
2020-08-06 10:09:29 +05:30
Mangala V
5b8c2bc9e2 Revert "CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed"
This reverts commit 725bf5aceb.

Reason for revert: <INSERT REASONING HERE>

Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8
2020-08-06 10:09:29 +05:30
managalv
2a0928aad4 CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed
Details:
-  Problem:
	If row major, first four elements of last column on output matrix C was not updated
	If col major, first four elements of last row on output matrix C was not updated
- Solution:
	Updating elements after computation is done on right offset in bli_dgemmsup_rv_haswell_asm_5x8()

Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae
2020-08-06 10:09:29 +05:30
Dipal M Zambare
4f69332879 Added testsuite for gemmt APIs.
The testsuite coveres all combinations of upper, lower, transpose and API formats.

AMD Internal: [CPUPL-1021]

Change-Id: I2a1d79eba1dcaf4217fd9c2c346bd6173b80a782
2020-08-06 10:09:29 +05:30
Meghana Vankadari
4b8a84b9bc Added some optimizations for gemmt default path
Details:
- If there are any zero rows or columns along the edges of MCxNC block
of C, shrink the dimensions to avoid "no-op" iterations.
- For lower-triangle kernel variant, Added a flag to determine if a
  block that is strictly below triangle is reached. Once such block is
  reached, the flag is set and  all the blocks that are below it are
  strictly below the diagonal and flag is used to make decision.
- For upper-triangle kernel-variant, whenever a block that is strictly
  below the triangle is reached, break the for loop and go for next
  iteration of JR loop because all the blocks below it will also be
  strictly below diagonal and are filled with zeroes which requires
  no computation.

Change-Id: I606b0f900509aab6ed7ff30cefee9d7207b7b010
2020-08-06 10:09:29 +05:30
Meghana Vankadari
9f28a28cbf Added support to handle col-major storage of C in SUP kernel
Details:
- Unlike default path, storage scheme of C is not always row-major in
  SUP.
- Whenever C is col-major, the temporary buffer 'ct' is also chosen to be col-major.
- Since update routines only support row-major order, a transpose
  is induced for c and ct buffers before passing them to update routine.

Change-Id: I3fea10860f39632df7540c9399786e7aa1cfba37
2020-08-06 10:09:28 +05:30
dzambare
536edc4088 Added support for zen3 configuration
- User can now specify zen3 configuration,
      currently it reuses block sizes and kernels from zen2.
    - Auto configuration can detect and enable if zen3 config is needed
    - Added support for amd64 bundle which contains all zen platforms
    - Moved exiting amd bundle to amd64 legacy.

AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
2020-08-06 10:09:28 +05:30
Dipal M Zambare
bee39108d1 Added support for logging gemm input values.
Added BLIS specific extension to AOCL DTL, in this
added support to print the input matrix sizes from BLIS
library.

AMD Internal: [CPUPL-806]

Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561
2020-08-06 10:09:28 +05:30
dzambare
48b89f50e4 Checking for zero dimension is moved to bli_gemm_xx call.
This will ensure early return in case full gemm processing is not needed.

Based on dimension which is found to be zero following actions will be taken:

If 'c' has zero dimension, no further processing is requried
If alpha is zero or if 'a' or 'b' has zero diemension, we
perform scalm operation instead of gemm. (c = alpha*a + beta*b)

Change-Id: Icc031944fc4e80138adf991974547f2d57ab570b
AMD-Internal: [CPUPL-904]
2020-08-06 10:09:28 +05:30
Field G. Van Zee
88ac67a8cf Renamed bli_thread_obarrier(), _obroadcast().
Details:
- Renamed two bli_thread_*() APIs:
    bli_thread_obarrier()   -> bli_thread_barrier()
    bli_thread_obroadcast() -> bli_thread_broadcast()
  The 'o' was a leftover from when thrcomm_t objects tracked both
  "inner" and "outer" communicators. They have long since been
  simplified to only support the latter, and thus the 'o' is
  superfluous.

Change-Id: If9ec9a2383dfb02e1cfc74918f87a1fabddbd55b
2020-08-06 10:09:28 +05:30
Field G. Van Zee
26cd966af7 Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.

Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
2020-08-06 10:09:28 +05:30
Meghana Vankadari
01e1a41c95 Optimized DGEMV kernel and changed BLAS interface call
Details:
- Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of dgemv to remove dependency on cntx variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for DGEMV to reduce framework overhread.
- Currently these changes are applicable for zen2 configuration.
  They will be enabled for zen family processors in future.
- Changed naming convention for new BLAS macros to indicate their use.
- Added new optimized kernel for axpyf under zen2 folder.
- Implemented basic GEMV kernel without using axpyv or axpyf.
  This kernel is chosen for small sizes.

Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-885]
2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran
013d6a2cab Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API
Details:
    -Kernel is called directly from API call to avoid framework
     overhead in case of single and double precisions.
    -Currently these changes are applicable only for zen2 configuration.
     They will be enabled for zen family processors in future.
    -These changes improve performance of BLAS and CBLAS interfaces of API.
     They do not affect BLIS-specific APIs.
    -setv simd kernel is added for single and double precision elements

Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a
    Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
    AMD-Internal: [CPUPL-817]
2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran
0ecc147f03 Adding a simd kernel for copyv function
Details:
    - Separate kernel for copyv function added to improve performance.
    - Modified cntx_init file in zen and zen2 configuration
    - Added test_copyv.c in test folder
    - Modified test/Makefile to include test_copyv.c

Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-08-06 10:09:28 +05:30
Meghana
aee0376b16 Added opt kernels for SWAPV
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
 SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c

Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
2020-08-06 10:09:28 +05:30
Meghana
560bf86166 Made some critical changes to small_gemm kernels
Details:
- In case of GEMM, whenever beta is zero, we need to perform C = alpha
*(A * B) instead of C = beta * C + alpha * (A * B)
 Added conditions to check the value of beta at different levels inside
 small_gemm kernels and decide whether to perform scaling C with beta or
 not.
-Modified small_gemm kernels to use BLIS specific functions to retrieve
 different fields of objects.
-Calling bli_gemm_check before entering bli_gemm_small to facilitate
 early return in case of invalid inputs.
-For corner cases inside small_gemm kernels, a buffer called f_temp
 is used to load and store data to and from registers.
 populating the buffer with zeroes before use.
-In bli_gemm_front, datatypes of status and return value from
 bli_gemm_small are not matching.
 Corrected the datatype of the variable 'status' inside bli_gemm_front
 to err_t.

Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-689,SWLCSG-104]
2020-08-06 10:09:28 +05:30
Field G. Van Zee
cec95ea3a9 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.

Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
2020-08-06 10:09:28 +05:30
Field G. Van Zee
2839b88f00 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-08-06 10:09:28 +05:30
Meghana Vankadari
a03a0f8f70 Made framework changes to initialize specific cache block sizes for TRSM.
Details:
-This commit addresses the performance optimization(single-thread and
 multi-thread) for DTRSM on zen2.
-This new optimization employs different MC, KC & NC values for TRSM than
 what is being used in other Level-3 routines like DGEMM.
-Changed TRSM framework code to choose these blocksizes for TRSM
 on zen family configurations.
-Added a new field called "trsm_blkszs" to cntx structure in order to
 store TRSM specific block sizes.
-Implemented routines to initialize, set and query the TRSM-specific
 block sizes.
-Defined a new macro "AOCL_BLIS_ZEN" in configure script.
 This macro is automatically defined for zen family architectures.
 It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes.

Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran
6b5c68b9ed "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2020-08-06 10:09:28 +05:30
Kiran Varaganti
307ddc3110 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2020-08-06 10:09:28 +05:30
Nallani Bhaskar
447cdf5ea8 Added sup support for sgemm under zen and related frame work changes.
Change-Id: Ia7e88b96d3a3617e8d24754f50db081ffe2e9955
2020-08-06 10:09:28 +05:30
Kiran Varaganti
d334452df4 Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration.
This is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code clean

Change-Id: I6827b58d2dab1041fe182fef5a007b679ac4bb1f
2020-08-06 10:09:10 +05:30
kdevraje
8f2677ac14 Adding threshold condition to dgemm small matrix kernels.
Defining the constants in zen2 configuration

Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5
2020-08-06 10:08:04 +05:30
Kiran Varaganti
87d5166b28 Merged BLIS Release 1.3
Modified config/zen/make_defs.mk, now CKVECFLAGS     := -mavx2 -mfpmath=sse -mfma -march=znver1

Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81
2020-08-06 10:07:34 +05:30
sraut
e082fc2b1f Fix on EPYC machine for multi instance performance issue,
Issue: For the default values of mc, kc and nc with multi instance mode the performance across the cores dip drastically.
Fix: After experimentation found different set of values (mc, kc and nc) which fits in the cache size, and performance across the remains same across all the cores.

Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe
2020-08-06 10:07:08 +05:30
sraut
ea64906393 small matrix trsm intrinsics optimization code for AX=B and XA'=B
Change-Id: I90123c4d9adbd314c867995cd19dc975150b448c
2020-08-03 11:53:03 +05:30
Field G. Van Zee
068072b6b4 Updated BLASFEO results in PerformanceSmall.md.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
  using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
  accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
  containing the mpnpkp results do not need to be preprocessed in order
  to plot half the problem size range (ie: up to 400 instead of the
  800 range of the other shape cases).
- Trivial updates to runme.m.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
50c0ffe390 Added BLASFEO results to docs/PerformanceSmall.md.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
  and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
  docs/Performance.md.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
c873e206b3 CHANGELOG update (0.6.0) 2020-08-03 11:48:43 +05:30
Field G. Van Zee
89105695cc Added sup performance graphs/document to 'docs'.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
  publishes new performance graphs for Kaby Lake and Epyc showcasing
  the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
  For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
  graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
  in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
  document.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
b4e085c254 Minor build system housekeeping.
Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
  level Makefiles. This variable is already set within common.mk, and
  so the only time it should be overridden is if the user wants to link
  to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.
2020-08-03 11:48:42 +05:30
Field G. Van Zee
889b90888f Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2020-08-03 11:48:42 +05:30
Field G. Van Zee
ba8fde137c Updated Eigen results in docs/graphs with 3.3.90.
Details:
- Updated the level-3 performance graphs in docs/graphs with new Eigen
  results, this time using a development version cloned from their git
  mirror on March 27, 2019 (version 3.3.90). Performance is improved
  over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
- Very minor updates to docs/Performance.md and matlab scripts in
  test/3/matlab.
2020-08-03 11:48:06 +05:30
Field G. Van Zee
1bc2002706 Added Eigen results to performance graphs.
Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
  docs/graphs to report on Eigen implementations, where applicable.
  Specifically, Eigen implements all level-3 operations sequentially,
  however, of those operations it only provides multithreaded gemm.
  Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
  omitted. Thanks to Sameer Agarwal for his help configuring and
  using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.
2020-08-03 11:48:05 +05:30
Field G. Van Zee
58bede1460 Link to Eigen BLAS for non-gemm drivers in test/3.
Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
  Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
  this since Eigen's headers don't define implementations to the
  standard BLAS APIs.
- Simplified #included headers in hemm, herk, trmm, and trsm source
  driver files, since nothing specific to Eigen is needed at
  compile-time for those operations.
2020-08-03 11:47:40 +05:30
Field G. Van Zee
e6ba82f11c Add more support for Eigen to drivers in test/3.
Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
  EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
  library is not necessary.) However, as of Eigen 3.3.7, Eigen only
  parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
  any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
  (bli_does_trans()) was being called to determine whether the object
  for matrix A should be created for a left- or right-side case. This
  was corrected by changing the function to bli_is_left(), as is done
  in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.
2020-08-03 11:47:40 +05:30
Field G. Van Zee
7dced00166 CHANGELOG update (0.5.2) 2020-08-03 11:47:40 +05:30
Field G. Van Zee
ec6776a1b5 Added docs/Performance.md and docs/graphs subdir.
Details:
- Added a new markdown document, docs/Performance.md, which reports
  performance of a representative set of level-3 operations across a
  variety of hardware architectures, comparing BLIS to OpenBLAS and a
  vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
  in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.
2020-08-03 11:47:40 +05:30
Field G. Van Zee
5904e18d40 Updates to docs/Multithreading.md.
Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is
  disabled by default; and (b) even with multithreading enabled, the
  user must specify multithreading at runtime in order to observe
  parallelism. Thanks to M. Zhou for suggesting these clarifications
  in #292.
- Also made explicit that only the environment variable and global
  runtime API methods are available when using the BLAS API. If the
  user wishes to use the local runtime API (specify multithreading on
  a per-call basis), one of the native BLIS APIs must be used.
2020-08-03 11:47:18 +05:30
Field G. Van Zee
ecf05352f2 Annotated additional symbols for export.
Details:
- Added export annotations to additional function prototypes in order to
  accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
  test_amaxv.c.
2020-08-03 11:47:18 +05:30
Field G. Van Zee
2337bcb926 Support shared lib export of only public symbols.
Details:
- Introduced a new configure option, --enable-export-all, which will
  cause all shared library symbols to be exported by default, or,
  alternatively, --disable-export-all, which will cause all symbols to
  be hidden by default, with only those symbols that are annotated for
  visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
  symbols), to be exported. The default for this configure option is
  --disable-export-all. Thanks to Isuru Fernando for consulting on
  this commit.
- Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
  which was intended for 5a5f494.
- Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
  frame/include/bli_config_macro_defs.h.
- Provided appropriate logic within common.mk to implement variable
  symbol visibility for gcc, clang, and icc (to the extend that each of
  these compilers allow).
- Relocated --help text associated with debug option (-d) to configure
  slightly further down in the list.
2020-08-03 11:47:18 +05:30
Field G. Van Zee
e4d07f93db Removed export macros from all internal prototypes.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
  BLIS_EXPORT_BLIS from all function prototypes *except* those that we
  potentially wish to be exported in shared/dynamic libraries. In other
  words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
  functions that can be considered private or for internal use only.
  This is likely the last big modification along the path towards
  implementing the functionality spelled out in issue #248. Thanks
  again to Isuru Fernando for his initial efforts of sprinkling the
  export macros throughout BLIS, which made removing them where
  necessary relatively painless. Also, I'd like to thank Tony Kelman,
  Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
  participating in the initial discussion in issue #37 that was later
  summarized and restated in issue #248.
- CREDITS file update.
2020-08-03 11:47:18 +05:30
Isuru Fernando
f381c9a440 Export functions without def file (#303)
* Revert "restore bli_extern_defs exporting for now"

This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.

* Remove symbols not intended to be public

* No need of def file anymore

* Fix whitespace

* No need of configure option

* Remove export macro from definitions

* Remove blas export macro from definitions
2020-08-03 11:46:07 +05:30
Isuru Fernando
ceda852482 Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2020-08-03 11:42:15 +05:30