Commit Graph

218 Commits

Author SHA1 Message Date
Nageshwar Singh
dbd7b28373 Development of AVX2 axpyv kernels for c and z datatypes.
Details
    - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
    - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.

AMD-Internal: [CPUPL-1231]

Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
2020-10-23 09:33:35 +05:30
Nageshwar Singh
b245ea9c65 cblas_?cabs1 test cblas header file bug
Details:
    - Added BLIS_ENABLE_CBLAS around cblas header file.

AMD-Internal: [CPUPL-1129]

Change-Id: I3baacd26aa96c8eeb753d95210817ffe9b2a3f85
2020-10-13 01:43:55 -04:00
Meghana Vankadari
016885348c Added CBLAS interface and test file for gemm_batch API
AMD-Internal: [CPUPL-1184]
Change-Id: Icc5c41429b0d92f1a66a955769cc0518ca4706ee
2020-10-07 06:55:08 -04:00
Kiran Varaganti
aa56c36b82 Fixed Logs & code cleanup
Fixed AOCL DTL logs printing incorrect alpha and beta values for single
precision. Added missing info like data-types, lower or upper traingular and
Side parameter in the case of TRSM.
Code cleanup and formatting the files test_cabs1.c, test_axpbyv.c and
test_gemm.c. In dumping trsm parameters replaced 'side' with bli_is_right(side).

Change-Id: Ic81503ae696956eb074ec208f7109d1a394183d7
2020-10-04 22:17:31 +05:30
Nageshwar Singh
5243da5cec BLIS: CBLAS Extensions. cblas_?cabs1 : Absolute value of a complex number cabs1
Details:
   - added cblas extension cblas_?cabs1.
   - Functionality : res=|Re(z)|+|Im(z)|, z is a complex number, and res is a      value containing the absolute value of a complex number z.

AMD-Internal: [CPUPL-1129]

Change-Id: I4a3c265c89527c8fd3060c5d2ed38b1953ce6343
2020-09-23 19:01:02 +05:30
Meghana Vankadari
80828f6fda Added few changes in test_axpbyv.c file
Details:
- Corrected "#if" directive in line 89
- Commented out "#define print" to disable printing the vectors

Change-Id: I9ec3cbfb716540dd3e2264f5c3925d9e0c0c294a
2020-09-16 23:48:01 -04:00
Kiran Varaganti
5a8bd9f41c Enable BLAS interface in BLIS
Modified Makefile in test folder to enable calling BLAS interfaces for BLIS as
well. This is possible by replacing -DBLIS with -DBLAS=\"aocl\" in the
makefile. Also added linking to multi-threaded MKL library.

Change-Id: Iccf2ec99b48bb35da985b69218bc680f678ff7c9
2020-09-16 23:26:44 +05:30
Meghana Vankadari
6c9bf36424 Added BLAS and CBLAS interfaces for axpby API
Details:
- The axpby routines perform a vector-vector operation defined as
  y = a*x + b*y where a, b are scalars and x, y are vectors.
- This API is part of BLAS-like extension APIs

Change-Id: I17a53b03bba97de7ae1995a9f086084bd241bcdc
AMD-Internal: [CPUPL-1118]
2020-09-16 09:49:31 +05:30
dzambare
267a959af1 Rebased amd-staging-milan-3.0 branch on master
-- Rebased on top of master commit # 6e522e5823
  -- Updated merged code to remove duplicated code added by auto-merging
  -- Updated merged code to rename bool_t type
  -- Updated merged code to rename bli_thread_obarrier
  -- Updated merged code to rename bli_thread_obroadcast

Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c
AMD-Internal: [CPUPL-1067]
2020-08-06 10:09:29 +05:30
phakumar
c7a914411f BLIS library porting on to Windows:
GEMMT changes porting on to Windows

AMD Internal : [CPUPL-1061]

Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a
2020-08-06 10:09:29 +05:30
Field G. Van Zee
26cd966af7 Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.

Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
2020-08-06 10:09:28 +05:30
Field G. Van Zee
2839b88f00 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-08-06 10:09:28 +05:30
Field G. Van Zee
068072b6b4 Updated BLASFEO results in PerformanceSmall.md.
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
  using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
  accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
  containing the mpnpkp results do not need to be preprocessed in order
  to plot half the problem size range (ie: up to 400 instead of the
  800 range of the other shape cases).
- Trivial updates to runme.m.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
50c0ffe390 Added BLASFEO results to docs/PerformanceSmall.md.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
  and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
  docs/Performance.md.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
89105695cc Added sup performance graphs/document to 'docs'.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
  publishes new performance graphs for Kaby Lake and Epyc showcasing
  the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
  For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
  graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
  in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
  document.
2020-08-03 11:48:43 +05:30
Field G. Van Zee
889b90888f Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2020-08-03 11:48:42 +05:30
Field G. Van Zee
58bede1460 Link to Eigen BLAS for non-gemm drivers in test/3.
Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
  Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
  this since Eigen's headers don't define implementations to the
  standard BLAS APIs.
- Simplified #included headers in hemm, herk, trmm, and trsm source
  driver files, since nothing specific to Eigen is needed at
  compile-time for those operations.
2020-08-03 11:47:40 +05:30
Field G. Van Zee
e6ba82f11c Add more support for Eigen to drivers in test/3.
Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
  EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
  library is not necessary.) However, as of Eigen 3.3.7, Eigen only
  parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
  any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
  (bli_does_trans()) was being called to determine whether the object
  for matrix A should be created for a left- or right-side case. This
  was corrected by changing the function to bli_is_left(), as is done
  in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.
2020-08-03 11:47:40 +05:30
Field G. Van Zee
8db5abba14 Updates to 3m4m/matlab scripts.
Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
  pasting function invocations into matlab to generate plots that are
  presently of interest to us.
2020-08-03 11:37:41 +05:30
Chithra Sankar
7834ee191d test folder files reverted to previous commit 2020-08-03 11:32:37 +05:30
Chithra Sankar
1052413e10 Return typename corrected in dot function 2020-08-03 11:31:56 +05:30
Chithra Sankar
ea563b2259 Code Cleanup done; Test code updated to add performance measurement
Change-Id: I639f22659c22226fbd81e1669e4372f200ab5129
2020-08-03 11:31:14 +05:30
Field G. Van Zee
fd5db714f4 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-08-03 11:27:13 +05:30
Field G. Van Zee
6019cb8e23 Fail-safe updates to Makefiles in 'test' dir.
Details:
- Updated Makefiles in test, test/3, and test/sup so that running any of
  the usual targets without having first built BLIS results in a helpful
  error message. For example, if BLIS is not yet configured, make will
  output:

    Makefile:327: *** Cannot proceed: config.mk not detected! Run
    configure first.  Stop.

  Similarly, if BLIS is configured but not yet built, make will output:

    Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
    make first.  Stop.

  In previous commits, these actions would result in a rather cryptic
  make error such as:

    make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
    needed by 'blis-nat-st'.  Stop.
2020-08-03 11:23:40 +05:30
Field G. Van Zee
908887358e Support ldims, packing in sup/test drivers.
Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
  building matrices with small or large leading dimensions, and updated
  runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
  default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
  environment variables), and for capturing output to files that encode
  both the leading dimension (small or large) and packing status into
  the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
  into test/sup/octave and updated the octave code in that consolidated
  directory to read the new output filename format (encoding ldim and
  packing). Also added comments and streamlined code, particularly in
  plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.
2020-08-03 11:22:33 +05:30
Field G. Van Zee
967ddf2847 Updated sup performance graphs; added mt results.
Details:
- Reran all existing single-threaded performance experiments comparing
  BLIS sup to other implementations (including the conventional code
  path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
  showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
  (Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
2020-08-03 11:11:52 +05:30
Field G. Van Zee
573d99d05e Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.
2020-08-03 11:11:15 +05:30
Field G. Van Zee
a20b5e3c60 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
2020-08-03 11:10:38 +05:30
Meghana Vankadari
6a0a65ee23 Added sup kernels and code path for gemmt similar to GEMM.GEMMT now also supports complex data types.
Details:
- Added framework code for GEMMT SUP.
- Implemented SUP for GEMMT using similar techniques as native path.
- Moved update routines to frame/util folder.
- Ported update routines for complex datatypes.

Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>,
	       Dipal M Zambare <DipalMadhukar.Zambare@amd.com>,
	       Mangala V <managala.v@amd.com>
2020-07-13 16:26:32 +05:30
Meghana Vankadari
f59d4befb5 Added framework support and interface APIs for GEMMT
Details:
- Added new API Which Computes a matrix-matrix product with general matrices
  but updates only the upper or lower triangular part of the result matrix. 
  cblas_?gemmt() and ?gemmt_().
- These routines are similar to the ?gemm routines, but they only access
  and update a triangular part of the square result matrix.
- Added DGEMMT functionality by reusing GEMM kernels.
- Created a new folder for GEMMT under l3, and added GEMMT specific
  framework code.
- Modified cntl_create routine to choose different macro kernel for
  GEMMT.
- Added routines to copy lower/upper triangular part of a block to the
  buffer.
- Defined BLIS, BLAS and CBLAS interface APIs for GEMMT.
- Added test_gemmt.c to test folder and Updated the Makefile.
- Added a macro 'CBLAS' in test_gemm.c to call CBLAS APIs.

Change-Id: Ie00c1a15b9c654b65c687a9ca781cbc6f9641791
2020-07-06 00:51:16 -04:00
phakumar
ccf0772d6e BLIS library porting on to Windows:
This library ported on Windows 10 using CMake scripts and Visual Studio 2019 with clang compiler
 AMD internal:[CPUPL-657]

Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
2020-06-16 18:29:00 +05:30
Field G. Van Zee
3597284b9d Updates, tweaks to runme.sh in test/1m4m.
Details:
- Made several updates to test/1m4m/runme.sh, including:
  - Added missing handling for 1m and 4m1a implementations when setting
    the BLIS_??_NT environment variables.
  - Added support for using numactl to run the test executables.
  - Several other cleanups.
2020-05-21 11:54:53 +05:30
Field G. Van Zee
c7faae9442 Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.

Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
2020-05-21 11:50:19 +05:30
Field G. Van Zee
6d369532e3 Updated sup[mt] Makefiles for variable dim ranges.
Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
  different problem size ranges for the drivers where one, two, or three
  matrix dimensions is large. This will facilitate the generation of
  more meaningful graphs, particularly when two dimensions are tiny.
2020-05-21 11:46:36 +05:30
Field G. Van Zee
2096f41aa6 Updates to octave scripts in test/sup[mt]/octave.
Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
  with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
  which were not only unnecessary but also causing issues with versions
  5.x.
2020-05-21 11:46:35 +05:30
Devrajegowda, Kiran
6f33fd6aac Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API
Details:
    -Kernel is called directly from API call to avoid framework
     overhead in case of single and double precisions.
    -Currently these changes are applicable only for zen2 configuration.
     They will be enabled for zen family processors in future.
    -These changes improve performance of BLAS and CBLAS interfaces of API.
     They do not affect BLIS-specific APIs.
    -setv simd kernel is added for single and double precision elements

Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a
    Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
    AMD-Internal: [CPUPL-817]
2020-05-13 01:51:48 -04:00
Devrajegowda, Kiran
4caee59466 Adding a simd kernel for copyv function
Details:
    - Separate kernel for copyv function added to improve performance.
    - Modified cntx_init file in zen and zen2 configuration
    - Added test_copyv.c in test folder
    - Modified test/Makefile to include test_copyv.c

Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 01:55:25 -04:00
Meghana
b846059bcf Added opt kernels for SWAPV
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
 SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c

Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
2020-04-20 01:21:44 -05:00
Meghana
b5fe75e104 Closing input and output files in test_gemm.c and test_trsm.c
Change-Id: I75cdd5adc2bd2dac7d0eca9c050e06dbd52bec26
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-24 09:09:58 +05:30
Meghana Vankadari
ddcb3d8a52 Modified test_trsm.c file in test folder to read input sizes from a file
Details:
-A Macro 'FILE_IN_OUT' is defined to read matrix dimensions and strides from a csv file.
 Format for input file if 'FILE_IN_OUT' is defined:
 Each line defines a TRSM problem with the following parameters: m n cs_a cs_b
 The operation implemented by default is AX=B where A is lower-triangular and matrices are in column-major order.
 When macro is disabled, it reverts back to original implementation.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv
-A macro 'READ_ALL_PARAMS_FROM_FILE' is defined to read all the parameters for TRSM from a csv file.
 This macro can be defined only when 'FILE_IN_OUT' is already defined.
 Format for the input file if 'READ_ALL_PARAMS_FROM_FILE' is defined:
 Each line defines a TRSM problem with the following paramenters: sideA uploA transA diagA m n cs_a cs_b
 By default, column-major order is chosen as storage scheme for matrices.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv

Change-Id: I349bc69ca968911c16e04d1ce70974d01e65a2fb
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-20 07:15:17 -04:00
Field G. Van Zee
1a284828d1 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-03-13 01:09:29 -04:00
Devrajegowda, Kiran
b074c5e09c Added a macro MATRIX_INITIALISATION for matrix initialisation in test application
Change-Id: I8e5c9902e603a549218d4e8509a481288792266d
2019-12-01 13:12:02 +05:30
Devrajegowda, Kiran
c4047e491a Merge branch 'amd-blis-nov-mergetest' into amd-staging-rome2.1
Change-Id: I1e04592dd9494faa34555008dd1edbca8a092a44
2019-11-29 23:01:51 +05:30
Devrajegowda, Kiran
85fa9e4107 resolved merge conflicts when merged with public repo master branch
Change-Id: Iad6ba809680ba5081cc9d7879794ef58cc8f8a40
2019-11-25 14:46:48 +05:30
Field G. Van Zee
c84391314d Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
  directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
2019-11-04 13:57:12 -06:00
Nicholai Tukanov
b426f9e04e POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel 
  assumes that elements of B have been duplicated/broadcast during the
  packing step. The microkernel uses a column orientation for its 
  microtile vector registers and thus implements column storage and 
  general stride IO cases. (A row storage IO case via in-register
  transposition may be added at a future date.) It should be noted that 
  we recommend using this microkernel with gcc and *not* xlc, as issues 
  with the latter cropped up during development, including but not 
  limited to slightly incompatible vector register mnemonics in the GNU 
  extended inline assembly clobber list.
2019-11-01 17:57:03 -05:00
Kiran Varaganti
97a4236c82 Matrices are not initialized when inputs dimensions are fed through file, now these are fixed. test_gemm.c contains matrices initialized for file-based inputs as well.
Change-Id: I4c3625a51dcbf64c99f56f354dcd898e66035cb1
2019-10-24 13:57:55 +05:30
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertantly not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
Field G. Van Zee
80e6c10b72 Added reproduction section to Performance docs.
Details:
- Added section titled "Reproduction" to both Performance.md and
  PerformanceSmall.md that briefly nudges the motivated reader in the
  right direction if he/she wishes to run the same performance
  benchmarks used to produce the graphs shown in those documents.
  Thanks to Dave Love for making this suggestion.
2019-08-29 12:12:08 -05:00
Field G. Van Zee
b02e0aae8c Updated test drivers to iterate backwards.
Details:
- Updated test driver source in test, test/3, test/1m4m, and
  test/mixeddt to iterate through the problem space backwards. This
  can help avoid certain situations where the CPU frequency does not
  immediately throttle up to its maximum. Thanks to Robert van de
  Geijn for recommending this fix (originally made to test/sup drivers
  in 57e422a).
- Applied off-by-one matlab output bugfix from b6017e5 to test drivers
  in test, test/3, test/1m4m, and test/mixeddt directories.
2019-08-27 14:37:46 -05:00