Commit Graph

2104 Commits

Author SHA1 Message Date
prangana
ba3299cd97 checkcpp test rule in Makefile
Change-Id: If01fe55e258e563a96cd8da9ea93d21063b730c2
2020-08-03 11:34:58 +05:30
Chithra Sankar
7834ee191d test folder files reverted to previous commit 2020-08-03 11:32:37 +05:30
Chithra Sankar
1052413e10 Return typename corrected in dot function 2020-08-03 11:31:56 +05:30
Chithra Sankar
ea563b2259 Code Cleanup done; Test code updated to add performance measurement
Change-Id: I639f22659c22226fbd81e1669e4372f200ab5129
2020-08-03 11:31:14 +05:30
Field G. Van Zee
c04966262b Mention disabling of sup in docs/Sandboxes.md.
Details:
- Added language to remind the reader to disable sup if the intended
  behavior is for the sandbox implementation to handle all problem
  sizes, even the smaller ones that would normally be handled by the
  sup code path.
2020-08-03 11:27:13 +05:30
Field G. Van Zee
fd5db714f4 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-08-03 11:27:13 +05:30
Field G. Van Zee
14d95a2183 Redefined bool_t typedef in terms of C99 bool.
Details:
- Changed the typedef that defines bool_t from:

    typedef gint_t bool_t;

  where gint_t is a signed integer that forms the basis of most other
  integers in BLIS, to:

    typedef bool bool_t;

- Changed BLIS's TRUE and FALSE macro definitions from being in terms of
  integer literals:

    #define TRUE  1
    #define FALSE 0

  to being in terms of C99 boolean constants:

    #define TRUE  true
    #define FALSE false

  which are provided by stdbool.h.
- This commit constitutes the second phase of a transition toward using
  C99's bool instead of bool_t, which will address issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7.
2020-08-03 11:23:40 +05:30
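The practical effect of the typedef change in 14d95a2183 is in how non-zero values convert. The following is a minimal sketch, assuming `int64_t` as a stand-in for `gint_t` (whose real width depends on how BLIS is configured); `old_bool_t`, `new_bool_t`, and the two helper functions are illustrative names, not BLIS identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t old_bool_t; /* stand-in for: typedef gint_t bool_t; */
typedef bool    new_bool_t; /* stand-in for: typedef bool   bool_t; */

/* The C99 constants are still the integers 1 and 0, so redefining
   TRUE/FALSE in terms of true/false keeps their numeric values. */
static_assert(true == 1 && false == 0, "C99 boolean constants");

/* An integer-based boolean keeps the original value, so a "truthy"
   value of 2 does not compare equal to 1. */
static inline int old_is_true(int v) { return (old_bool_t)v == 1; }

/* Conversion to C99 bool normalizes every non-zero value to true. */
static inline int new_is_true(int v) { return (new_bool_t)v == true; }
```

With these definitions, `old_is_true(2)` is 0 while `new_is_true(2)` is 1, which is why typecasts had to be audited before the typedef could change.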
Field G. Van Zee
6019cb8e23 Fail-safe updates to Makefiles in 'test' dir.
Details:
- Updated Makefiles in test, test/3, and test/sup so that running any of
  the usual targets without having first built BLIS results in a helpful
  error message. For example, if BLIS is not yet configured, make will
  output:

    Makefile:327: *** Cannot proceed: config.mk not detected! Run
    configure first.  Stop.

  Similarly, if BLIS is configured but not yet built, make will output:

    Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
    make first.  Stop.

  In previous commits, these actions would result in a rather cryptic
  make error such as:

    make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
    needed by 'blis-nat-st'.  Stop.
2020-08-03 11:23:40 +05:30
Field G. Van Zee
8b9257df67 Cleaned up bool_t usage and various typecasts.
Details:
- Fixed various typecasts in

    frame/base/bli_cntx.h
    frame/base/bli_mbool.h
    frame/base/bli_rntm.h
    frame/include/bli_misc_macro_defs.h
    frame/include/bli_obj_macro_defs.h
    frame/include/bli_param_macro_defs.h

  that were missing or being done improperly/incompletely. For example,
  many return values were being typecast as
    (bool_t)x && y
  rather than
    (bool_t)(x && y)
  Thankfully, none of these deficiencies had manifested as actual bugs
  at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
  This reflects the fact that bli_env_get_var() needs to be able to
  return a signed integer, and even though dim_t is currently defined
  as a signed integer, it does not intuitively appear to necessarily be
  signed by inspection (i.e., an integer named "dim_t" for matrix
  "dimension"). Also, updated use of bli_env_get_var() within
  bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
  and added comments to the bli_thrcomm_*.h files that will explain a
  planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
  'bool' for 'bool_t', which will eliminate the namespace conflict with
  arm_sve.h as reported in issue #420. This commit implements the first
  phase of that transition. Thanks to RuQing Xu for reporting this
  issue.
- CREDITS file update.
2020-08-03 11:23:40 +05:30
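The typecast issue described in 8b9257df67 comes down to operator precedence: a cast binds tighter than `&&`. A small sketch, assuming `int64_t` as a hypothetical stand-in for the integer type behind the old `bool_t` (the names below are illustrative, not taken from BLIS):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t gint_like_t; /* hypothetical stand-in for gint_t */

/* The demonstration relies on the stand-in being wider than int. */
static_assert(sizeof(gint_like_t) > sizeof(int), "stand-in wider than int");

/* (T)x && y casts only x; the && result is a plain int. */
static inline size_t size_cast_operand(int x, int y)
{
    return sizeof((gint_like_t)x && y);
}

/* (T)(x && y) casts the whole logical expression to T. */
static inline size_t size_cast_whole(int x, int y)
{
    return sizeof((gint_like_t)(x && y));
}

/* Both forms produce the same truth value, which is consistent with
   the commit's remark that no actual bugs had manifested. */
static inline int value_cast_operand(int x, int y)
{
    return (gint_like_t)x && y;
}

static inline int value_cast_whole(int x, int y)
{
    return (int)(gint_like_t)(x && y);
}
```

The two forms differ only in the type of the resulting expression, which matters once the type behind `bool_t` stops being a plain integer.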
Field G. Van Zee
5cbdbe495f Replaced broken ref99 sandbox w/ simpler version.
Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
  API changes over the last two years. Rather than try to fix it, I've
  replaced it with a much simpler version based on var2 of gemmsup.
  Why not fix the previous implementation? It occurred to me that the
  old implementation was trying to be a lightly simplified duplication
  of what exists in the framework. Duplication aside, this sandbox
  would have worked fine if it had been completely independent of the
  framework code. The problem was that it was only partially
  independent, with many function calls calling a function in BLIS
  rather than a duplicated/simplified version within the sandbox. (And
  the reason I didn't make it fully independent to begin with was that
  it seemed unnecessarily duplicative at the time.) Maintaining two
  versions of the same implementation is problematic for obvious
  reasons, especially when it wasn't even done properly to begin with.
  This explains the reimplementation in this commit. The only catch is
  that the newer implementation is single-threaded only and does not
  perform any packing on either input matrix (A or B). Basically, it's
  only meant to be a simple placeholder that shows how you could plug
  in your own implementation. Thanks to Francisco Igual for reporting
  this brokenness.
- Updated the three reference gemmsup kernels (defined in
  ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
  conjugation of conja and/or conjb. The general storage kernel, which
  is currently identical to the column-storage kernel, is used in the
  new ref99 sandbox to provide basic support for all datatypes
  (including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
  and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
  sandbox implementation is based).
2020-08-03 11:23:40 +05:30
Giorgos Margaritis
7db89fe91d Update Multithreading.md 2020-08-03 11:23:40 +05:30
Field G. Van Zee
4f5b014c05 Added missing rv_d?x6 edge cases to sup kernel.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
  various n = 6 edge cases with a single sup kernel call. Previously,
  only n = {4,2,1} were handled explicitly as single kernel calls;
  that is, cases where n = 6 were previously being executed via two
  kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
2020-08-03 11:23:40 +05:30
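The saving described in 4f5b014c05 can be illustrated by counting kernel calls for a leftover width. This is only a sketch: the greedy largest-first decomposition and the names `count_edge_calls`, `widths_before`, and `widths_after` are assumptions made for illustration, not the dispatch logic used in the actual kernel file.

```c
#include <assert.h>
#include <stddef.h>

/* Count the kernel calls needed to cover n_left columns, always using
   the largest available width first. The widths array must be sorted
   in descending order and end with 1 so that every n_left is covered. */
static int count_edge_calls(int n_left, const int *widths, size_t n_widths)
{
    assert(n_left >= 0 && n_widths > 0 && widths[n_widths - 1] == 1);

    int calls = 0;
    for (size_t i = 0; i < n_widths; ++i) {
        while (n_left >= widths[i]) {
            n_left -= widths[i];
            ++calls;
        }
    }
    return calls;
}

/* Widths handled as single calls before and after the commit. */
static const int widths_before[] = { 4, 2, 1 };
static const int widths_after[]  = { 6, 4, 2, 1 };
```

For `n_left = 6`, the first table needs two calls (4 and 2) and the second needs one, matching the case described in the commit message.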
Field G. Van Zee
3eef698711 Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
  BLIS_INLINE instead of the static keyword. This allows blis.h to
  use a different keyword (inline) to define these functions when
  compiling with C++, which might otherwise trigger "defined but
  not used" warning messages. Thanks to Giorgos Margaritis for
  reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
  hardware auto-detection facility, to unconditionally #define
  BLIS_INLINE to the static keyword (since we know BLIS will be
  compiled with C, not C++):
    build/detect/config/config_detect.c
    frame/base/bli_arch.c
    frame/base/bli_cpuid.c
- CREDITS file update.
2020-08-03 11:23:40 +05:30
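The macro approach described in 3eef698711 might look roughly like the following. This is a sketch under assumptions: `DEMO_INLINE` and `demo_is_square` are hypothetical names, and the actual definition of BLIS_INLINE in blis.h may differ.

```c
#include <assert.h>

/* Pick the keyword per language: 'static' for C, 'inline' for C++,
   where unused static functions in a header may trigger
   "defined but not used" warnings. */
#ifdef __cplusplus
  #define DEMO_INLINE inline
#else
  #define DEMO_INLINE static
#endif

/* A header-style helper that many including files never call. */
DEMO_INLINE int demo_is_square(int m, int n)
{
    assert(m >= 0 && n >= 0);
    return m == n;
}
```

Files that are known to be compiled only as C, such as the auto-detection sources listed in the commit, can define the macro to `static` unconditionally.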
Field G. Van Zee
b1144a856b Added -fomit-frame-pointer option to CKOPTFLAGS.
Details:
- Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS
  variable in the following make_defs.mk files:
    config/haswell/make_defs.mk
    config/skx/make_defs.mk
  as well as comments that mention why the compiler option is needed.
  This option is needed to prevent the compiler from using the rbp
  frame register (in the very early portion of kernel code, typically
  where k_iter and k_left are defined and computed), which, as of
  1c719c9, is used explicitly by the gemmsup millikernels. Thanks to
  Devin Matthews for identifying this missing option and to Jeff
  Diamond for reporting the original bug in #417.
- The file
    config/zen/amd_config.mk
  which feeds into the make_defs.mk for both zen and zen2 subconfigs,
  was also touched, but only to add a commented-out compiler option
  (and the aforementioned explanatory comment) since that file already
  uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of
  CKOPTFLAGS.
2020-08-03 11:22:33 +05:30
Field G. Van Zee
98a744fd44 Fixed disabled edge case optimization in gemmsup.
Details:
- Fixed an inadvertently disabled edge case optimization in the two
  gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
  optimizations allow the last millikernel operation in the jr loop to
  be executed with an inflated register blocksize if it is the last
  (or only) iteration. For example, if mr=6 and nr=8 and the gemmsup
  problem is m=8, n=100, k=100. (In this case, the panel-block variant
  (var1n) is executed, which places the jr loop in the m dimension.)
  In principle, this problem could be executed as two millikernels: one
  with dimensions 6x100x100, and one as 2x100x100. However, with the
  support for inflated blocksizes in the kernel, the entire 8x100x100
  problem can be passed to the millikernel function, which will then
  execute it more favorably as two 4x100x100 millikernel sub-calls.
  Now, this optimization is disabled under certain circumstances, such
  as when multithreading. Previously, the is_mt predicate was being set
  incorrectly such that it was non-zero even when running
  single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
  moved so that the result of the optimization could have an impact on
  the assignment of loop bounds ranges to threads.
2020-08-03 11:22:33 +05:30
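The edge case optimization described in 98a744fd44 can be sketched as a small planning function. Everything here is illustrative: `jr_plan_t`, `plan_jr_loop`, and in particular the `max_extra` limit on how many leftover rows the final call may absorb are assumptions, not names or thresholds taken from bli_l3_sup_var1n2m.c.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int n_iter;    /* number of millikernel calls in the jr loop */
    int last_size; /* blocksize passed to the final call */
} jr_plan_t;

/* Plan the jr loop over a dimension of size m with register blocksize
   mr. When single-threaded, a small leftover is folded into the last
   full iteration instead of getting its own call. */
static jr_plan_t plan_jr_loop(int m, int mr, int max_extra, bool is_mt)
{
    assert(m > 0 && mr > 0 && max_extra >= 0);

    int full = m / mr;
    int left = m % mr;
    jr_plan_t plan;

    if (left == 0) {
        plan.n_iter = full;
        plan.last_size = mr;
    } else if (!is_mt && full > 0 && left <= max_extra) {
        /* Inflate the last iteration: mr + left rows in one call. */
        plan.n_iter = full;
        plan.last_size = mr + left;
    } else {
        plan.n_iter = full + 1;
        plan.last_size = left;
    }
    return plan;
}
```

With `m = 8`, `mr = 6`, and `max_extra = 2`, the single-threaded plan is one call of size 8, while the multithreaded plan is two calls (6 and 2). If `is_mt` were wrongly non-zero, the single-threaded case would silently take the second path, which is the symptom the commit fixes.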
Field G. Van Zee
908887358e Support ldims, packing in sup/test drivers.
Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
  building matrices with small or large leading dimensions, and updated
  runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
  default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
  environment variables), and for capturing output to files that encode
  both the leading dimension (small or large) and packing status into
  the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
  into test/sup/octave and updated the octave code in that consolidated
  directory to read the new output filename format (encoding ldim and
  packing). Also added comments and streamlined code, particularly in
  plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.
2020-08-03 11:22:33 +05:30
Field G. Van Zee
2d7a43d7ef Fixed incorrect link to shiftd in BLISTypedAPI.md.
Details:
- Previously, the entry for shiftd in the Operation index section of
  BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
  in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
  helping find this incorrect link.
2020-08-03 11:22:32 +05:30
Field G. Van Zee
e4d16bd3d6 CREDITS file update. 2020-08-03 11:22:32 +05:30
Isuru Fernando
51c36e8019 Expand windows instructions (#414)
* Expand windows instructions

* Windows: both static and shared don't work at the same time
2020-08-03 11:22:32 +05:30
Isuru Fernando
c7f9684384 Fix typo in FAQ 2020-08-03 11:22:32 +05:30
Field G. Van Zee
0651b466c2 Bugfixes, cleanup of sup dgemm ukernels.
Details:
- Fixed a few not-really-bugs:
  - Previously, the d6x8m kernels were still prefetching the next upanel
    of A using MR*rs_a instead of ps_a (same for prefetching of next
    upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
    that the upanels might be packed, using ps_a or ps_b is the correct
    way to compute the prefetch address.
  - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
    executed as intended even though it was based on faulty pointer
    management. Basically, in the rd_d6x8m kernel, the pointer for B
    (stored in rdx) was loaded only once, outside of the jj loop, and in
    the second iteration its new position was calculated by incrementing
    rdx by the *absolute* offset (four columns), which happened to be the
    same as the relative offset (also four columns) that was needed. It
    worked only because that loop only executed twice. A similar issue
    was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
  - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
    that it is loaded only once outside of the loops rather than
    multiple times inside the loops.
  - Changed outer loop in rd kernels so that the jump/comparison and
    loop bounds more closely mimic what you'd see in higher-level source
    code. That is, something like:
      for( i = 0; i < 6; i+=3 )
    rather than something like:
      for( i = 0; i <= 3; i+=3 )
  - Switched row-based IO to use byte offsets instead of byte column
    strides (e.g. via rsi register), which were known to be 8 anyway
    since otherwise that conditional branch wouldn't have executed.
  - Cleaned up and homogenized prefetching a bit.
  - Updated the comments that show the before and after of the
    in-register transpositions.
  - Added comments to column-based IO cases to indicate which columns
    are being accessed/updated.
  - Added rbp register to clobber lists.
  - Removed some dead (commented out) code.
  - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
  - Cleaned up whitespace (including leading ws -> tabs).
  - Moved edge case (non-milli) kernels to their own directory, d6x8,
    and split them into separate files based on the "NR" value of the
    kernels (Mx8, Mx4, Mx2, etc.).
  - Moved config-specific reference Mx1 kernels into their own file
    (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
  - Added rd_dMx1 assembly kernels, which seem marginally faster than
    the corresponding reference kernels.
  - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
    the row-oriented reference kernels for all storage combos.
2020-08-03 11:22:32 +05:30
Isuru Fernando
454047caa3 Add build instructions for Windows (#404) 2020-08-03 11:22:32 +05:30
Field G. Van Zee
713313562b Separate OS X and Windows into separate FAQs.
Details:
- Separated the unified Mac OS X / Windows frequently asked question
  into two separate questions, one for each OS.
2020-08-03 11:22:32 +05:30
Field G. Van Zee
967ddf2847 Updated sup performance graphs; added mt results.
Details:
- Reran all existing single-threaded performance experiments comparing
  BLIS sup to other implementations (including the conventional code
  path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
  showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
  (Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
2020-08-03 11:11:52 +05:30
Field G. Van Zee
573d99d05e Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
  to compile and run both single-threaded and multithreaded experiments.
  This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
  previous test/sup/octave scripts) as well as a test/sup/octave_mt
  directory (based on the previous test/supmt/octave scripts). The
  octave scripts are slightly different and not easily mergeable, and
  thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
  the previous test/supmt directory as test/sup/old/supmt.
2020-08-03 11:11:15 +05:30
Field G. Van Zee
9a02169f28 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.
2020-08-03 11:10:38 +05:30
Field G. Van Zee
a20b5e3c60 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
2020-08-03 11:10:38 +05:30
Field G. Van Zee
b2f628eab2 Fixed int-to-packbuf_t conversion error (C++ only).
Details:
- Fixed an error that manifests only when using C++ (specifically,
  modern versions of g++) to compile drivers in 'test' (and likely most
  other application code that #includes blis.h). Thanks to Ajay Panyala
  for reporting this issue (#374).
2020-08-03 11:05:17 +05:30
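Commit b2f628eab2 does not spell out the fix, but the class of error is easy to reproduce: C allows an int to be converted implicitly to an enum type, while C++ rejects it. One common remedy is an explicit cast, sketched below with a hypothetical enum (`demo_packbuf_t` and its enumerators are stand-ins, not the BLIS `packbuf_t` definition).

```c
#include <assert.h>

typedef enum {
    DEMO_BUF_A = 0,
    DEMO_BUF_B = 1,
    DEMO_BUF_C = 2,
    DEMO_BUF_COUNT
} demo_packbuf_t;

/* Convert a loop index to the enum type. Writing
       demo_packbuf_t b = i;
   compiles as C but is an error in C++; the explicit cast below is
   accepted by both languages. */
static demo_packbuf_t demo_packbuf_from_index(int i)
{
    assert(i >= 0 && i < DEMO_BUF_COUNT);
    return (demo_packbuf_t)i;
}
```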
Dipal M Zambare
25d23cdda2 Zen3 support, disabled IR, JR loop parallelization
AMD-Internal: [CPUPL-1013]

Change-Id: I859152d63d1a56519c508dfa19587f25123e08b4
2020-07-24 20:55:47 +05:30
Meghana Vankadari
23ceeff7eb Using weighted thread range partitioning for GEMMT
Details:
- Since C is triangular, in order to maintain load balance among
  threads, we need to use weighted range partitioning.

Change-Id: I03d8ff71ac7af843acd787f1389b5907b56453ee
2020-07-24 19:27:54 +05:30
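The load-balancing idea behind 23ceeff7eb can be sketched for a lower-triangular n x n matrix, where column j holds n - j stored elements. This is a simplified illustration, not the partitioning routine used in BLIS; `weighted_bounds` is a hypothetical name, and the greedy boundary choice is an assumption.

```c
#include <assert.h>

/* Split the columns of an n x n lower-triangular matrix among nt
   threads so that each thread receives roughly the same number of
   stored elements. Thread t owns columns [bounds[t], bounds[t+1]).
   The bounds array must have room for nt + 1 entries. */
static void weighted_bounds(int n, int nt, int *bounds)
{
    assert(n >= 0 && nt > 0 && bounds != 0);

    long total = (long)n * (n + 1) / 2; /* elements in the triangle */
    long cum = 0;                       /* elements assigned so far */
    int j = 0;

    bounds[0] = 0;
    for (int t = 1; t < nt; ++t) {
        /* Advance until the running total reaches t/nt of the area. */
        while (j < n && cum * nt < (long)t * total) {
            cum += n - j; /* weight of column j */
            ++j;
        }
        bounds[t] = j;
    }
    bounds[nt] = n;
}
```

For `n = 8` and two threads the split is columns [0, 3) and [3, 8), i.e. 21 versus 15 elements. A uniform split at column 4 would give 26 versus 10.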
Meghana Vankadari
eae55852ba set the gemmt slot to the default gemmt sup handler for reference kernels
Change-Id: Ib309aba0cb08161877fd1a720ed65222d3b303f3
2020-07-24 19:27:33 +05:30
dzambare
9c7814da1c Added support for zen3 configuration
- User can now specify the zen3 configuration;
  currently it reuses block sizes and kernels from zen2.
- Auto configuration can detect whether the zen3 config is needed and enable it.
- Added support for amd64 bundle which contains all zen platforms.
- Moved existing amd bundle to amd64 legacy.

AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
2020-07-22 18:24:26 +05:30
Meghana Vankadari
6896f927da Fixed bug in SUP code path
Details:
- Since GEMM kernel prefers row-storage, if input C matrix is in col-major order,
  entire operation is transposed. In that case uplo(c) needs to be toggled
  before kernel-variant selection.
- Disabled "bli_gemmsup_ref_var1n2m_opt_cases" inside gemmtsup.
- Updated version number to 2.2.1

Change-Id: I0a85df1141fc4a98d98ea4e0c3d42db8602fa69b
2020-07-15 19:41:24 +05:30
nprasadm
af1f9ab98d BLIS: 'zdotc_' API modified to support Fortran invocation in flang environment.
1) Added dcomplex based zdotc_ version as a function with additional parameter.
2) The (single, double, complex) datatype functions are retained as macros.
3) This modification handles the ZDOTC_ invocation from Fortran based application
   for 'double complex' datatypes.
4) The modifications are placed under macro 'AOCL_F2C'.
5) Blis, Blas Test suites verified ALL PASS with GCC and Flang
   + with and without 'AOCL_F2C' macro on Ubuntu machine.
6) Adding BLIS_EXPORT_BLAS to make the APIs visible when linking dll.

Change-Id: I4ada39a73f416e3794708f5b55e947342c261117
Signed-off-by: Meghana <Meghana.Vankadari@amd.com>, Nagendra <Nagendra.PrasadM@amd.com>
AMD-Internal: [SWLCSG-177]
2020-07-14 00:53:07 -04:00
Meghana Vankadari
6a0a65ee23 Added sup kernels and code path for gemmt similar to GEMM. GEMMT now also supports complex data types.
Details:
- Added framework code for GEMMT SUP.
- Implemented SUP for GEMMT using similar techniques as native path.
- Moved update routines to frame/util folder.
- Ported update routines for complex datatypes.

Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>,
	       Dipal M Zambare <DipalMadhukar.Zambare@amd.com>,
	       Mangala V <managala.v@amd.com>
2020-07-13 16:26:32 +05:30
Meghana Vankadari
f59d4befb5 Added framework support and interface APIs for GEMMT
Details:
- Added a new API that computes a matrix-matrix product with general matrices
  but updates only the upper or lower triangular part of the result matrix:
  cblas_?gemmt() and ?gemmt_().
- These routines are similar to the ?gemm routines, but they only access
  and update a triangular part of the square result matrix.
- Added DGEMMT functionality by reusing GEMM kernels.
- Created a new folder for GEMMT under l3, and added GEMMT specific
  framework code.
- Modified cntl_create routine to choose different macro kernel for
  GEMMT.
- Added routines to copy lower/upper triangular part of a block to the
  buffer.
- Defined BLIS, BLAS and CBLAS interface APIs for GEMMT.
- Added test_gemmt.c to test folder and Updated the Makefile.
- Added a macro 'CBLAS' in test_gemm.c to call CBLAS APIs.

Change-Id: Ie00c1a15b9c654b65c687a9ca781cbc6f9641791
2020-07-06 00:51:16 -04:00
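The semantics described in f59d4befb5 can be written as a naive reference loop. This is a sketch only: `demo_gemmt` is a hypothetical name, it assumes row-major storage with no transposition, and it does not mirror the signatures of the BLIS, BLAS, or CBLAS interfaces added by the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compute C := beta*C + alpha*A*B, but read and write only the lower
   (or upper) triangle of the n x n matrix C. A is n x k and B is
   k x n; all matrices are row-major and contiguous. */
static void demo_gemmt(bool lower, size_t n, size_t k, double alpha,
                       const double *a, const double *b,
                       double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Column range of row i that lies in the chosen triangle. */
        size_t j_begin = lower ? 0 : i;
        size_t j_end   = lower ? i + 1 : n;

        for (size_t j = j_begin; j < j_end; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = beta * c[i * n + j] + alpha * acc;
        }
    }
}
```

Elements outside the selected triangle are never touched, so whatever values they held before the call are preserved.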
Field G. Van Zee
32365b3ea5 Ensure random objects' 1-norms are non-zero.
Details:
- Fixed an innocuous bug that manifested when running the testsuite on
  extremely small matrices with randomization via the "powers of 2 in
  narrow precision range" option enabled. When the randomization
  function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
  then compute 0.0/0.0 during the normalization process, which leads to
  NaN residuals. The solution entails smarter implementations of randv,
  randnv, randm, and randnm, each of which will compute the 1-norm of
  the vector or matrix in question. If the object has a 1-norm of 0.0,
  the object is re-randomized until the 1-norm is not 0.0. Thanks to
  Kiran Varaganti for reporting this issue (#413).
- Updated the implementation of randm_unb_var1() so that it loops over
  a call to the randv_unb_var1() implementation directly rather than
  calling it indirectly via randv(). This was done to avoid the overhead
  of multiple calls to norm1v() when randomizing the rows/columns of a
  matrix.
- Updated comments.

Change-Id: I0e3d65ff97b26afde614da746e17ed33646839d1
AOCL-2.2 2.2
2020-06-19 15:40:55 +05:30
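The retry logic described in 32365b3ea5 can be sketched as a loop that re-randomizes until the 1-norm is non-zero. The names below (`demo_rng_fn`, `demo_norm1v`, `demo_randv_nonzero`, `demo_rng_zeros_first`) are hypothetical, and the pluggable generator is an assumption made so the behavior can be exercised deterministically; it is not how BLIS's randv is structured.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*demo_rng_fn)(void *state);

/* 1-norm of a vector: the sum of absolute values. */
static double demo_norm1v(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (x[i] < 0.0) ? -x[i] : x[i];
    return sum;
}

/* Fill x with random values, repeating the fill until the vector is
   not identically zero. This terminates only if the generator can
   produce a non-zero value. */
static void demo_randv_nonzero(double *x, size_t n,
                               demo_rng_fn rng, void *state)
{
    assert(x != NULL && n > 0 && rng != NULL);
    do {
        for (size_t i = 0; i < n; ++i)
            x[i] = rng(state);
    } while (demo_norm1v(x, n) == 0.0);
}

/* Example generator: returns 0.0 for the first *state calls, then 0.5. */
static double demo_rng_zeros_first(void *state)
{
    int *zeros_left = (int *)state;
    if (*zeros_left > 0) {
        --*zeros_left;
        return 0.0;
    }
    return 0.5;
}
```

For a 1x1 object this guarantees that the later normalization never divides 0.0 by 0.0.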
phakumar
ccf0772d6e BLIS library porting on to Windows:
Ported this library to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler.
AMD internal: [CPUPL-657]

Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
2020-06-16 18:29:00 +05:30
Dipal M Zambare
80b3127ff1 Added support for logging gemm input values.
Added a BLIS-specific extension to AOCL DTL that adds
support for printing the input matrix sizes from the BLIS
library.

AMD Internal: [CPUPL-806]

Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561
2020-06-15 14:21:22 +05:30
Dipal M Zambare
dad7e2f235 Added support for multiple trace levels & optimization of file size requirements
Multiple trace levels allow the user to set the nested call level
up to which traces are emitted. This also reduces file size
requirements.

Also optimized auto trace output to reduce file size by removing
thread IDs from individual lines.

AMD Internal: [CPUPL-806]

Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
2020-06-10 16:00:49 +05:30
prangana
3620e472e3 Replace back major version number variable in Makefile
Change-Id: I0f902e32085058ec618d08470793f5e5e49719b3
2020-06-10 13:11:14 +05:30
Dipal M Zambare
305c744131 Added traces in dgemm and sgemm paths.
Added traces from the blas/cblas APIs down to the kernels for dgemm and sgemm.
By default the traces are disabled; users need to enable them
in their local workspace. See the aocl_dtl/aocldtlcf.h file.

AMD Internal : CPUPL-806

Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
2020-06-08 12:01:22 +05:30
Meghana
9fce1ec4a4 Optimized SGEMV kernel and changed BLAS interface call
Details:
- Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of sgemv to remove dependency on cntx
variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for SGEMV to reduce framework overhead.
- Currently these changes are applicable for zen2 configuration.

Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-707]
2020-06-04 02:49:08 -04:00
Dipal Madhukar Zambare
8a367c993e Merge "Checking for zero dimension is moved to bli_gemm_xx call." into amd-staging-rome-2.2 2020-06-04 02:16:56 -04:00
Meghana
f4d2bb2fed Enabled AOCC specific flags for all versions of AOCC compiler
Change-Id: Icad0ff1c1858c1762792ba8f2c5c3e846909cbb5
2020-06-03 10:50:00 -04:00
dzambare
5d57d67cb3 Checking for zero dimension is moved to bli_gemm_xx call.
This will ensure early return in case full gemm processing is not needed.

Depending on which dimension is found to be zero, the following actions are taken:

If 'c' has a zero dimension, no further processing is required.
If alpha is zero or if 'a' or 'b' has a zero dimension, we
perform a scalm operation instead of gemm (c = beta*c, since the alpha*a*b term vanishes).

Change-Id: Icc031944fc4e80138adf991974547f2d57ab570b
AMD-Internal: [CPUPL-904]
2020-06-03 16:50:11 +05:30
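The early-return rules in 5d57d67cb3 can be summarized as a small classification step. This sketch assumes the standard gemm definition C := beta*C + alpha*A*B with C of size m x n, A of size m x k, and B of size k x n; `demo_gemm_action_t` and `demo_gemm_classify` are hypothetical names, not functions from the BLIS source.

```c
#include <assert.h>

typedef enum {
    DEMO_GEMM_NOOP,    /* C is empty: nothing to compute */
    DEMO_GEMM_SCALE_C, /* product term vanishes: C := beta*C only */
    DEMO_GEMM_FULL     /* full gemm is required */
} demo_gemm_action_t;

/* Decide how much work a gemm call needs from its dimensions and
   alpha, before any packing or threading setup is done. */
static demo_gemm_action_t demo_gemm_classify(long m, long n, long k,
                                             double alpha)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    if (m == 0 || n == 0)
        return DEMO_GEMM_NOOP;
    /* With C non-empty, A or B is empty exactly when k == 0. */
    if (k == 0 || alpha == 0.0)
        return DEMO_GEMM_SCALE_C;
    return DEMO_GEMM_FULL;
}
```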
managalv
b4e599ecc2 CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices
A failure was seen in the libflame function FLASH_UDdate_UT_inc,
caused by typecasting a double complex pointer as a double pointer.

Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
2020-06-02 22:27:54 +05:30
Nallani Bhaskar
6f01cd2c54 Fix for sblat3.x failure in make check
Details:
ymm registers, which hold 8 float values, were being used where only 4 float
values were required. Changed registers from ymm to xmm in the required places.
The failure is observable only when the leading dimension is greater than the actual dimension.

Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800
2020-06-01 17:04:59 +05:30
managalv
f7bc37ea32 CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices
Details
Added Support of N SUP kernel for complex float and complex double
Removed prefetching in M SUP kernels for complex float and complex double
Removed all warnings

Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05
2020-06-01 06:24:12 -04:00
Kiran Varaganti
c8f3cec5f7 Merge "Code cleanup in 6xk DGEMM pack Kernel" into amd-staging-rome-2.2 2020-06-01 05:08:58 -04:00