Commit Graph

408 Commits

Author SHA1 Message Date
Meghana
f4d2bb2fed Enabled AOCC specific flags for all versions of AOCC compiler
Change-Id: Icad0ff1c1858c1762792ba8f2c5c3e846909cbb5
2020-06-03 10:50:00 -04:00
managalv
f7bc37ea32 CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices
Details
Added Support of N SUP kernel for complex float and complex double
Removed prefetching in M SUP kernels for complex float and complex double
Removed all warnings

Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05
2020-06-01 06:24:12 -04:00
Nallani Bhaskar
2413c31672 CPUPL-923: Implemented dot Product Kernels in SGEMM SUP for transpose cases.
Details:
Added two new kernels bli_sgemmsup_rd_zen_asm_6x16m and bli_sgemmsup_rd_zen_asm_6x16n
to support dot product in Row Major (A * Tranpose(B)) and in Column Major (Tranpose(A) * B)

Change-Id: I264fd75c4c4b68fb7dc4fd229eaa44d09e9f3432
2020-05-31 22:37:03 +05:30
prangana
0c52aaefe1 Merge branch 'ref/heads/amd-staging-rome-2.2' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging-rome-2.2
Change-Id: I46acf48354ff73fb4eaeac255132d21095ea4d98
2020-05-30 10:31:10 +05:30
Kiran Varaganti
739803a441 DGEMM Packing Kernels for Native DGEMM implementation
[CPUPL-858] Packing kernels for dgemm 6x8 kernel are added explicitly
for zen2 configuration. Apart from generic packing kernels used by level-3
routines and for all combinations of the input parameters, introduced DGEMM
specific packing kernels for the case op(A) & op(B) is no transpose. This
helps us to vectorize these packing kernels and eliminate un-necessary branch
conditional checks. The packed kernels are also optimized at the boundary.
These boundary condition optimization help when the input matrix dimensions
"m" and "n" are not multiples of register block-sizes "MR & NR".
Typical DGEMM operation is C = beta*C + alpha *op(A) * op(B). Kindly note
the multiplication with alpha is handled inside kernel, hence in these dgemm
packing routines alpha is always consider 1.0. These routines are
"bli_dpackm_8xk_nn_zen" & "bli_dpackm_6xk_nn_zen". The generic packing
routines
are "bli_dpackm_6xk_gen_zen" & bli_dpackm_8xk_gen_zen". These routines are
enabled from "bli_cntx_init_zen2()" through bli_cntx_set_packm_kers(). In this
checkout wthe generic packing kernels are enabled by default". Later will
introduce run-time mechanism to change these packing kernels based on the
DGEMM input parameters.

Change-Id: I079b4dce0757d558224cb8c55d024bfea6a4de91
2020-05-28 02:01:43 -04:00
managalv
9b09dd7d6c CPUPL-929:Improve Complex GEMM performance
Added context for CCR format

Change-Id: I81ac1b882f176235b1c48f4952ec30f44c6b138c
2020-05-23 11:10:35 +05:30
managalv
11570dbc14 CPUPL-929:Improve Complex GEMM performance
Updated BLIS_MC value and created SUP context for CCC storage format

Change-Id: I5032b29834ea545d7b5f7a9469bc5655c71b7fe5
2020-05-22 10:47:28 +05:30
Field G. Van Zee
9e76059f15 Renamed bli_thread_obarrier(), _obroadcast().
Details:
- Renamed two bli_thread_*() APIs:
    bli_thread_obarrier()   -> bli_thread_barrier()
    bli_thread_obroadcast() -> bli_thread_broadcast()
  The 'o' was a leftover from when thrcomm_t objects tracked both
  "inner" and "outer" communicators. They have long since been
  simplified to only support the latter, and thus the 'o' is
  superfluous.

Change-Id: If9ec9a2383dfb02e1cfc74918f87a1fabddbd55b
2020-05-21 11:54:37 +05:30
Nicholai Tukanov
718b64814d Add prototypes for POWER9 reference kernels (#365)
Updates and fixes to power9 subconfig.

Details:
- Register s,c,z reference gemm and trsm ukernels that assume elements
  of B have been broadcast.
- Added prototypes for level-3 ukernels that assume elements of B have
  been broadcast. Also added prototype for an spackm function that
  employs a duplication/broadcast factor of 4.
- Register virtual gemmtrsm ukernels that work with broadcasting of B.
- Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
- Thanks to Nicholai Tukanov for providing these updates.
2020-05-21 11:40:00 +05:30
managalv
f630b3fc36 CPUPL-929:Improve Complex GEMM performance
Details:
SUP support added for ZGEMM for different storage formats in M direction
SUP kernels and sub kernels are implemented to cover all dimensions of square matrix
SUP kernels supports RRR, RCR, CRR, CCR storage formats

Change-Id: I2c846a430dfcf356cac8ebf62015b1f743157381
2020-05-20 17:36:04 +05:30
Meghana Vankadari
4fcc4e499d Optimized DGEMV kernel and changed BLAS interface call
Details:
- Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of dgemv to remove dependency on cntx variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for DGEMV to reduce framework overhread.
- Currently these changes are applicable for zen2 configuration.
  They will be enabled for zen family processors in future.
- Changed naming convention for new BLAS macros to indicate their use.
- Added new optimized kernel for axpyf under zen2 folder.
- Implemented basic GEMV kernel without using axpyv or axpyf.
  This kernel is chosen for small sizes.

Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-885]
2020-05-19 06:40:44 -04:00
managalv
af1ad806f2 CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices
Details:
Supports cgemm SUP all storage formats for XXR format

Change-Id: I1f1ac6b47f0b54141acac65e2cb4f3a2aaa3bac6
2020-05-18 21:06:57 +05:30
managalv
310dda928f CPUPL-709: Improve Complex GEMM performance - Level 1 Optimization
Details
Added SUP support for cgemm in M direction
SUP kernels are 3x8m, 3x4m, 3x2m is implemeted
Sub kernels are implemented to support various dimenions
SUP CGEMM supports matrix C & A row/col major and Matrix B is row major matrix

Change-Id: Ia6854b929d3b5741a4900422d05df1257f5d014d
2020-05-18 20:43:49 +05:30
Devrajegowda, Kiran
884f2febd1 Revert "Block parameters tuning to improve sgemm performance on Rome"
Details:
    - Reverts commit 1c76723320.
    - Regression in Multi-Threaded sgemm performance with new block parameters on Rome

Change-Id: I67b050f6434f6ade2c982b3cd10aa863c0077601
Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-920]
2020-05-14 11:18:46 +05:30
Devrajegowda, Kiran
6f33fd6aac Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API
Details:
    -Kernel is called directly from API call to avoid framework
     overhead in case of single and double precisions.
    -Currently these changes are applicable only for zen2 configuration.
     They will be enabled for zen family processors in future.
    -These changes improve performance of BLAS and CBLAS interfaces of API.
     They do not affect BLIS-specific APIs.
    -setv simd kernel is added for single and double precision elements

Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a
    Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
    AMD-Internal: [CPUPL-817]
2020-05-13 01:51:48 -04:00
Devrajegowda, Kiran
4ad5b1a5e6 Update zen2 kernel context with number of level1 kernels
Details:
       - Adding missed copyv simd kernel changes while rebasing.

Change-Id: Iaedabd39fdf297fefec1e48e7d2c1a2f3d7eb08d
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 12:22:24 +05:30
Devrajegowda, Kiran
4caee59466 Adding a simd kernel for copyv function
Details:
    - Separate kernel for copyv function added to improve performance.
    - Modified cntx_init file in zen and zen2 configuration
    - Added test_copyv.c in test folder
    - Modified test/Makefile to include test_copyv.c

Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 01:55:25 -04:00
Devrajegowda, Kiran
1c76723320 Block parameters tuning to improve sgemm performance on Rome
Details:
    - Tuned block sizes to get better performance for sgemm default path.

Change-Id: I892e8642fa2d03a07a6d53537131536e6b1b091e
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-832]
2020-04-21 07:34:12 -04:00
Meghana
b846059bcf Added opt kernels for SWAPV
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
 SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c

Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
2020-04-20 01:21:44 -05:00
Vasanthakumar Rajagopal
489d501f2e Merge "Execution and Debug trace support." into amd-staging-rome-2.2 2020-04-15 06:30:50 -04:00
Meghana
e56cf63a3f Optimized "bli_dotv_zen_int10" kernels
Details:
- Fixed issues in "bli_dotv_zen_int10" kernels and optimized them.
- Changed cntx_init file to choose "bli_dotv_zen_int10" kernel for dotv
 API call.

Change-Id: Iee8d7519f3a22a2d41166390be6047e9cb37557f
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-824]
2020-04-14 09:52:57 +05:30
dzambare
d40edf7dac Execution and Debug trace support.
Added support add debug logging, execution trace and decode.

Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba
AMD-Internal: [CPUPL-806]
2020-04-07 08:48:59 +05:30
Field G. Van Zee
1a284828d1 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-03-13 01:09:29 -04:00
Meghana Vankadari
cc98047fd6 Made framework changes to initialize specific cache block sizes for TRSM.
Details:
-This commit addresses the performance optimization(single-thread and
 multi-thread) for DTRSM on zen2.
-This new optimization employs different MC, KC & NC values for TRSM than
 what is being used in other Level-3 routines like DGEMM.
-Changed TRSM framework code to choose these blocksizes for TRSM
 on zen family configurations.
-Added a new field called "trsm_blkszs" to cntx structure in order to
 store TRSM specific block sizes.
-Implemented routines to initialize, set and query the TRSM-specific
 block sizes.
-Defined a new macro "AOCL_BLIS_ZEN" in configure script.
 This macro is automatically defined for zen family architectures.
 It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes.

Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
2020-03-09 10:33:42 +05:30
Meghana Vankadari
62e00b4d64 Merge "Change in threshold condition for trsm_small kernels" into amd-staging-rome-rel-2.1 2019-12-17 23:54:01 -05:00
Meghana Vankadari
8eb264f78b Change in threshold condition for trsm_small kernels
Change-Id: I396e246b1639d300fcb94bdf7e5fa8bc8c87e994
2019-12-16 18:54:48 +05:30
Devrajegowda, Kiran
1fe8edbed0 "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2019-12-16 14:48:53 +05:30
Kiran Varaganti
1650bcb623 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2019-12-13 00:01:35 -05:00
Devrajegowda, Kiran
e4a6af33f5 Merge Selective Packing code from amd branch flame/blis
Change-Id: I6d577f67ec84febe6af3635b10e5c9c77844ccd2
2019-12-12 15:22:21 +05:30
Nallani Bhaskar
af94ba29cf Added sup support for sgemm under zen and related frame work changes.
Change-Id: Ia7e88b96d3a3617e8d24754f50db081ffe2e9955
2019-12-04 10:56:10 +05:30
prangana
e0fb039a60 Merge branch 'amd' of https://github.com/flame/blis into amd-blis-nov-mergetest
Change-Id: I59325783883d67bb33e938aea8c34d8e3d6832fb
2019-11-30 12:52:14 +05:30
Field G. Van Zee
39fa7136f4 Added support for selective packing to gemmsup.
Details:
- Implemented optional packing for A or B (or both) within the sup
  framework (which currently only supports gemm). The request for
  packing either matrix A or matrix B can be made via setting
  environment variables BLIS_PACK_A or BLIS_PACK_B (to any
  non-zero value; if set, zero means "disable packing"). It can also
  be made globally at runtime via bli_pack_set_pack_a() and
  bli_pack_set_pack_b() or with individual rntm_t objects via
  bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
  interface of either the BLIS typed or object APIs. (If using the
  BLAS API, environment variables are the only way to communicate the
  packing request.)
- One caveat (for now) with the current implementation of selective
  packing is that any blocksize extension registered in the _cntx_init
  function (such as is currently used by haswell and zen subconfigs)
  will be ignored if the affected matrix is packed. The reason is
  simply that I didn't get around to implementing the necessary logic
  to pack a larger edge-case micropanel, though this is entirely
  possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
  bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
  with corresponding headers, in which higher-level packm-related
  functions are defined for use within the sup framework. The actual
  packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
  bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
  always NULL), and pointer to a thrinfo_t* (which for nowis the address
  of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
  the millikernel can query the panel stride of the packed matrix and
  step through it accordingly. If the matrix isn't packed, the panel
  stride of interest for the given millikernel will be set to the
  appropriate value so that the mkernel may step through the unpacked
  matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
  panel strides (ps_a and ps_b, respectively) instead of computing them
  on the fly.
- Spun off the environment variable getting and setting functions into
  a new file, bli_env.c (with a corresponding prototype header). These
  functions are now used by the threading infrastructure (e.g.
  BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
  infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
  for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
  This means that the function bli_thread_init_rntm() was renamed to
  bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c function, which serves as the home for
  functions that manage the pack_a and pack_b fields of the global
  rntm_t, including from environment variables, just as we have
  functions to manage the threading fields of the global rntm_t in
  bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
  spinning off the bli_l3_thread_decorator() functions into their own
  files. This change makes more sense when considering the further
  addition of bli_l3_sup_thread_decorator() functions (for now limited
  only to the single-threaded form found in the  _single.c file).
- Explicitly initialize the reference sup handlers in both
  bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
  obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
2019-11-29 15:27:07 -06:00
Devrajegowda, Kiran
85fa9e4107 resolved merge conflicts when merged with public repo master branch
Change-Id: Iad6ba809680ba5081cc9d7879794ef58cc8f8a40
2019-11-25 14:46:48 +05:30
Meghana
5560f75c0c Modified makefiles for zen and zen2 to pick up compiler flags based on architecture and compiler versions
Change-Id: I443e47c38e0ffd12f4b303f546abd46d02aa31ca
2019-11-21 09:53:44 +05:30
Field G. Van Zee
fb8bef9982 Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().
Details:
- Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
  manifested as failures in single-precision real level-3 operations.
  Also replaced the duplication factor constants with a const-qualifed
  varialbe, dfac, so that this won't happen again.
- Changed NC for single-precision real from 4080 to 8160 so that the
  packed matrix B will have the same byte footprint in both single
  and double real.
2019-11-14 13:05:28 -06:00
Devrajegowda, Kiran
b5475f527d Adding context initialisation for SUP kernels in zen2 architecture
Change-Id: I9de533abb039d0dff348728be51554cc53679d10
2019-11-12 13:59:26 +05:30
Field G. Van Zee
bdc7ee3394 Various fixes to support packing duplication in B.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
  those operations to be cast so the structured matrix is on the left.
  symm and hemm already had such macros, but these too were renamed so
  that the macros were individual to the operation. We now have four
  such macros:
    #define BLIS_DISABLE_HEMM_RIGHT
    #define BLIS_DISABLE_SYMM_RIGHT
    #define BLIS_DISABLE_TRMM_RIGHT
    #define BLIS_DISABLE_TRMM3_RIGHT
  Also, updated the comments in the symm and hemm front-ends related to
  the first two macro guards, and added corresponding comments to the
  trmm and trmm3 front-ends for the latter two guards. (They all
  functionally do the same thing, just for their specific operations.)
  Thanks to Jeff Hammond for reporting the bugs that led me to this
  change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
  related to duplicating B during packing) to register: a packing
  kernel for single-precision real; gemmbb ukernels for s, c, and z;
  trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukrnels for s, c
  and z; and to use non-default cache and register blocksizes for s, c,
  and z datatypes. Also declared prototypes for all of the gemmbb,
  trsmbb, and gemmtrsmbb ukernel functions within the
  bli_cntx_init_haswellbb() function. This should, once applied to the
  power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
  duplication factor of 4. This function is defined in the same file as
  bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
2019-11-11 15:47:17 -06:00
Field G. Van Zee
c84391314d Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
  directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
2019-11-04 13:57:12 -06:00
Nicholai Tukanov
b426f9e04e POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel 
  assumes that elements of B have been duplicated/broadcast during the
  packing step. The microkernel uses a column orientation for its 
  microtile vector registers and thus implements column storage and 
  general stride IO cases. (A row storage IO case via in-register
  transposition may be added at a future date.) It should be noted that 
  we recommend using this microkernel with gcc and *not* xlc, as issues 
  with the latter cropped up during development, including but not 
  limited to slightly incompatible vector register mnemonics in the GNU 
  extended inline assembly clobber list.
2019-11-01 17:57:03 -05:00
Kiran Varaganti
c3d4464f03 Removed extra 'endif' statement causing build failures for zen configuration - Fixed now
Change-Id: Ia7f164209124ffae5c70e1ff7c3d131cd44b9294
2019-10-24 14:56:04 +05:30
Devrajegowda, Kiran
4158e7fffe missed changes while rebasing field's SUP code
Change-Id: I560b93c42901ca2bbd4c22e833f55ba884a38a50
2019-10-23 10:33:43 +05:30
Field G. Van Zee
6218ac95a5 Merge branch 'master' into amd 2019-10-11 11:53:51 -05:00
Field G. Van Zee
0016d541e6 Changed -march=znver2 to =znver1 for clang on zen2.
Details:
- In config/zen2/make_defs.mk, changed the -march= flag so that
  -march=znver1 is used instead of -march=znver2 when CC_VENDOR is
  clang. (The gcc branch attempts to differentiate between various
  versions, but the equivalent version cutoffs for clang are not
  yet known by us, so we have to use a single flag for all versions
  of clang. Hopefully -march=znver1 is new enough. If not, we'll
  fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
  This issue was discovered thanks to AppVeyor.
2019-10-11 11:09:44 -05:00
Field G. Van Zee
e94a0530e5 Corrected zen NC that was non-multiple of NR.
Details:
- Updated an incorrectly set cache blocksize NC for single real within
  config/zen/bli_cntx_init_zen.c that was non a multiple of the
  corresponding value of NR. This issue, which was caught by Travis CI,
  was introduced in 29b0e1e.
2019-10-11 10:48:27 -05:00
Field G. Van Zee
29b0e1ef4e Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
  into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
  inadvertantly not incremented when the Zen2 subconfiguration was
  added.
- In bli_gemm_front(), added a missing conditional constraint around the
  call to bli_gemm_small() that ensures that the computation precision
  of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
  that existed around the call to bli_syrk_small() into bli_syrk_small()
  to minimize the calling code footprint and also to bring that code
  into stylistic harmony with similar code in bli_gemm_front() and
  bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
  proper accessor static functions (e.g. 'a->dim[0]' becomes
  'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
  bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
  strictly speaking unnecessary, but it serves as a useful visual cue to
  those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
  bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
  version check for availability of -march=znver2, and added appropriate
  support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
  config/zen/amd_config.mk, including: removal of -march=znver1 et al.
  from CKVECFLAGS (since the -march flag is added within make_defs.mk);
  setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
  added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
  set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
2019-10-11 10:24:24 -05:00
Kiran Varaganti
ea25ba255a Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code clean
Change-Id: I6827b58d2dab1041fe182fef5a007b679ac4bb1f
2019-09-19 00:13:35 +05:30
Field G. Van Zee
31c8657f1d Added support for pre-broadcast when packing B.
Details:
- Added support for being able to duplicate (broadcast) elements in
  memory when packing matrix B (ie: the left-hand operand) in level-3
  operations. This turns out advantageous for some architectures that
  can afford the cost of the extra bandwidth and somehow benefit from
  the pre-broadcast elements (and thus being able to avoid using
  broadcast-style load instructions on micro-rows of B in the gemm
  microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
  hemm_r is implemented in terms of hemm_l (and symm_r in terms of
  symm_l). This is needed when broadcasting during packing because the
  alternative--supporting the broadcast of B while also allowing matrix
  B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
  (as well as for general-purpose buffers). In addition, we support
  byte offsets from those alignment values (which is different from
  aligning by align+offset bytes to begin with). The default alignment
  values are BLIS_PAGE_SIZE in all four cases, with the offset values
  defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
  into the packm kernel, where it will be needed by packm kernels that
  perform broadcasts of B, since the idea is that we *only* want to
  broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
  used to set custom virtual level-3 microkernels in the cntx_t, which
  would typically be done in the bli_cntx_init_*() function defined in
  the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
  defined in in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels
  defined in ref_kernels/3/bb. (These kernels have been tested with
  double real with NP/NR = 12/6.)
- Added #ifndef ... #endif guards around several macro constants defined
  in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
  frame/include/level0/bb for use by "broadcast B"-style packm reference
  kernels. For now, only the real domain kernels are tested and fully
  defined.
- Output the alignment and offset values for packed blocks of A and B
  in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
2019-09-17 17:42:10 -05:00
Devin Matthews
138d403b6b Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331) 2019-08-26 18:11:27 -05:00
Kiran Varaganti
b5eb348734 config/zen/bli_cntx_init_zen.c: removed BLIS_ENBLE_ZEN_BLOCK_SIZES macro. We have different configurations for both zen and zen2
config/zen/bli_family_zen.h: deleted macro BLIS_ENBLE_ZEN_BLOCK_SIZES
config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H.
test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
Now we can use single configuration has ./configure amd64 - this will work both for ROME & Naples

Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2
2019-08-23 14:18:55 +05:30
Kiran Varaganti
d605a19e42 config_registry: New AMD zen2 architecture configuration added.
frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS]
  frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...).
  frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..).
  frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif
  frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif
  frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20

Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866
2019-08-23 14:18:55 +05:30