Commit Graph

143 Commits

Author SHA1 Message Date
Eashan Dash
e4e4fe55fb Added Parameter Checks and DTL Trace for Extension APIs
1. Added input parameter checking for the extension APIs
   1. gemm_pack_get_size API
   2. gemm_pack API

2. Additionally added early returns for these APIs when
   m or n dimensions are 0.

3. Routines for input parameter check for all the 3
   BLAS extension APIs - gemm_pack_get_size, gemm_pack and
   gemm_compute are defined in:
   frame/compat/check/bla_gemm_pack_compute_check.h

4. Added AOCL DTL TRACE for all the functions of
   1. gemm_pack_get_size
   2. gemm_pack
   3. gemm_compute

AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
2023-11-09 18:53:59 +05:30
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Edward Smyth
9500cbee63 Code cleanup: spelling corrections
Corrections for some spelling mistakes in comments.

AMD-Internal: [CPUPL-3519]
Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba
2023-11-09 00:16:30 -05:00
Eashan Dash
c3d1a3878c Parallelized Pack and Compute Extension APIs
1. OpenMP based multi-threading parallelism is added for BLAS
   extension APIs of Pack and Compute

2. Both pack and compute APIs are parallelized.

3. Multi-threading of pack and compute APIs done with different
   number of threads can lead to inconsistent results due to
   output difference of the full packed matrix buffer when packed
   with different number of threads.

4. In multi-threaded execution, we ensure output of packed buffer
   is exactly the same as in single threaded execution.

5. Similarly for compute API, read of packed buffer in multi-
   threaded execution is exactly the same as in single-threaded
   execution.

6. Routines are added to compute the offsets for thread workload
   distribution for MT execution.
   1. The offsets are calculated in such a way that it resembles
      the reorder buffer traversal in single threaded reordering.
   2. The panel boundaries (KCxNC) remain as it is accessed in
      single thread, and as a consequence a thread with jc_start
      inside the panel cannot consider NC range for reorder.
   3. It has to work with NC' < NC, and the offset is calulated
      using prev NC panels spanning k dim + cur NC panel spaning
      pc loop cur iteration + (NC - NC') spanning current
      kc0 (<= KC).

7. Routines to ensure the same are added for MT execution
   1. frame/base/bli_pack_compute_utils.c
   2. frame/base/bli_pack_compute_utils.h

AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
2023-11-03 08:47:17 -04:00
Edward Smyth
6d0444497f Improvements to xerbla functionality
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
  setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
  is controlled by setting the environment variable
  BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
  assuming xerbla was not set to stop on error. Example call is

     info = bli_info_get_info_value();

The default behaviour remains to print but don't stop on error,
i.e. the equivalent to

     export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0

Implementation details:
- Values of the environment variables are stored and retrieved
  from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
  to 0 during initialization for all calls and updated by xerbla
  if an error has occurred.
- Call to bli_init_auto before calling PASTEBLACHK macro (which
  calls xerbla) will reinitialize info_value to 0 via call to
  bli_thread_update_rntm_from_env

AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
2023-10-16 08:48:51 -04:00
Arnav Sharma
c8f14edcf5 BLAS Extension API - ?gemm_compute()
- Added support for 2 new APIs:
	1. sgemm_compute()
	2. dgemm_compute()
  These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
  APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
  packed matrix identifier) and performs the GEMM operation:
  C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
  scheme isn't matching, and the respective matrix being loaded isn't
  packed either, on-the-go packing has been enabled for such cases to
  pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
  it is the responsibility of the user to pack only one matrix with
  alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
  and compute APIs are forced to take n_threads=1.

AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
2023-10-16 08:18:52 -04:00
Vignesh Balasubramanian
81161066e5 Multithreading the DNRM2 and DZNRM2 API
- Updated the bli_dnormfv_unb_var1( ... ) and
  bli_znormfv_unb_var1( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Added the logic to distribute the job among the threads such
  that only one thread has to deal with fringe case(if required).
  The remaining threads will execute only the AVX-2 code section
  of the computational kernel.

- Added reduction logic post parallel region, to handle overflow
  and/or underflow conditions as per the mandate. The reduction
  for both the APIs involve calling the vectorized kernel of
  dnormfv operation.

- Added changes to the kernel to have the scaling factors and
  thresholds prebroadcasted onto the registers, instead of
  broadcasting every time on a need basis.

- Non-unit stride cases are packed to be redirected to the
  vectorized implementation. In case the packing fails, the
  input is handled by the fringe case loop in the kernel.

- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
  and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
  cases of size = 2 ( and ) size = 1 or non-unit strides respectively.

AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
2023-10-16 07:26:27 -04:00
eashdash
30bdeecbcc Added BLAS Extension APIs - Get Size and Pack API
1. 4 new APIs are added to support packed compute GEMM operations
   1. dgemm_pack_get_size
   2. sgemm_pack_get_size
   3. dgemm_pack
   4. sgemm_pack

2. Pack_get_size API
   1. Returns size in bytes required for packing of input
   2. Requires identifier to identify the input matrix to
      be packed
   3. Additionally requires 3 integer parameters for input
      dimensions

3. Packed buffer is allocated using the pack size computed

4. Pack API:
   1. Performs full matrix packing of the input
   2. Additionally, performs the alpha scaling
   3. Packed buffer created contains the full packed matrix

5. The GEMM compute calls are required to be operated on the
   packed buffer with alpha = 1 since alpha scaling is already
   done by the Pack API

6. GEMM Pack API eliminate the cost of packing the input matrixes
   by avoiding on the go pack in the GEMM 5 loop.
   Packing of input matrixes are done when there is resue of
   matrixes across different GEMM calls.

AMD-Internal: [CPUPL-3560]
Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3
2023-10-04 06:43:59 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Shubham Sharma
0000cc88de Removed local copy of cntx in TRSM
- TRSM and GEMM has different blocksizes in zen4, in order
  to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
  stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
  are removed. Instead of overriding the default blocksizes,
  TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
  and GEMM pack buffers have to be packed with default blocksizes.
  To check if we are packing for TRSM, "family" argument is added
  in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
  it is not set then BLIS_GEMM_UKR has to be used. This functionality
  has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
  to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
  signature of bli_packm_init_pack.

AMD-Internal: [CPUPL-3781]

Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
2023-08-16 08:09:01 -04:00
Harihara Sudhan S
9272d3c778 Bug fix in work load distribution among the given threads
- In level-1 kernels, with multi-threading enabled, only the partial
  job was getting executed.
- The bug was in bli_thread_vector_partition and occurred only
  when minimum work for a thread >= 1 i.e., when the number of threads
  launched is less than number of elements and the number of elements
  is not a multiple of the number of threads launched.

AMD-Internal: [CPUPL-3231]
Change-Id: Ie20abb93468282cd6ac2372267714fb80c26d7cc
2023-04-18 10:16:09 -04:00
Harihara Sudhan S
32bbd96652 Moving AOCL Dynamic logic from BLIS impli layer
Threading related changes
--------------------------

- Created function bli_nthreads_l1 that dispatches the AOCL dynamic
  logic for a L1 function based on the kernel ID and input datatypes.
- bli_nthreads_l1 gets the number of threads to be launched from the
  rntm variable.
- Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and
  DDOTV. This function contains the AOCL dynamic logic for the
  respective kernels.
- Added handling for cases when number of elements (n) is less than
  number of threads spawned (nt) in AOCL dynamic.
- Added function bli_thread_vector_partition that calculates the
  amount of work the calling thread is supposed to perform on a
  vector.

Interface changes
-----------------

- In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick
  kernel based on architecture ID and removed AVX2 flag check.
- Modified function signature of ZDSCALV. Alpha is passed as dcomplex
  and only the real part of the alpha passed is used inside the kernel.
  The change was done to facilitate kernel dispatch based on arch ID.
- Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV.
  Without this multithreaded code might crash because of minimum work
  calculation.

Misc
-----

- Removed unused variables from ZSCAL2V and AXPYV kernels.

AMD-Internal: [CPUPL-3095]
Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9
2023-04-14 02:05:38 -04:00
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurances of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Edward Smyth
2592774fe8 BLIS: Nested Parallelism issues (3)
Bugfix for parallel BLAS1 and BLAS2 routines. Threading information
was not being set correctly when initializing local rntm from global.

Also ensure th_rntm is initialized along with global_rntm by updating
it in bli_thread_init(), called by bli_init_once()

AMD-Internal: [CPUPL-2433]
Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf
2022-12-13 12:38:40 -05:00
Edward Smyth
345aacf806 BLIS: Nested parallelism issues (2)
Improvements to recent parallelism changes:

1. BLIS specific threading options: In bli_thread_update_rntm_from_env()
   set threading variables in tl_rntm to serial values when OpenMP level
   for parallelism within BLIS will not be active. User supplied BLIS
   threading values remain unchanged in global_rntm.

2. Simplify code structure in bli_thread_update_rntm_from_env().

3. Change variable declarations in bli_thread_init_rntm_from_env()
   and bli_thread_update_rntm_from_env() to avoid unused variable
   warnings in non-OpenMP builds.

AMD-Internal: [CPUPL-2433]
Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd
2022-12-07 04:34:07 -05:00
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Sireesha Sanga
22af681a11 Runtime Thread Control Feature Update
Details:
1.  Runtime Thread Control Feature is enhanced to create a provision
    for the application to allocate a different number of threads to
    BLIS from the number of threads application is using for itself.

2.  In the previous implementation, if application sets BLIS_NUM_THREADS
    with a valid value, BLIS internally calls omp_set_num_threads() API with
    same value. Due to this, application could not differentiate between
    the number of threads used in BLIS library and the application.

3.  With the current solution, if Application wants to allocate
    different number of threads for BLIS API and application, Application
    can choose either BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API for BLIS,
    and OpenMP APIs or environment variables for itself, respectively.

4.  If BLIS_NUM_THREADS is set with a valid value, same value
    will be used in the subsequent parallel regions unless
    bli_thread_set_num_threads() API is used by the Application
    to modify the desired number of threads during BLIS API execution.

5.  Once BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API is used by the application,
    BLIS module would always give precedence to these values. BLIS API would
    not consider the values set using OpenMP API omp_set_num_threads(nt) API
    or OMP_NUM_THREADS environment variable.

6.  If BLIS_NUM_THREADS is not set, then if Application is multithreaded and
    issued omp_set_num_threads(nt) with desired number of threads,
    omp_get_max_threads() API will fetch the number of threads set earlier.

7.  If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called
    by the application, but only OMP_NUM_THREADS is set,
    omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS.

8.  If both environment variables are not set, or if they are set with
    invalid values, and omp_set_num_threads(nt) is not issued by
    application, omp_get_max_threads() API will return the number of the
    cores in the current context.

9.  BLIS will initialize rntm->num_threads with the same value.
    However if omp_set_nested is false - BLIS APIs called from parallel
    threads will run in sequential. But if nested parallelism is enabled
    Then each application will launch MT BLIS.

10. Order of precedence used for number of threads:
      0. value set using bli_thread_set_num_threads(nt) by the application
      1. valid value set for BLIS_NUM_THREADS environment variable
      2. omp_set_num_threads(nt) issued by the application
      3. valid value set for OMP_NUM_THREADS environment variable
      4. Number of cores

11. If nt is not a valid value for omp_set_num_threads(nt) API,
    number of threads would be set to 1.
    omp_get_max_threads() API will return 1.

12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled.

AMD-Internal: [CPUPL-2342]
Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6
2022-08-19 05:43:01 -04:00
mkadavil
31f8820bab Bug fixes for open mp based multi-threaded GEMM/GEMMT SUP path.
- auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set
irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at
runtime. Currently the auto_factor is enabled if num_threads > 0 and not
reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm.
This results in gemm/gemmt SUP path applying 2x2 factorization of
num_threads, and thereby modifying the preset factorization. This issue
is not observed in native path since factorization happens without
checking auto_factor value.
- Setting omp threads to n_threads using omp_set_num_threads after the
global_rntm n_threads update in bli_thread_set_num_threads. This ensures
that in bli_rntm_init_from_global, omp_get_max_threads returns the same
value as set previously.

AMD-Internal: [CPUPL-2137]
Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426
2022-05-17 18:10:40 +05:30
Dipal M Zambare
e712ffe139 Added AOCL progress support for BLIS
-- AOCL libraries are used for lengthy computations which can go
     on for hours or days, once the operation is started, the user
     doesn’t get any update on current state of the computation.
     This (AOCL progress) feature enables user to receive a periodic
     update from the libraries.
  -- User registers a callback with the library if it is interested
     in receiving the periodic update.
  -- The library invokes this callback periodically with information
     about current state of the operation.
  -- The update frequency is statically set in the code, it can be
     modified as needed if the library is built from source.
  -- These feature is supported for GEMM and TRSM operations.

  -- Added example for GEMM and TRSM.
  -- Cleaned up and reformatted test_gemm.c and test_trsm.c to
     remove warnings and making indentation consistent across the
     file.

AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
2022-05-17 18:10:39 +05:30
Sireesha Sanga
6a2c4acc66 Runtime Thread Control using OpenMP API
Details:
-  During runtime, Application can set the desired number of threads using
   standard OpenMP API omp_set_num_threads(nt).
-  BLIS Library uses standard OpenMP API omp_get_max_threads() internally,
   to fetch the latest value set by the application.
-  This value will be used to decide the number of threads in the subsequent
   BLAS calls.
-  At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable
   will be given precedence, over the OpenMP standard API omp_set_num_threads(nt)
   and OMP_NUM_THREADS environment variable.
-  Order of precedence followed during BLIS Initialization is as follows
	1. Valid value of BLIS_NUM_THREADS
	2. omp_set_num_threads(nt)
	3. valid value of OMP_NUM_THREADS
	4. Number of cores
-  After BLIS initialization, if the Application issues omp_set_num_threads(nt)
   during runtime, number of threads set during BLIS Initialization,
   is overridden by the latest value set by the Application.
-  Existing precedence of BLIS_*_NT environment variables and the decision of
   optimal number of threads over the number of threads derived from the
   above process remains as it is.

AMD-Internal: [CPUPL-2076]

Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8
2022-05-17 18:09:22 +05:30
mkadavil
457c33a601 Eliminating barriers in SUP path when matrices are not packed.
-Current gemm SUP path uses bli_thrinfo_sup_grow, bli_thread_range_sub
to generate per thread data ranges at each loop of gemm algorithm.
bli_thrinfo_sup_grow involves usage of multiple barriers for cross
thread synchronization. These barriers are necessary in cases where
either the A or B matrix are packed for centralized pack buffer
allocation/deallocation (bli_thread_am_ochief thread).

-However for cases where both A and B matrices are unpacked, these
barrier are resulting in overhead for smaller dimensions. Here creation
of unnecessary communicators are avoided and subsequently the
requirement for barriers are eliminated when packing is disabled for
both the input matrices in SUP path.

Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa
2022-01-06 01:56:43 -05:00
Kiran Varaganti
d26089c665 Multi-threaded BLIS - OpenMP
Apart from "BLIS_NUM_THREADS" or OMP_NUM_THREADS, number of threads can also be set by the application
by calling omp_set_num_threads(int ); In the function "bli_thread_init_rntm_from_env()" when environment variabes
are not set, number of threads is inferred by calling the API - omp_get_max_threads().
Now by default if OMP_NUM_THREADS or BLIS_NUM_THREADS are not set - it will run with omp_get_max_threads() threads.
This feature is only enabled when BLIS is configured with openmp parallelization.

Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6
2021-06-17 05:17:37 -04:00
Kiran Varaganti
c2abbcab96 Fix dgemm_ Multi-thread running as Single Thread
Details:
When parallelization is enabled in BLIS through enviroment varaibles BLIS_?C_NT or
BLIS_?R_NT - dgemm_ is running as Single thread. This is fixed.
Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set num_threads paramenter in rntm is -1
irrespective of BLIS_IC_NT or BLIS_JC_NT values, as a result in dgemm_ interface it assumes single thread and calls
small_gemm which ends up running sequentially.
Fix: added a new function bli_thread_is_parallel() in bli_thread.c it returns 1 if parallelization is enabled either through BLIS_?C_NT values or
BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or sequential one.
Add fix for zgemm_ also.

Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573
2021-06-15 12:14:11 +05:30
lcpu
7401effc03 BLIS:merge:
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
Field G. Van Zee
09bd4f4f12 Add err_t* "return" parameter to malloc functions.
Details:
- Added an err_t* parameter to memory allocation functions including
  bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
  bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
  already use the return value to return the allocated memory address,
  they can't communicate errors to the caller through the return value.
  This commit does not employ any error checking within these functions
  or their callers, but this sets up BLIS for a more comprehensive
  commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
  bli_type_defs.h. This was done so that what remains of bli_malloc.h
  can be included after the definition of the err_t enum. (This ordering
  was needed because bli_malloc.h now contains function prototypes that
  use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
  bli_param_macro_defs.h. These functions provide easy checks for error
  codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
  the API for bli_malloc_user(), which is an exported symbol in the
  shared library. However, it's quite possible that the only application
  that calls bli_malloc_user()--indeed, the reason it is was marked for
  symbol exporting to begin with--is the BLIS testsuite. And if that's
  the case, this breakage won't affect anyone. Nonetheless, the "major"
  part of the so_version file has been updated accordingly to 4.0.0.
2021-03-31 17:09:36 -05:00
Field G. Van Zee
3a6f41afb8 Renamed membrk files/vars/functions to pba.
Details:
- Renamed the files, variables, and functions relating to the packing
  block allocator from its legacy name (membrk) to its current name
  (pba). This more clearly contrasts the packing block allocator with
  the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
  caused the function to erroneously change the value of the pack_a
  field of the global rntm_t instead of the pack_b field. (Apparently
  nobody has used this API yet.)
- Comment updates.
2021-03-27 17:22:14 -05:00
Field G. Van Zee
36cb4116d1 Switch allocator mutexes to static initialization.
Details:
- Switched the small block allocator (sba), as defined in bli_sba.c and
  bli_apool.c, to static initialization of its internal mutex. Did a
  similar thing for the packing block allocator (pba), which appears as
  global_membrk in bli_membrk.c.
- Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex()
  to ensure they won't be used in the future.
- In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp
  blocks guarded by BLIS_USE_PTHREAD_MUTEX.
2021-03-27 15:15:09 -05:00
Field G. Van Zee
a4b73de84c Disabled _self() and _equal() in bli_pthread API.
Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
  introduced in d479654. These functions were disabled after I realized
  that they aren't actually needed yet. Thanks to Devin Matthews for
  helping me reason through the appropriate consumer code that will
  appear in BLIS (eventually) in a future commit. (Also, I could never
  get the Windows branch to link properly in clang builds in AppVeyor.
  See the comment I left in the code, and #485, for more info.)
2021-03-12 19:47:39 -06:00
Field G. Van Zee
f9d604679d Added _self() and _equal() to bli_pthread API.
Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
  and pthread_equal(). Implemented these two functions for all three cpp
  branches present within bli_pthread.c: systemless, Windows, and
  Linux/BSD.
2021-03-12 19:47:39 -06:00
Field G. Van Zee
fa9b3c8f6b Shuffled code in Windows branch of bli_pthreads.c.
Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
  defines the bli_pthreads API in terms of Windows API calls. Also added
  missing comments that mark sections of the API, which brings the code
  into harmony with other cpp branches (as well as bli_pthread.h).
2021-03-11 15:13:51 -06:00
Field G. Van Zee
95d4f3934d Moved cpp macro redef of strerror_r to bli_env.c.
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
  (in terms of strerror_s) from bli_thread.h to bli_env.c. It was
  likely left behind in bli_thread.h in a previous commit, when code
  that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
  find any other instance of strerror_r being used in BLIS, so I moved
  the #define directly to bli_env.c rather than place it in bli_env.h.)
  The code that uses strerror_r is currently disabled, though, so this
  commit should have no affect on BLIS.
2021-03-11 13:50:40 -06:00
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
 2. CMake script changes for build test and testsuite based on the lib type ST/MT
 3. CMake script changes for testcpp and blastest
 4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Field G. Van Zee
11dfc176a3 Reorganized thread auto-factorization logic.
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
  guts were factored out into "fast" and "slow" variants. Then added
  logic to the "fast" variant that allows for more optimal thread
  factorizations in some situations where there is at least one factor
  of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
  added comments to that file describing BLIS_THREAD_RATIO_? and
  BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
  macros not used in vanilla BLIS and removed the unused macro
  BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
  and bli_trsm_front.c. (These branches of small matrix handling have
  not been reviewed by vanilla BLIS developers.)
- Added commented-out calls printf() to bli_rntm.c.
- Whitespace changes to bli_thread.c.
2020-12-01 19:51:27 +00:00
Field G. Van Zee
64856ea5a6 Auto-reduce (by default) prime numbers of threads.
Details:
- When requesting multithreaded parallelism by specifying the total
  number of threads (whether it be via environment variable, globally at
  runtime, or locally at runtime), reduce the number of threads actually
  used by one if the original value (a) is prime and (b) exceeds a
  minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
  to 11 by default. If, when specifying the total number of threads (and
  not the individual ways of parallelism for each loop), prime numbers
  of threads are desired, this feature may be overridden by defining the
  BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
  corresponds to the configuration family targeted at configure-time.
  (For now, there is no configure option(s) to control this feature.)
  Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
  bool that determines whether an integer is prime. This function is
  implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
  with unrelated minor edits.
2020-11-23 16:54:51 -06:00
Field G. Van Zee
9bb23e6c2a Added support for systemless build (no pthreads).
Details:
- Added a configure option, --[enable|disable]-system, which determines
  whether the modest operating system dependencies in BLIS are included.
  The most notable example of this on Linux and BSD/OSX is the use of
  POSIX threads to ensure thread safety for when application-level
  threads call BLIS. When --disable-system is given, the bli_pthreads
  implementation is dummied out entirely, allowing the calling code
  within BLIS to remain unchanged. Why would anyone want to build BLIS
  like this? The motivating example was submitted via #454 in which a
  user wanted to build BLIS for a simulator such as gem5 where thread
  safety may not be a concern (and where the operating system is largely
  absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
  the implementation of bli_clock() unconditionally returns 0.0 instead
  of the time elapsed since some fixed point in the past. The reasoning
  for this is that if the operating system is truly minimal, the system
  function call upon which bli_clock() would normally be implemented
  (e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
  to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
  from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
  and BLISTypedAPI.md, with a note that both are non-functional when
  BLIS is configured with --disable-system.
2020-11-16 15:55:45 -06:00
Kiran Varaganti
fb0b4b57c1 Fixed missing changes
Corrections in bli_gemm_front.c, taken the corrections from both public repo and 2.2.1 branch
[CPUPL-1067]

Change-Id: I4887ece6aa20bdfb87d97e7acebbe04cb9feea02
2020-08-13 00:29:37 +05:30
dzambare
267a959af1 Rebased amd-staging-milan-3.0 branch on master
-- Rebased on top of master commit # 6e522e5823
  -- Updated merged code to remove duplicated code added by auto-merging
  -- Updated merged code to rename bool_t type
  -- Updated merged code to rename bli_thread_obarrier
  -- Updated merged code to rename bli_thread_obroadcast

Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c
AMD-Internal: [CPUPL-1067]
2020-08-06 10:09:29 +05:30
Field G. Van Zee
cec95ea3a9 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.

Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
2020-08-06 10:09:28 +05:30
Field G. Van Zee
2839b88f00 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran
6b5c68b9ed "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2020-08-06 10:09:28 +05:30
Kiran Varaganti
307ddc3110 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2020-08-06 10:09:28 +05:30
Field G. Van Zee
e4d07f93db Removed export macros from all internal prototypes.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
  BLIS_EXPORT_BLIS from all function prototypes *except* those that we
  potentially wish to be exported in shared/dynamic libraries. In other
  words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
  functions that can be considered private or for internal use only.
  This is likely the last big modification along the path towards
  implementing the functionality spelled out in issue #248. Thanks
  again to Isuru Fernando for his initial efforts of sprinkling the
  export macros throughout BLIS, which made removing them where
  necessary relatively painless. Also, I'd like to thank Tony Kelman,
  Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
  participating in the initial discussion in issue #37 that was later
  summarized and restated in issue #248.
- CREDITS file update.
2020-08-03 11:47:18 +05:30
Isuru Fernando
ceda852482 Add symbol export macro for all functions (#302)
* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now
2020-08-03 11:42:15 +05:30
Field G. Van Zee
fd5db714f4 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-08-03 11:27:13 +05:30
Field G. Van Zee
8b9257df67 Cleaned up bool_t usage and various typecasts.
Details:
- Fixed various typecasts in

    frame/base/bli_cntx.h
    frame/base/bli_mbool.h
    frame/base/bli_rntm.h
    frame/include/bli_misc_macro_defs.h
    frame/include/bli_obj_macro_defs.h
    frame/include/bli_param_macro_defs.h

  that were missing or being done improperly/incompletely. For example,
  many return values were being typecast as
    (bool_t)x && y
  rather than
    (bool_t)(x && y)
  Thankfully, none of these deficiencies had manifested as actual bugs
  at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
  This reflects the fact that bli_env_get_var() needs to be able to
  return a signed integer, and even though dim_t is currently defined
  as a signed integer, it does not intuitively appear to necessarily be
  signed by inspection (i.e., an integer named "dim_t" for matrix
  "dimension"). Also, updated use of bli_env_get_var() within
  bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
  and added comments to the bli_thrcomm_*.h files that will explain a
  planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
  'bool' for 'bool_t', which will eliminate the namespace conflict with
  arm_sve.h as reported in issue #420. This commit implements the first
  phase of that transition. Thanks to RuQing Xu for reporting this
  issue.
- CREDITS file update.
2020-08-03 11:23:40 +05:30
Field G. Van Zee
3eef698711 Declare/define static functions via BLIS_INLINE.
Details:
- Updated all static function definitions to use the cpp macro
  BLIS_INLINE instead of the static keyword. This allows blis.h to
  use a different keyword (inline) to define these functions when
  compiling with C++, which might otherwise trigger "defined but
  not used" warning messages. Thanks to Giorgos Margaritis for
  reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
  hardware auto-detection facility, to unconditionally #define
  BLIS_INLINE to the static keyword (since we know BLIS will be
  compiled with C, not C++):
    build/detect/config/config_detect.c
    frame/base/bli_arch.c
    frame/base/bli_cpuid.c
- CREDITS file update.
2020-08-03 11:23:40 +05:30
Field G. Van Zee
9a02169f28 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.
2020-08-03 11:10:38 +05:30
Field G. Van Zee
a20b5e3c60 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
2020-08-03 11:10:38 +05:30
Field G. Van Zee
00e14cb6d8 Replaced use of bool_t type with C99 bool.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
  C99 bool type. A few remaining instances, such as those in the files
  bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
  bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
  used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
  C99's bool instead of bool_t, which was raised in issue #420. The first
  phase, which cleaned up various typecasts in preparation for using
  bool as the basis for bool_t (instead of gint_t), was implemented by
  commit a69a4d7. The second phase, which redefined the bool_t typedef
  in terms of bool (from gint_t), was implemented by commit 2c554c2.
2020-07-29 14:24:34 -05:00
Meghana Vankadari
23ceeff7eb Using weighted thread range partitioning for GEMMT
Details:
- Since C is triangular, in order to maintain load balance among
  threads, we need to use weighted range partitioning.

Change-Id: I03d8ff71ac7af843acd787f1389b5907b56453ee
2020-07-24 19:27:54 +05:30