Files
blis/frame/thread
Eashan Dash c3d1a3878c Parallelized Pack and Compute Extension APIs
1. OpenMP based multi-threading parallelism is added for BLAS
   extension APIs of Pack and Compute

2. Both pack and compute APIs are parallelized.

3. Multi-threading of pack and compute APIs done with different
   number of threads can lead to inconsistent results due to
   output difference of the full packed matrix buffer when packed
   with different number of threads.

4. In multi-threaded execution, we ensure output of packed buffer
   is exactly the same as in single threaded execution.

5. Similarly for compute API, read of packed buffer in multi-
   threaded execution is exactly the same as in single-threaded
   execution.

6. Routines are added to compute the offsets for thread workload
   distribution for MT execution.
   1. The offsets are calculated in such a way that it resembles
      the reorder buffer traversal in single threaded reordering.
   2. The panel boundaries (KCxNC) remain as it is accessed in
      single thread, and as a consequence a thread with jc_start
      inside the panel cannot consider NC range for reorder.
   3. It has to work with NC' < NC, and the offset is calulated
      using prev NC panels spanning k dim + cur NC panel spaning
      pc loop cur iteration + (NC - NC') spanning current
      kc0 (<= KC).

7. Routines to ensure the same are added for MT execution
   1. frame/base/bli_pack_compute_utils.c
   2. frame/base/bli_pack_compute_utils.h

AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
2023-11-03 08:47:17 -04:00
..