mirror of
https://github.com/amd/blis.git
synced 2026-05-12 18:15:37 +00:00
1. OpenMP based multi-threading parallelism is added for BLAS
extension APIs of Pack and Compute
2. Both pack and compute APIs are parallelized.
3. Multi-threading of pack and compute APIs done with different
number of threads can lead to inconsistent results due to
output difference of the full packed matrix buffer when packed
with different number of threads.
4. In multi-threaded execution, we ensure output of packed buffer
is exactly the same as in single threaded execution.
5. Similarly for compute API, read of packed buffer in multi-
threaded execution is exactly the same as in single-threaded
execution.
6. Routines are added to compute the offsets for thread workload
distribution for MT execution.
1. The offsets are calculated in such a way that it resembles
the reorder buffer traversal in single threaded reordering.
2. The panel boundaries (KCxNC) remain as it is accessed in
single thread, and as a consequence a thread with jc_start
inside the panel cannot consider NC range for reorder.
3. It has to work with NC' < NC, and the offset is calulated
using prev NC panels spanning k dim + cur NC panel spaning
pc loop cur iteration + (NC - NC') spanning current
kc0 (<= KC).
7. Routines to ensure the same are added for MT execution
1. frame/base/bli_pack_compute_utils.c
2. frame/base/bli_pack_compute_utils.h
AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac