- Disabled topology detection as libgomp is not honoring
the standard function omp_get_place_proc_ids
- Added B prefetch in bf16 B packing kernels
AMD-Internal: SWLCSG-3761
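As a rough illustration of the B-prefetch idea (not the actual bf16
packing kernel; the layout, names, and prefetch distance below are
assumptions), a packing loop can issue a software prefetch for a row of
B that will be packed a few iterations later, so the data is already in
cache when the copy reaches it:

```c
#include <stdint.h>
#include <stddef.h>

/* Rows to prefetch ahead of the current one; a tuning parameter
 * (assumed value, not taken from the real kernel). */
#define PREFETCH_DIST 8

/* Illustrative row-major B packing loop with prefetch of upcoming
 * B rows. pb receives k x nr packed elements; b has leading
 * dimension ldb. */
static void pack_b(uint16_t *restrict pb, const uint16_t *restrict b,
                   size_t k, size_t nr, size_t ldb)
{
    for (size_t i = 0; i < k; i++)
    {
        /* Hint the hardware to fetch a future row of B (read access,
         * high temporal locality). */
        if (i + PREFETCH_DIST < k)
            __builtin_prefetch(b + (i + PREFETCH_DIST) * ldb, 0, 3);

        for (size_t j = 0; j < nr; j++)
            pb[i * nr + j] = b[i * ldb + j];
    }
}
```

The prefetch is only a hint; correctness does not depend on it, so the
distance can be tuned per microarchitecture without changing results.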
- The current implementation of the topology detector assumes that
the parallel region uses all the threads queried through
omp_get_max_threads(). If the actual parallelism in the function is
limited (lower than this expectation), the code may access an
unallocated memory section (through uninitialized pointers).
- This was because every thread (each having its own pointer) sets its
initial value to NULL inside the parallel section, thereby leaving
some pointers uninitialized if the associated thread is not spawned.
- Also, the current implementation would use negative indexing (with -1)
if any associated thread was not spawned.
- Fix: set every thread-specific pointer to NULL outside the parallel
region, using calloc(). As long as we have NULL checks for pointers
before accessing through them, no issues will be observed. Also avoid
incurring the topology detection cost if all the required threads
are not spawned (thereby avoiding potential negative indexing when
using the core-group ID).
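A minimal sketch of the fix (illustrative names only; the real detector
records more than a core id): the per-thread pointer array is
zero-initialized with calloc() before the parallel region, so slots
belonging to threads that never spawn stay NULL and are skipped by the
NULL checks, and the caller can bail out when fewer threads than
expected materialized:

```c
#include <stdlib.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also builds without OpenMP. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical per-thread topology record. */
typedef struct { int core_id; } thread_topo_t;

/* Returns the number of threads that actually materialized, or -1 on
 * allocation failure. Topology data should only be trusted (and the
 * detection cost only paid) when the return value equals nt. */
static int detect_topology(int nt)
{
    /* Fix: zero-initialize every per-thread slot OUTSIDE the parallel
     * region, so slots of threads that never spawn stay NULL instead
     * of holding indeterminate values. */
    thread_topo_t **topo = calloc((size_t)nt, sizeof(*topo));
    if (topo == NULL) return -1;

    #pragma omp parallel num_threads(nt)
    {
        int tid = omp_get_thread_num();
        topo[tid] = malloc(sizeof(*topo[tid]));
        if (topo[tid] != NULL)
            topo[tid]->core_id = tid; /* placeholder for real query */
    }

    int spawned = 0;
    for (int i = 0; i < nt; i++)
    {
        /* NULL check before any access through the pointer. */
        if (topo[i] != NULL) { spawned++; free(topo[i]); }
    }
    free(topo);
    /* Caller skips topology detection when spawned < nt, which also
     * avoids the negative-indexing path entirely. */
    return spawned;
}
```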
AMD-Internal: [SWLCSG-3573]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- In multi-threaded cases, if a packed/close pattern thread-to-core
binding is used (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores|threads),
LPGEMM (OMP framework) launches threads such that threads with adjacent
ids are bound to nearby (even adjacent) cores. Depending on the
processor architecture, multiple threads with adjacent ids can be bound
to cores sharing the same last level cache. However, it was observed that
when these threads (with adjacent ids) access the B reorder buffer, the
last level cache access was suboptimal. This can be attributed to the
per-thread reorder buffer block accesses and how they map to the last
level cache.
- In these cases, m is small (<= 4 * MR) and the n value is such that the
number of NR blocks (n/NR) is less than the number of available threads
nt (e.g. < 0.5 * nt). In such cases, the ids of the threads can be
modified so that the number of threads with adjacent ids bound to the
same last level cache is reduced. This resembles the spread pattern used
in thread-to-core binding. It reduces the load on the last level cache
due to reorder buffer accesses and improves performance in these cases.
A heuristic method is used to detect whether the thread-to-core binding
follows the close pattern before applying the thread id modifications.
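The heuristic and the id remapping described above can be sketched as
follows (hypothetical helpers, not the actual LPGEMM code; per_group
stands for an assumed number of close-bound threads sharing one last
level cache, and MR/NR for the usual micro-kernel block sizes):

```c
#include <stdbool.h>

/* Heuristic from above: small m (<= 4 * MR) and fewer NR blocks than
 * half the available threads. */
static bool should_spread(int m, int n, int nt, int MR, int NR)
{
    return (m <= 4 * MR) && ((n / NR) < nt / 2);
}

/* Remap a thread id so that consecutive logical ids land in different
 * last-level-cache groups, approximating a spread pattern on top of a
 * close binding. Assumes nt threads bound close across nt/per_group
 * LLC groups of per_group threads each. */
static int spread_tid(int tid, int nt, int per_group)
{
    int groups = nt / per_group;
    if (groups <= 1 || nt % per_group != 0)
        return tid; /* nothing to gain, keep the original id */
    /* Stride the ids across groups: 0, S, 1, S+1, ... for 2 groups
     * of size S, so adjacent ids hit different LLCs. */
    return (tid % groups) * per_group + tid / groups;
}
```

For 8 threads bound close over 2 LLC groups of 4, the remapped ids
alternate between the groups, so the B reorder buffer traffic of
adjacent logical ids no longer lands on the same last level cache.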
AMD-Internal: [SWLCSG-3185]
Change-Id: Ie3c87d56e0f7b59161a381f382cf4e2d5d02a591