- Disabled topology detection as libgomp is not honoring
the standard function omp_get_place_proc_ids
- Added B prefetch in bf16 B packing kernels
AMD-Internal: SWLCSG-3761
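As a rough illustration of the B-prefetch idea (not the actual bf16
packing kernel; the layout, names, and prefetch distance below are
assumptions), a packing loop can issue a software prefetch for a row of
B that will be packed a few iterations later, so the data is already in
cache when the copy reaches it:

```c
#include <stdint.h>
#include <stddef.h>

/* Rows to prefetch ahead of the current one; a tuning parameter
 * (assumed value, not taken from the real kernel). */
#define PREFETCH_DIST 8

/* Illustrative row-major B packing loop with prefetch of upcoming
 * B rows. pb receives k x nr packed elements; b has leading
 * dimension ldb. */
static void pack_b(uint16_t *restrict pb, const uint16_t *restrict b,
                   size_t k, size_t nr, size_t ldb)
{
    for (size_t i = 0; i < k; i++)
    {
        /* Hint the hardware to fetch a future row of B (read access,
         * high temporal locality). */
        if (i + PREFETCH_DIST < k)
            __builtin_prefetch(b + (i + PREFETCH_DIST) * ldb, 0, 3);

        for (size_t j = 0; j < nr; j++)
            pb[i * nr + j] = b[i * ldb + j];
    }
}
```

The prefetch is only a hint; correctness does not depend on it, so the
distance can be tuned per microarchitecture without changing results.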
- The current implementation of the topology detector assumes that
the parallel region uses all the threads queried through
omp_get_max_threads(). If the actual parallelism in the function is
limited (lower than this expectation), the code may access an
unallocated memory section (through uninitialized pointers).
- This was because every thread (each having its own pointer) sets its
initial value to NULL inside the parallel section, thereby leaving
some pointers uninitialized if the associated thread is not spawned.
- Also, the current implementation would use negative indexing (with -1)
if any associated thread was not spawned.
- Fix: set every thread-specific pointer to NULL outside the parallel
region, using calloc(). As long as we have NULL checks for pointers
before accessing through them, no issues will be observed. Also avoid
incurring the topology detection cost if all the required threads
are not spawned (thereby avoiding potential negative indexing when
using the core-group ID).
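A minimal sketch of the fix (illustrative names only; the real detector
records more than a core id): the per-thread pointer array is
zero-initialized with calloc() before the parallel region, so slots
belonging to threads that never spawn stay NULL and are skipped by the
NULL checks, and the caller can bail out when fewer threads than
expected materialized:

```c
#include <stdlib.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also builds without OpenMP. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical per-thread topology record. */
typedef struct { int core_id; } thread_topo_t;

/* Returns the number of threads that actually materialized, or -1 on
 * allocation failure. Topology data should only be trusted (and the
 * detection cost only paid) when the return value equals nt. */
static int detect_topology(int nt)
{
    /* Fix: zero-initialize every per-thread slot OUTSIDE the parallel
     * region, so slots of threads that never spawn stay NULL instead
     * of holding indeterminate values. */
    thread_topo_t **topo = calloc((size_t)nt, sizeof(*topo));
    if (topo == NULL) return -1;

    #pragma omp parallel num_threads(nt)
    {
        int tid = omp_get_thread_num();
        topo[tid] = malloc(sizeof(*topo[tid]));
        if (topo[tid] != NULL)
            topo[tid]->core_id = tid; /* placeholder for real query */
    }

    int spawned = 0;
    for (int i = 0; i < nt; i++)
    {
        /* NULL check before any access through the pointer. */
        if (topo[i] != NULL) { spawned++; free(topo[i]); }
    }
    free(topo);
    /* Caller skips topology detection when spawned < nt, which also
     * avoids the negative-indexing path entirely. */
    return spawned;
}
```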
AMD-Internal: [SWLCSG-3573]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Bhaskar, Nallani <Nallani.Bhaskar@amd.com>
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- In multi-threaded cases, if a packed/close pattern thread-to-core
binding is used (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores|threads),
LPGEMM (OMP framework) launches threads such that threads with adjacent
ids are bound to nearby (even adjacent) cores. Depending on the
processor architecture, multiple threads with adjacent ids can be bound
to cores sharing the same last level cache. However, it was observed that
when these threads (with adjacent ids) access the B reorder buffer, the
last level cache access was suboptimal. This can be attributed to the
per-thread reorder buffer block accesses and how they map to the last
level cache.
- In these cases, m is small (<= 4 * MR) and the n value is such that the
number of NR blocks (n/NR) is less than the number of available threads
nt (e.g. < 0.5 * nt). In such cases, the ids of the threads can be
modified so that the number of threads with adjacent ids bound to the
same last level cache is reduced. This resembles the spread pattern used
in thread-to-core binding. It reduces the load on the last level cache
due to reorder buffer accesses and improves performance in these cases.
A heuristic method is used to detect whether the thread-to-core binding
follows the close pattern before applying the thread id modifications.
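The heuristic and the id remapping described above can be sketched as
follows (hypothetical helpers, not the actual LPGEMM code; per_group
stands for an assumed number of close-bound threads sharing one last
level cache, and MR/NR for the usual micro-kernel block sizes):

```c
#include <stdbool.h>

/* Heuristic from above: small m (<= 4 * MR) and fewer NR blocks than
 * half the available threads. */
static bool should_spread(int m, int n, int nt, int MR, int NR)
{
    return (m <= 4 * MR) && ((n / NR) < nt / 2);
}

/* Remap a thread id so that consecutive logical ids land in different
 * last-level-cache groups, approximating a spread pattern on top of a
 * close binding. Assumes nt threads bound close across nt/per_group
 * LLC groups of per_group threads each. */
static int spread_tid(int tid, int nt, int per_group)
{
    int groups = nt / per_group;
    if (groups <= 1 || nt % per_group != 0)
        return tid; /* nothing to gain, keep the original id */
    /* Stride the ids across groups: 0, S, 1, S+1, ... for 2 groups
     * of size S, so adjacent ids hit different LLCs. */
    return (tid % groups) * per_group + tid / groups;
}
```

For 8 threads bound close over 2 LLC groups of 4, the remapped ids
alternate between the groups, so the B reorder buffer traffic of
adjacent logical ids no longer lands on the same last level cache.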
AMD-Internal: [SWLCSG-3185]
Change-Id: Ie3c87d56e0f7b59161a381f382cf4e2d5d02a591