amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-06-29 02:37:05 +00:00

Author	SHA1	Message	Date
Smyth, Edward	23b48bb999	Enable support for OpenMP 2.5 and earlier Add compatibility for OpenMP implementations (e.g., MSVC, older GCC) that lack functions introduced in OpenMP 3.0 i.e. omp_get_active_level() and omp_get_max_active_levels(). On these compilers, the tests instead are based on the older omp_get_nested() functionality. Thanks to @tony-davis for highlighting this issue. AMD-Internal: [CPUPL-7303]	2026-03-06 09:34:17 +00:00
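The fallback described in commit 23b48bb999 might look roughly like the sketch below. This is an illustration, not the code from the commit: the helper name `new_region_would_be_active` is hypothetical, and the version test assumes the standard `_OPENMP` date macro (200805 for OpenMP 3.0). The block also compiles without OpenMP, in which case it always reports that no parallel region is possible.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not a BLIS function): returns 1 if a parallel
   region started by the calling thread could run with more than one
   thread, and 0 otherwise. */
int new_region_would_be_active( void )
{
    int active;
#if defined(_OPENMP) && _OPENMP >= 200805
    /* OpenMP 3.0 and later: compare the current active nesting level
       with the maximum number of active levels allowed. */
    active = omp_get_active_level() < omp_get_max_active_levels();
#elif defined(_OPENMP)
    /* OpenMP 2.5 and earlier: only omp_in_parallel() and
       omp_get_nested() are available, so fall back to those. */
    active = !omp_in_parallel() || omp_get_nested();
#else
    /* Built without OpenMP: no parallel region can be created. */
    active = 0;
#endif
    assert( active == 0 || active == 1 );
    return active;
}
```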
Varaganti, Kiran	bb6545a46b	Added new thread control API with global and thread-local variants CPUPL-7578: New thread control API with global and thread-local variants Summary: Add new BLIS thread control APIs that provide fine-grained control over threading with proper global and thread-local (TLS) semantics. Fix several correctness issues where set_num_threads() and set_ways() did not properly override each other's state. New/Modified APIs: bli_thread_set_num_threads() — Sets thread count globally (updates both global_rntm and tl_rntm) bli_thread_set_num_threads_local() — Sets thread count for calling thread only (tl_rntm) bli_thread_get_num_threads() — Returns effective thread count, deriving from ways if set bli_thread_reset() — Resyncs tl_rntm from global_rntm bli_thread_set_ways() — Sets loop factorization (jc, pc, ic, jr, ir) bli_thread_get_is_parallel() — Returns whether parallelism is enabled bli_thread_get_jc_nt/ic_nt/pc_nt/jr_nt/ir_nt() — Returns individual way values b77_thread_set_num_threads_local_() — Fortran-compatible wrapper Bug fixes: bli_thread_set_num_threads() now clears ways (-1) and sets auto_factor=TRUE on both global_rntm and tl_rntm, so it properly overrides prior BLIS_JC_NT/BLIS_IC_NT environment settings bli_thread_set_ways() now propagates to global_rntm (inside mutex) and clears stale num_threads on both global_rntm and tl_rntm, so get_num_threads() returns the product of ways instead of a stale value Fix data race in bli_thread_init_rntm_from_global_rntm() — copy global_rntm under mutex before debug printing Fix data race in set_num_threads_local() debug print Test suite (43 tests, 106 assertions): test_thread_control.c (OpenMP, 23 tests): environment inheritance, global propagation, thread-local isolation, local precedence, per-thread local, reset, nested parallel, edge cases, set_ways, is_parallel, concurrent updates, DGEMM with threads, interleaved settings, persistence, parallel DGEMM, thread pool, reset-to-sync, env ways vs 
set_num_threads, ways→set_nt→reset, ways→local→reset, round-trip, set_nt→set_ways override, set_ways propagation to new threads test_thread_control_pthread.c (pthread, 20 tests): equivalent coverage plus concurrent set/reset race condition test, set_nt→set_ways override, set_ways propagation via pthread_create Files changed (9 files, +2630/-29 lines): bli_thread.c — Core API implementations and fixes bli_thread.h — New function declarations b77_thread.c — Fortran wrapper test_thread_control.c — OpenMP test suite (23 tests) test_thread_control_pthread.c — pthread test suite (20 tests) TEST_THREAD_CONTROL_README.md — Documentation AMD-Internal: CPUPL-7578	2026-03-06 12:16:17 +05:30
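The global versus thread-local semantics listed in commit bb6545a46b can be modelled with a small toy, shown below. Everything here is an assumption made for illustration: the `toy_` names, the struct layout and the choice to clear `auto_factor` in `toy_set_ways` are hypothetical and are not the BLIS implementation. The mutex that the commit uses around the global copy is omitted.

```c
#include <assert.h>

/* Toy model only; the real rntm_t carries more state. */
typedef struct
{
    int num_threads;  /* -1 means "not set" */
    int ways[5];      /* jc, pc, ic, jr, ir; -1 means "not set" */
    int auto_factor;  /* 1: derive ways from num_threads */
} toy_rntm_t;

static toy_rntm_t global_rntm = { 1, { -1, -1, -1, -1, -1 }, 1 };
static _Thread_local toy_rntm_t tl_rntm;
static _Thread_local int tl_ready = 0;

/* Lazily initialise the calling thread's copy from the global copy. */
static toy_rntm_t* tl( void )
{
    if ( !tl_ready ) { tl_rntm = global_rntm; tl_ready = 1; }
    return &tl_rntm;
}

/* Setting a thread count clears the ways and re-enables auto factorization. */
static void set_nt( toy_rntm_t* r, int nt )
{
    assert( nt > 0 );
    r->num_threads = nt;
    for ( int i = 0; i < 5; ++i ) r->ways[i] = -1;
    r->auto_factor = 1;
}

/* Setting ways clears any stale thread count. */
static void set_ways( toy_rntm_t* r, const int w[5] )
{
    for ( int i = 0; i < 5; ++i ) r->ways[i] = w[i];
    r->num_threads = -1;
    r->auto_factor = 0;  /* assumption made for this sketch */
}

void toy_set_num_threads( int nt )       { set_nt( &global_rntm, nt ); set_nt( tl(), nt ); }
void toy_set_num_threads_local( int nt ) { set_nt( tl(), nt ); }
void toy_reset( void )                   { tl_rntm = global_rntm; tl_ready = 1; }

void toy_set_ways( int jc, int pc, int ic, int jr, int ir )
{
    const int w[5] = { jc, pc, ic, jr, ir };
    set_ways( &global_rntm, w );
    set_ways( tl(), w );
}

/* Effective thread count: the explicit value if set, else the product of ways. */
int toy_get_num_threads( void )
{
    const toy_rntm_t* r = tl();
    if ( r->num_threads > 0 ) return r->num_threads;
    int nt = 1;
    for ( int i = 0; i < 5; ++i ) nt *= ( r->ways[i] > 0 ? r->ways[i] : 1 );
    return nt;
}
```

In this model a local override is undone by `toy_reset()`, and a later `toy_set_num_threads()` overrides earlier ways, mirroring the override behaviour the commit message describes.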
Varaganti, Kiran	713b09b407	Remove unnecessary barrier in sup path decorator to fix ~10% DGEMM regression The bli_thread_barrier(thread) call before bli_l3_sup_thrinfo_free() in bli_l3_sup_thread_decorator() was added by analogy with the conventional path's PR #702 fix, but is not needed in the sup (small/unpacked) path. In the conventional path, pack buffers are cached in the control tree (cntl_t->pack_mem) and freed in the decorator after func() returns. A barrier is required there to prevent a fast chief from releasing a pack buffer back to the PBA pool while slower peers in a different sub-group still read from it. The sup path does not have this problem because: 1. Pack buffers are stack-local variables (mem_t in var2m), freed inside func() by packm_sup_finalize_mem() after internal loop barriers. They are never freed in this decorator. 2. The global communicator (gl_comm) is freed outside the parallel region, protected by the implicit OpenMP barrier at the closing brace of the parallel construct. 3. Sub-group communicators (created when packa/packb is enabled) are freed only by the ochief thread in bli_thrinfo_free(). Non-chief threads never dereference the shared communicator — they only read their own ocomm_id and free_comm fields. When neither matrix is packed, no sub-communicators exist (ocomm=NULL, free_comm=FALSE). The custom spin-wait barrier (bli_thread_barrier) is significantly slower than the OpenMP runtime barrier at high thread counts, causing a ~10% DGEMM performance regression at 96 threads on AMD EPYC Turin (e.g. 11000x300x200 DGEMM). Ref: https://github.com/flame/blis/pull/702 Resolves: [CPUPL-7979] [SWLCSG-3951] [LWPHPCENGG-622]	2026-03-05 11:44:57 +05:30
Smyth, Edward	011c75dddb	Remove unnecessary OpenMP include (AOCL) Copy of similar change in upstream BLIS (843a5e8) to fix issues https://github.com/flame/blis/issues/873 and https://github.com/amd/blis/issues/50 Details: - Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the framework could access the necessary OpenMP functions. - As @melven reported (#873), this causes issues when `blis.h` is included in C++ code since the `<omp.h>` include happens with `extern "C"`. - Move the include from the header to the necessary .c files so that it does not "pollute" `blis.h`. Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in AOCL BLIS AMD-Internal: [CPUPL-7303]	2026-02-06 10:41:38 +00:00
Varaganti, Kiran	bbb7edcb22	thread: free global communicator after parallel region completes in p… * thread: free global communicator after parallel region completes in pthreads decorator Avoid potential data race by deferring free until all threads have joined. Previously, chief thread could free inside while non-chief threads still held pointers. Now, frees after the parallel region, following barrier and joins. Files: - frame/thread/bli_l3_sup_decor_pthreads.c - frame/thread/bli_l3_decor_pthreads.c * AMD-Internal: [CPUPL-7694]	2025-12-09 19:15:52 +05:30
Varaganti, Kiran	8a84b2fb2c	Global Communicator is now freed outside the parallel region * Global Communicator is now freed outside the parallel region Description: // Root threads don't "own" the global communicator thrinfo_t* root = bli_thrinfo_create_root(comm, id, pool, pba); // Setting free_comm=FALSE makes it clear: "This thread doesn't own this resource" The thread that creates the communicator should be responsible for freeing it: // Framework creates global communicator thrcomm_t* gl_comm = bli_thrcomm_create(n_threads); // Framework should clean it up, not individual threads // (Even though only chief would actually do the cleanup) Global communicators: Created by framework → free_comm=FALSE Local communicators: Created by threads → free_comm=TRUE Setting free_comm=FALSE provides an extra safety layer - if the chief thread logic ever changes, root threads won't accidentally try to free global communicators. The current implementation has the framework handle global communicator cleanup: // We shouldn't free the global communicator since it was already freed // by the global communicator's chief thread in bli_l3_thrinfo_free() Technically root threads could have free_comm=TRUE and still be safe due to the chief thread protection, but the current change uses FALSE for better semantic clarity and architectural consistency. Made all changes to align with this design. [CPUPL-7577] * Removed old comments * Applied similar changes to sequential code path * For single thread we use global BLIS_SINGLE_COMM variable instead of allocating memory from sba pool * Fixed comments * Cleanup comments	2025-12-05 15:52:08 +05:30
Varaganti, Kiran	9230c978a1	Fixed Data Race in Native code-path (#251 ) ```c _Pragma( "omp parallel num_threads(n_threads)" ) { // ... thread work ... // Free the current thread's thrinfo_t structure. bli_l3_thrinfo_free( rntm_p, thread ); // Line 183 } // * MISSING BARRIER HERE! * // Check the array_t back into the small block allocator... bli_sba_checkin_array( array ); // Line 200 ``` ```c // DANGEROUS execution timeline: Thread 0 (chief): completes func() calls bli_l3_cntl_free() calls bli_l3_thrinfo_free() → frees gl_comm ✓ exits OpenMP parallel region calls bli_sba_checkin_array(array) → frees array ✗ Thread 1,2,3 (still executing): still in func() or bli_l3_cntl_free() trying to access freed gl_comm → CRASH! trying to access freed array pools → CRASH! ``` This is exactly the same issue that PR #702 fixed in other files! The function needs a barrier before threads exit the parallel region to ensure: 1. All threads complete their work before any cleanup starts 2. Global communicator isn't freed while other threads are using it 3. Array pools aren't freed while other threads are accessing them	2025-11-07 10:49:19 +05:30
Varaganti, Kiran	7ac261b173	Replaced omp barrier with bli_thread_barrier and added similar fix fo… (#248 ) * Replaced omp barrier with bli_thread_barrier and added similar fix for pack compute routine ## The Sequence is Critical: 1. All threads execute `func(...)` - may access shared resources including communicators 2. Barrier - ensures ALL threads finish their work 3. Then each thread calls `bli_l3_sup_thrinfo_free()` - only chief actually frees gl_comm 4. Safe cleanup - no use-after-free because all threads are done using gl_comm ## What Would Happen Without the Barrier: ```c // Thread execution timeline WITHOUT barrier: Time 1: Thread 0 finishes func() early → immediately calls bli_l3_sup_thrinfo_free() Time 2: Thread 0 (chief) frees gl_comm Time 3: Thread 1,2,3... still in func() → try to use freed gl_comm → CRASH! ``` ## With the Barrier (Current Safe Code): ```c // Thread execution timeline WITH barrier: Time 1: All threads finish func() at different times Time 2: ALL threads reach barrier → wait for slowest thread Time 3: ALL threads proceed past barrier together Time 4: Chief frees gl_comm → safe because no one using it anymore ```	2025-10-31 10:01:40 +05:30
Dave, Harsh	90d252d59a	Add OpenMP barrier before releasing threadinfo & global communicator to avoid race (#225) - Added `#pragma omp barrier` just before threads start releasing their threadinfo / global communicator. - This ensures all threads reach this sync point, preventing interleaved cleanup. Co-authored-by: harsdave <harsdave@amd.com>	2025-10-24 16:22:45 +05:30
S, Hari Govind	a9df3fd8d5	Adding bli_print_msg before bli_abort() for bli_thrinfo_sup_create_for_cntl - Adding bli_print_msg to print a failure message before bli_abort() in the bli_thrinfo_sup_create_for_cntl function.	2025-09-19 10:53:50 +05:30
S, Hari Govind	08c757202d	Initialize mem_t structures safely and handle NULL communicator in threading - Explicitly initialize all fields of mem_t structures in bli_znormfv_unb_var1 and bli_dnormfv_unb_var1 to prevent undefined behavior when memory is not allocated. - Add a NULL check after bli_thread_broadcast() in bli_thrinfo_sup_create_for_cntl to ensure that the communicator is valid, and call bli_abort() if broadcast fails.	2025-09-17 14:10:37 +05:30
Smyth, Edward	b5c66a9d8c	Implement bli_thread_reset (#32 ) BLIS-specific setting of threading takes precedence over OpenMP thread count ICV values, and if the BLIS-specific threading APIs are used, there was no way for the program to revert to OpenMP settings. This patch implements a function bli_thread_reset() to do this. This is similar to that implemented in upstream BLIS in commit `6dcf7666ef` More specifically, it reverts the internal threading data to that which existed when the program was launched, subject where appropriate to any changes in the OpenMP ICVs. In other words: - It will undo changes to threading set by previous calls to bli_thread_set_num_threads or bli_thread_set_ways. - If the environment variable BLIS_NUM_THREADS was used, this will NOT be cleared, as the initial state of the program is restored. - Changes to OpenMP ICVs from previous calls to omp_set_num_threads() will still be in effect, but can be overridden by further calls to omp_set_num_threads(). Note: the internal BLIS data structure updated by the threading APIs, including bli_thread_reset(), is thread-local to each user (e.g. application) thread. Example usage: omp_set_num_threads(4); bli_thread_set_num_threads(7); dgemm(...); // 7 threads will be used bli_thread_reset(); dgemm(...); // 4 threads will be used	2025-06-17 10:40:10 +01:00
Edward Smyth	3c2dedb13c	Restore or add update from env in bli_thread_get APIs We want bli_thread_get_num_threads() and bli_thread_get_*_nt() to report the threading values modified to reflect what will be in effect given OpenMP nesting and active levels. This was lost in commit `0c6d006225` for bli_thread_get_num_threads() and wasn't previously implemented in bli_thread_get_*_nt() AMD-Internal: [CPUPL-6168] Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd	2025-01-20 08:58:22 -05:00
Edward Smyth	0c6d006225	Changes to rntm to reduce mutex operations Change usage of global_rntm and tl_rntm to eliminate the need for mutex operations when accessing global_rntm. Usage of these data structures is now as follows: * global_rntm is set once during bli_init_apis and includes all getenv calls to check BLIS threading and error printing environment variables. global_rntm is then read-only. * tl_rntm is initialized once from global_rntm on each application thread. Any calls to BLIS set threading/ways APIs will update tl_rntm for that application thread only (Previously they updated global_rntm for all application threads). * Re-initialize info_value in tl_rntm in every call to bli_init APIs. * In bli_rntm_init_from_global() we initialize the local (per API call) rntm as a copy of tl_rntm and then update threading values in bli_thread_update_rntm_from_env() to reflect the current status of OpenMP runtime ICVs. AMD-Internal: [CPUPL-6168][SWLCSG-3143] Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6	2024-12-16 04:45:26 -05:00
Kiran Varaganti	3e2795f406	OpenMP barrier overhead bug fix In the function bli_thread_update_rntm_from_env(), a mutex is used for reading global_rntm: "bli_pthread_mutex_lock( &global_rntm_mutex );" This causes a regression when the application is multithreaded. The regression is caused by this mutex: when two threads are launched and one acquires the mutex, the second stalls until the first releases it, so the second thread arrives later at the OpenMP barrier in the application, increasing the OpenMP barrier overhead. Things get worse as more threads are launched. Thanks to rocHPL for sharing the standalone panelfact application to reproduce this issue. Thanks to @Edward Smyth (edward.smyth@amd.com) for finding this bug. [SWLCSG-3143]	2024-11-22 15:36:30 +05:30
Hari Govind S	6dd8f06aff	Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set - Reverted the change done for tuning ddotv API. When number of threads is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads are not calculated and as a result number of threads value is -1. OpenMP threads are launched with -1 value. This results in crash. This bug is fixed by correctly calculating number of threads. AMD-Internal: [SWLCSG-3028][CPUPL-5689] Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a	2024-08-29 05:42:03 -04:00
Edward Smyth	89f52a6df5	Code cleanup: spelling corrections Corrections for spelling and other mistakes in code comments and doc files. AMD-Internal: [CPUPL-4500] Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f	2024-08-05 16:18:51 -04:00
Edward Smyth	82bdf7c8c7	Code cleanup: Copyright notices - Standardize formatting (spacing etc). - Add full copyright to cmake files (excluding .json) - Correct copyright and disclaimer text for frame and zen, skx and a couple of other kernels to cover all contributors, as is commonly used in other files. - Fixed some typos and missing lines in copyright statements. AMD-Internal: [CPUPL-4415] Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371	2024-08-05 15:35:08 -04:00
Moripalli Chitra	cb915c241d	Tuning ddotv API - Modifying threading framework for L1 APIs to update only number of threads from runtime env and avoid overhead of reading other ICVs. - Removing bli_arch_set_id_once() from bli_arch_set_id_once() flow as bli_arch_check_id_once() calls it. AMD-Internal: [CPUPL-4877] Change-Id: I87b346825a96d74e746a41530b6d22ae162f19ba	2024-06-18 19:31:17 +05:30
Edward Smyth	62c886feee	Export some BLIS internal symbols AOCL libFLAME optimizations directly call some internal BLIS symbols. Export them to enable this to work with the BLIS shared library. AMD-Internal: [CPUPL-5044] Change-Id: Icb62dcb51e12d72dde8434593ab17de3c227c93d	2024-05-08 12:51:32 -04:00
Edward Smyth	ed5010d65b	Code cleanup: AMD copyright notice Standardize format of AMD copyright notice. AMD-Internal: [CPUPL-3519] Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0	2023-11-23 08:54:31 -05:00
Edward Smyth	f471615c66	Code cleanup: No newline at end of file Some text files were missing a newline at the end of the file. One has been added. AMD-Internal: [CPUPL-3519] Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce	2023-11-22 17:11:10 -05:00
Eashan Dash	e4e4fe55fb	Added Parameter Checks and DTL Trace for Extension APIs 1. Added input parameter checking for the extension APIs 1. gemm_pack_get_size API 2. gemm_pack API 2. Additionally added early returns for these APIs when m or n dimensions are 0. 3. Routines for input parameter check for all the 3 BLAS extension APIs - gemm_pack_get_size, gemm_pack and gemm_compute are defined in: frame/compat/check/bla_gemm_pack_compute_check.h 4. Added AOCL DTL TRACE for all the functions of 1. gemm_pack_get_size 2. gemm_pack 3. gemm_compute AMD-Internal: [CPUPL-3560] Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631	2023-11-09 18:53:59 +05:30
Eleni Vlachopoulou	75a4d2f72f	CMake: Adding new portable CMake system. - A completely new system, made to be closer to Make system. AMD-Internal: [CPUPL-2748] Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529	2023-11-09 15:49:45 +05:30
Edward Smyth	9500cbee63	Code cleanup: spelling corrections Corrections for some spelling mistakes in comments. AMD-Internal: [CPUPL-3519] Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba	2023-11-09 00:16:30 -05:00
Eashan Dash	c3d1a3878c	Parallelized Pack and Compute Extension APIs 1. OpenMP based multi-threading parallelism is added for BLAS extension APIs of Pack and Compute 2. Both pack and compute APIs are parallelized. 3. Multi-threading of pack and compute APIs done with different number of threads can lead to inconsistent results due to output difference of the full packed matrix buffer when packed with different number of threads. 4. In multi-threaded execution, we ensure output of packed buffer is exactly the same as in single threaded execution. 5. Similarly for compute API, read of packed buffer in multi-threaded execution is exactly the same as in single-threaded execution. 6. Routines are added to compute the offsets for thread workload distribution for MT execution. 1. The offsets are calculated in such a way that it resembles the reorder buffer traversal in single threaded reordering. 2. The panel boundaries (KCxNC) remain as it is accessed in single thread, and as a consequence a thread with jc_start inside the panel cannot consider NC range for reorder. 3. It has to work with NC' < NC, and the offset is calculated using prev NC panels spanning k dim + cur NC panel spanning pc loop cur iteration + (NC - NC') spanning current kc0 (<= KC). 7. Routines to ensure the same are added for MT execution 1. frame/base/bli_pack_compute_utils.c 2. frame/base/bli_pack_compute_utils.h AMD-Internal: [CPUPL-3560] Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac	2023-11-03 08:47:17 -04:00
Edward Smyth	6d0444497f	Improvements to xerbla functionality The following improvements have been implemented: - Option to stop in xerbla on error. This is controlled by setting the environment variable BLIS_STOP_ON_ERROR=1 - Option to disable printing of error message from BLIS. This is controlled by setting the environment variable BLIS_PRINT_ON_ERROR=0 - Added a function to return the value of INFO passed to xerbla, assuming xerbla was not set to stop on error. Example call is info = bli_info_get_info_value(); The default behaviour remains to print but don't stop on error, i.e. the equivalent to export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0 Implementation details: - Values of the environment variables are stored and retrieved from global_rntm. - Info value is stored and retrieved from tl_rntm. It is set to 0 during initialization for all calls and updated by xerbla if an error has occurred. - Call to bli_init_auto before calling PASTEBLACHK macro (which calls xerbla) will reinitialize info_value to 0 via call to bli_thread_update_rntm_from_env AMD-Internal: [CPUPL-3520] Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d	2023-10-16 08:48:51 -04:00
Arnav Sharma	c8f14edcf5	BLAS Extension API - ?gemm_compute() - Added support for 2 new APIs: 1. sgemm_compute() 2. dgemm_compute() These are dependent on the ?gemm_pack_get_size() and ?gemm_pack() APIs. - ?gemm_compute() takes the packed matrix buffer (represented by the packed matrix identifier) and performs the GEMM operation: C := A * B + beta * C. - Whenever the kernel storage preference and the matrix storage scheme isn't matching, and the respective matrix being loaded isn't packed either, on-the-go packing has been enabled for such cases to pack that matrix. - Note: If both the matrices are packed using the ?gemm_pack() API, it is the responsibility of the user to pack only one matrix with alpha scalar and the other with a unit scalar. - Note: Support is presently limited to Single Thread only. Both, pack and compute APIs are forced to take n_threads=1. AMD-Internal: [CPUPL-3560] Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158	2023-10-16 08:18:52 -04:00
Vignesh Balasubramanian	81161066e5	Multithreading the DNRM2 and DZNRM2 API - Updated the bli_dnormfv_unb_var1( ... ) and bli_znormfv_unb_var1( ... ) function to support multithreaded calls to the respective computational kernels, if and when the OpenMP support is enabled. - Added the logic to distribute the job among the threads such that only one thread has to deal with fringe case(if required). The remaining threads will execute only the AVX-2 code section of the computational kernel. - Added reduction logic post parallel region, to handle overflow and/or underflow conditions as per the mandate. The reduction for both the APIs involve calling the vectorized kernel of dnormfv operation. - Added changes to the kernel to have the scaling factors and thresholds prebroadcasted onto the registers, instead of broadcasting every time on a need basis. - Non-unit stride cases are packed to be redirected to the vectorized implementation. In case the packing fails, the input is handled by the fringe case loop in the kernel. - Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... ) and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe cases of size = 2 ( and ) size = 1 or non-unit strides respectively. AMD-Internal: [CPUPL-3916][CPUPL-3633] Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d	2023-10-16 07:26:27 -04:00
eashdash	30bdeecbcc	Added BLAS Extension APIs - Get Size and Pack API 1. 4 new APIs are added to support packed compute GEMM operations 1. dgemm_pack_get_size 2. sgemm_pack_get_size 3. dgemm_pack 4. sgemm_pack 2. Pack_get_size API 1. Returns size in bytes required for packing of input 2. Requires identifier to identify the input matrix to be packed 3. Additionally requires 3 integer parameters for input dimensions 3. Packed buffer is allocated using the pack size computed 4. Pack API: 1. Performs full matrix packing of the input 2. Additionally, performs the alpha scaling 3. Packed buffer created contains the full packed matrix 5. The GEMM compute calls are required to be operated on the packed buffer with alpha = 1 since alpha scaling is already done by the Pack API 6. The GEMM Pack API eliminates the cost of packing the input matrices by avoiding on-the-go packing in the GEMM 5 loop. Input matrices are packed when there is reuse of matrices across different GEMM calls. AMD-Internal: [CPUPL-3560] Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3	2023-10-04 06:43:59 -04:00
Edward Smyth	bb4c158e63	Merge commit 'b683d01b' into amd-main * commit 'b683d01b': Use extra #undef when including ba/ex API headers. Minor preprocessor/header cleanup. Fixed typo in cpp guard in bli_util_ft.h. Defined eqsc, eqv, eqm to test object equality. Defined setijv, getijv to set/get vector elements. Minor API breakage in bli_pack API. Add err_t* "return" parameter to malloc functions. Always stay initialized after BLAS compat calls. Renamed membrk files/vars/functions to pba. Switch allocator mutexes to static initialization. AMD-Internal: [CPUPL-2698] Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df	2023-08-21 07:01:38 -04:00
Shubham Sharma	0000cc88de	Removed local copy of cntx in TRSM - TRSM and GEMM has different blocksizes in zen4, in order to accommodate this, a local copy of cntx was created in TRSM. - Local copy of cntx has been removed and TRSM blocksizes are stored in cntx->trsmblkszs. - Functions to override and restore default blocksizes for TRSM are removed. Instead of overriding the default blocksizes, TRSM blocksizes are stored separately in cntx. - Pack buffers for TRSM have to be packed with TRSM blocksizes and GEMM pack buffers have to be packed with default blocksizes. To check if we are packing for TRSM, "family" argument is added in bli_packm_init_pack function. - BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if it is not set then BLIS_GEMM_UKR has to be used. This functionality has been added to all TRSM macro kernels. - Methods to retrieve TRSM blocksizes from cntx are added to bli_cntx.h. - Tests for micro kernels are modified to accommodate the change in signature of bli_packm_init_pack. AMD-Internal: [CPUPL-3781] Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a	2023-08-16 08:09:01 -04:00
Harihara Sudhan S	9272d3c778	Bug fix in work load distribution among the given threads - In level-1 kernels, with multi-threading enabled, only the partial job was getting executed. - The bug was in bli_thread_vector_partition and occurred only when minimum work for a thread >= 1 i.e., when the number of threads launched is less than number of elements and the number of elements is not a multiple of the number of threads launched. AMD-Internal: [CPUPL-3231] Change-Id: Ie20abb93468282cd6ac2372267714fb80c26d7cc	2023-04-18 10:16:09 -04:00
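The failure mode fixed in commit 9272d3c778 (part of the vector left unprocessed when the element count is not a multiple of the thread count) is easy to hit in any work-splitting routine. The sketch below shows one common way to split `n` elements over `nt` threads so that every element is assigned exactly once. It is not the BLIS code: `toy_vector_partition` and its signature are hypothetical, and BLIS may distribute the remainder differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the signature of bli_thread_vector_partition.
   Computes the contiguous range [*start, *start + *len) owned by thread
   tid when n elements are split across nt threads. */
void toy_vector_partition( size_t n, size_t nt, size_t tid,
                           size_t* start, size_t* len )
{
    assert( nt > 0 && tid < nt );

    size_t base = n / nt;  /* minimum work per thread */
    size_t rem  = n % nt;  /* leftover elements to hand out */

    /* The first rem threads each take one extra element, so the
       leftover is never dropped. */
    *len   = base + ( tid < rem ? 1 : 0 );
    *start = tid * base + ( tid < rem ? tid : rem );
}
```

With `n = 10` and `nt = 4` the ranges have lengths 3, 3, 2 and 2, and together they cover all ten elements.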
Harihara Sudhan S	32bbd96652	Moving AOCL Dynamic logic from BLIS impli layer Threading related changes -------------------------- - Created function bli_nthreads_l1 that dispatches the AOCL dynamic logic for a L1 function based on the kernel ID and input datatypes. - bli_nthreads_l1 gets the number of threads to be launched from the rntm variable. - Added aocl_'ker?'_dynamic function for DAXPYV, DSCALV, ZDSCALV and DDOTV. This function contains the AOCL dynamic logic for the respective kernels. - Added handling for cases when number of elements (n) is less than number of threads spawned (nt) in AOCL dynamic. - Added function bli_thread_vector_partition that calculates the amount of work the calling thread is supposed to perform on a vector. Interface changes ----------------- - In BLIS impli layer of DSCALV, ZDSCALV and AXPYV, added logic to pick kernel based on architecture ID and removed AVX2 flag check. - Modified function signature of ZDSCALV. Alpha is passed as dcomplex and only the real part of the alpha passed is used inside the kernel. The change was done to facilitate kernel dispatch based on arch ID. - Added n <= 0, BLAS exception in BLAS layer of DAXPYV and DDOTV. Without this multithreaded code might crash because of minimum work calculation. Misc ----- - Removed unused variables from ZSCAL2V and AXPYV kernels. AMD-Internal: [CPUPL-3095] Change-Id: I4fc7ef53d21f2d86846e86d88ed853deb8fe59e9	2023-04-14 02:05:38 -04:00
Edward Smyth	82c2eb4e8e	Code cleanup and warnings fixes Corrections for some occurrences of: - Compiler warnings about initialization of float from double - Spelling mistakes in comments - Incorrect indentation of code and comments AMD-Internal: [CPUPL-2870] Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc	2023-01-09 04:34:52 -05:00
Edward Smyth	2592774fe8	BLIS: Nested Parallelism issues (3) Bugfix for parallel BLAS1 and BLAS2 routines. Threading information was not being set correctly when initializing local rntm from global. Also ensure th_rntm is initialized along with global_rntm by updating it in bli_thread_init(), called by bli_init_once() AMD-Internal: [CPUPL-2433] Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf	2022-12-13 12:38:40 -05:00
Edward Smyth	345aacf806	BLIS: Nested parallelism issues (2) Improvements to recent parallelism changes: 1. BLIS specific threading options: In bli_thread_update_rntm_from_env() set threading variables in tl_rntm to serial values when OpenMP level for parallelism within BLIS will not be active. User supplied BLIS threading values remain unchanged in global_rntm. 2. Simplify code structure in bli_thread_update_rntm_from_env(). 3. Change variable declarations in bli_thread_init_rntm_from_env() and bli_thread_update_rntm_from_env() to avoid unused variable warnings in non-OpenMP builds. AMD-Internal: [CPUPL-2433] Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd	2022-12-07 04:34:07 -05:00
Edward Smyth	34730a1e4c	BLIS: Nested parallelism issues 1. Check OpenMP active level against max active levels when setting number of threads for starting a new parallel region in ./frame/thread/bli_thread.c to ensure the correct number of threads is used when BLIS is called within nested OpenMP parallelism. 2. In subsequent BLIS calls, threading choices could be incorrectly set based on values used and stored in global_rntm by a previous call. This could apply when the OpenMP number of threads differ from call to call, different nested parallelism is used in different parts of a user's code, or different threads at the user level request different numbers of OpenMP threads for BLIS calls. Keep threading information in both global_rntm and a new Thread Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime environment (as appropriate) during bli_init_auto() calls in each BLIS routine. The details are: * global_rntm is initialized on first BLIS call based on OpenMP and BLIS threading environment variables. * global_rntm is updated by any BLIS threading function calls. * In bli_thread_update_tl(), called by bli_init_auto(), sync with any BLIS values set or updated in global_rntm. Then, if BLIS threading control is not used, check OpenMP ICVs and set thread count and auto_factor appropriately. * Setting BLIS threading locally (using expert interfaces to pass a user defined rntm data structure) should work as before. 3. bli_thread_get_is_parallel can now only be called outside of parallelism within BLIS routines. Change calls in trsm to reflect this. 4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env() if any BLIS_*_NT environment variables are set. 5. Set auto_factor = FALSE when the number of threads is 1. 6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE. 7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init(). 8. 
For debugging, internal information on the rntm threading data can be printed by defining "PRINT_THREADING" at the top of bli_rntm.h 9. bli_rntm_print() now also prints the value of blis_mt. 10. Function prototypes in bli_rntm.h moved to top of file, so that bli_rntm_print() can be used within inline functions defined in this header file. 11. Comment out bli_init_auto() and bli_finalize_auto() calls in Fortran interfaces in frame/compat/blis/thread/b77_thread.c 12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and set_pack_b functions outside of the auto_factor if statements. 13. Misc code tidying. AMD-Internal: [CPUPL-2433] Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee	2022-10-21 07:38:39 -04:00
Sireesha Sanga	22af681a11	Runtime Thread Control Feature Update Details: 1. Runtime Thread Control Feature is enhanced to create a provision for the application to allocate a different number of threads to BLIS from the number of threads the application is using for itself. 2. In the previous implementation, if the application sets BLIS_NUM_THREADS with a valid value, BLIS internally calls omp_set_num_threads() API with the same value. Due to this, the application could not differentiate between the number of threads used in the BLIS library and in the application. 3. With the current solution, if the Application wants to allocate different numbers of threads for BLIS API and the application, the Application can choose either BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API for BLIS, and OpenMP APIs or environment variables for itself, respectively. 4. If BLIS_NUM_THREADS is set with a valid value, the same value will be used in the subsequent parallel regions unless bli_thread_set_num_threads() API is used by the Application to modify the desired number of threads during BLIS API execution. 5. Once BLIS_NUM_THREADS environment variable or bli_thread_set_num_threads(nt) API is used by the application, BLIS module would always give precedence to these values. BLIS API would not consider the values set using OpenMP API omp_set_num_threads(nt) or OMP_NUM_THREADS environment variable. 6. If BLIS_NUM_THREADS is not set, then if the Application is multithreaded and issued omp_set_num_threads(nt) with the desired number of threads, omp_get_max_threads() API will fetch the number of threads set earlier. 7. If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called by the application, but only OMP_NUM_THREADS is set, omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS. 8. If both environment variables are not set, or if they are set with invalid values, and omp_set_num_threads(nt) is not issued by the application, omp_get_max_threads() API will return the number of cores in the current context. 9. BLIS will initialize rntm->num_threads with the same value. However if omp_set_nested is false - BLIS APIs called from parallel threads will run sequentially. But if nested parallelism is enabled, then each application will launch MT BLIS. 10. Order of precedence used for number of threads: 0. value set using bli_thread_set_num_threads(nt) by the application 1. valid value set for BLIS_NUM_THREADS environment variable 2. omp_set_num_threads(nt) issued by the application 3. valid value set for OMP_NUM_THREADS environment variable 4. Number of cores 11. If nt is not a valid value for omp_set_num_threads(nt) API, number of threads would be set to 1. omp_get_max_threads() API will return 1. 12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled. AMD-Internal: [CPUPL-2342] Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6	2022-08-19 05:43:01 -04:00
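The order of precedence listed in item 10 of the commit above can be sketched as a small pure function. This is only an illustration under stated assumptions, not code from the BLIS sources: the names `parse_nt` and `resolve_num_threads` are hypothetical, the OpenMP runtime is replaced by plain arguments so the logic can be read in isolation, and the 1000000 upper bound on a "valid" value is an arbitrary choice.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a positive thread count from the text of an
   environment variable. Returns -1 when the text is missing or invalid.
   The upper bound of 1000000 is an arbitrary assumption. */
int parse_nt(const char *s)
{
    if (s == NULL || *s == '\0') return -1;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v < 1 || v > 1000000) return -1;
    return (int)v;
}

/* Hypothetical resolver following items 10 and 11 of the commit message.
   blis_api_nt: value passed to bli_thread_set_num_threads(), <= 0 if never set
   blis_env:    text of BLIS_NUM_THREADS, or NULL
   omp_api_nt:  0 if omp_set_num_threads() was never called, > 0 for a valid
                value, < 0 for an invalid value (item 11: resolves to 1)
   omp_env:     text of OMP_NUM_THREADS, or NULL
   ncores:      number of cores in the current context */
int resolve_num_threads(int blis_api_nt, const char *blis_env,
                        int omp_api_nt, const char *omp_env, int ncores)
{
    int nt;

    if (blis_api_nt > 0) return blis_api_nt;   /* 0. BLIS API call        */

    nt = parse_nt(blis_env);
    if (nt > 0) return nt;                     /* 1. BLIS_NUM_THREADS     */

    if (omp_api_nt > 0) return omp_api_nt;     /* 2. omp_set_num_threads  */
    if (omp_api_nt < 0) return 1;              /*    invalid nt: item 11  */

    nt = parse_nt(omp_env);
    if (nt > 0) return nt;                     /* 3. OMP_NUM_THREADS      */

    return ncores > 0 ? ncores : 1;            /* 4. number of cores      */
}
```

For instance, `resolve_num_threads(0, "4", 2, "16", 64)` yields 4 in this sketch, because a valid BLIS_NUM_THREADS outranks both OpenMP sources.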
mkadavil	31f8820bab	Bug fixes for OpenMP-based multi-threaded GEMM/GEMMT SUP path. - auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at runtime. Currently the auto_factor is enabled if num_threads > 0 and not reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm. This results in gemm/gemmt SUP path applying 2x2 factorization of num_threads, and thereby modifying the preset factorization. This issue is not observed in native path since factorization happens without checking auto_factor value. - Setting omp threads to n_threads using omp_set_num_threads after the global_rntm n_threads update in bli_thread_set_num_threads. This ensures that in bli_rntm_init_from_global, omp_get_max_threads returns the same value as set previously. AMD-Internal: [CPUPL-2137] Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426	2022-05-17 18:10:40 +05:30
Dipal M Zambare	e712ffe139	Added AOCL progress support for BLIS -- AOCL libraries are used for lengthy computations which can go on for hours or days; once the operation is started, the user doesn’t get any update on the current state of the computation. This (AOCL progress) feature enables the user to receive a periodic update from the libraries. -- The user registers a callback with the library if it is interested in receiving the periodic update. -- The library invokes this callback periodically with information about the current state of the operation. -- The update frequency is statically set in the code; it can be modified as needed if the library is built from source. -- This feature is supported for GEMM and TRSM operations. -- Added example for GEMM and TRSM. -- Cleaned up and reformatted test_gemm.c and test_trsm.c to remove warnings and make indentation consistent across the file. AMD-Internal: [CPUPL-2082] Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e	2022-05-17 18:10:39 +05:30
Sireesha Sanga	6a2c4acc66	Runtime Thread Control using OpenMP API Details: - During runtime, Application can set the desired number of threads using standard OpenMP API omp_set_num_threads(nt). - BLIS Library uses standard OpenMP API omp_get_max_threads() internally, to fetch the latest value set by the application. - This value will be used to decide the number of threads in the subsequent BLAS calls. - At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable will be given precedence, over the OpenMP standard API omp_set_num_threads(nt) and OMP_NUM_THREADS environment variable. - Order of precedence followed during BLIS Initialization is as follows 1. Valid value of BLIS_NUM_THREADS 2. omp_set_num_threads(nt) 3. valid value of OMP_NUM_THREADS 4. Number of cores - After BLIS initialization, if the Application issues omp_set_num_threads(nt) during runtime, number of threads set during BLIS Initialization, is overridden by the latest value set by the Application. - Existing precedence of BLIS_*_NT environment variables and the decision of optimal number of threads over the number of threads derived from the above process remains as it is. AMD-Internal: [CPUPL-2076] Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8	2022-05-17 18:09:22 +05:30
mkadavil	457c33a601	Eliminating barriers in SUP path when matrices are not packed. - Current gemm SUP path uses bli_thrinfo_sup_grow, bli_thread_range_sub to generate per thread data ranges at each loop of gemm algorithm. bli_thrinfo_sup_grow involves usage of multiple barriers for cross thread synchronization. These barriers are necessary in cases where either the A or B matrix is packed for centralized pack buffer allocation/deallocation (bli_thread_am_ochief thread). - However, for cases where both A and B matrices are unpacked, these barriers result in overhead for smaller dimensions. Here, creation of unnecessary communicators is avoided, and subsequently the requirement for barriers is eliminated when packing is disabled for both the input matrices in SUP path. Change-Id: Ic373dfd2d6b08b8f577dc98399a83bb08f794afa	2022-01-06 01:56:43 -05:00
Kiran Varaganti	d26089c665	Multi-threaded BLIS - OpenMP Apart from "BLIS_NUM_THREADS" or OMP_NUM_THREADS, number of threads can also be set by the application by calling omp_set_num_threads(int). In the function "bli_thread_init_rntm_from_env()" when environment variables are not set, number of threads is inferred by calling the API - omp_get_max_threads(). Now by default if OMP_NUM_THREADS or BLIS_NUM_THREADS are not set - it will run with omp_get_max_threads() threads. This feature is only enabled when BLIS is configured with OpenMP parallelization. Change-Id: Ic2b48bfcd33368e14758f2bb914c1545f7b0c3e6	2021-06-17 05:17:37 -04:00
Kiran Varaganti	c2abbcab96	Fix dgemm_ Multi-thread running as Single Thread Details: When parallelization is enabled in BLIS through environment variables BLIS_?C_NT or BLIS_?R_NT - dgemm_ is running as a single thread. This is fixed. Reason: when OMP_NUM_THREADS or BLIS_NUM_THREADS is not set, the num_threads parameter in rntm is -1 irrespective of BLIS_IC_NT or BLIS_JC_NT values; as a result, the dgemm_ interface assumes a single thread and calls small_gemm, which ends up running sequentially. Fix: added a new function bli_thread_is_parallel() in bli_thread.c; it returns 1 if parallelization is enabled either through BLIS_?C_NT values or BLIS_NUM_THREADS. It returns zero if sequential dgemm is needed. This function is called from dgemm_ to decide whether to call parallel dgemm_ or the sequential one. Add fix for zgemm_ also. Change-Id: Ia3064647fdd977cf7531ed52191a5a9704478573	2021-06-15 12:14:11 +05:30
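The check described in the commit above (parallel if either a thread count or any loop "way" requests more than one thread) might look roughly like the following. This is a minimal sketch, not the actual bli_thread_is_parallel(): `thread_cfg_t` is a hypothetical stand-in for the threading fields of BLIS's rntm_t, and the five-element `ways` array is assumed to cover the jc, pc, ic, jr and ir loops.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the threading fields of rntm_t.
   A value of -1 means "unset", as described in the commit message. */
typedef struct {
    int num_threads;   /* from BLIS_NUM_THREADS or OMP_NUM_THREADS */
    int ways[5];       /* assumed order: jc, pc, ic, jr, ir        */
} thread_cfg_t;

/* Returns true when either source of parallelism asks for more than one
   thread. Checking the ways as well is what keeps a configuration with
   num_threads == -1 but BLIS_IC_NT/BLIS_JC_NT set from being treated as
   sequential. */
bool thread_cfg_is_parallel(const thread_cfg_t *cfg)
{
    if (cfg->num_threads > 1) return true;
    for (int i = 0; i < 5; ++i) {
        if (cfg->ways[i] > 1) return true;
    }
    return false;
}
```

A caller in the position of dgemm_ would use the result to choose between a multi-threaded path and a small sequential kernel.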
lcpu	7401effc03	BLIS:merge: Fixed merge conflicts that arose while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
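The prime-number rule in the commit above can be illustrated with a short sketch. The commit message states the two conditions (the count is prime, and it exceeds a threshold that defaults to 11) but not the size of the reduction, so reducing by exactly one thread is an assumption here; `NT_MAX_PRIME`, `is_prime` and `adjust_prime_num_threads` are hypothetical names, not BLIS symbols.

```c
#include <stdbool.h>

/* Mirrors the default of BLIS_NT_MAX_PRIME quoted in the commit message. */
#define NT_MAX_PRIME 11

/* Trial-division primality test; d <= n / d avoids overflow in d * d. */
bool is_prime(int n)
{
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

/* Assumption: a prime count above the threshold is reduced by one thread,
   which always gives an even (hence factorizable) number. Counts at or
   below the threshold, and composite counts, are left unchanged. */
int adjust_prime_num_threads(int nt)
{
    if (is_prime(nt) && nt > NT_MAX_PRIME) return nt - 1;
    return nt;
}
```

The motivation is that a prime thread count cannot be split across two loops except as 1 x nt, whereas a count such as 12 can be factored as 3 x 4 or 2 x 6.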
Field G. Van Zee	09bd4f4f12	Add err_t* "return" parameter to malloc functions. Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0.	2021-03-31 17:09:36 -05:00
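The out-parameter pattern that the commit above introduces can be sketched as follows. All names here (`sketch_err_t`, `sketch_malloc`, `sketch_is_success`) are hypothetical and chosen to avoid implying anything about the real err_t values or BLIS signatures. The sketch also goes one step further than the commit, which explicitly adds no error checking yet, by actually setting the error code.

```c
#include <stdlib.h>

/* Hypothetical error type; the real err_t enum in BLIS has different
   members and values. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_MALLOC_FAILED = 1
} sketch_err_t;

/* The return value is already taken by the allocated address, so the
   status is reported through the r_val out-parameter instead. */
void *sketch_malloc(size_t size, sketch_err_t *r_val)
{
    void *p = malloc(size);
    /* malloc(0) may legitimately return NULL, so only a NULL result for a
       non-zero request is treated as a failure. */
    *r_val = (p == NULL && size != 0) ? SKETCH_MALLOC_FAILED : SKETCH_SUCCESS;
    return p;
}

/* Small predicate in the spirit of bli_is_success(). */
int sketch_is_success(sketch_err_t e)
{
    return e == SKETCH_SUCCESS;
}
```

A caller declares a `sketch_err_t` variable, passes its address, and checks it with `sketch_is_success()` before using the returned pointer.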
Field G. Van Zee	3a6f41afb8	Renamed membrk files/vars/functions to pba. Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates.	2021-03-27 17:22:14 -05:00
Field G. Van Zee	36cb4116d1	Switch allocator mutexes to static initialization. Details: - Switched the small block allocator (sba), as defined in bli_sba.c and bli_apool.c, to static initialization of its internal mutex. Did a similar thing for the packing block allocator (pba), which appears as global_membrk in bli_membrk.c. - Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex() to ensure they won't be used in the future. - In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp blocks guarded by BLIS_USE_PTHREAD_MUTEX.	2021-03-27 15:15:09 -05:00
Field G. Van Zee	a4b73de84c	Disabled _self() and _equal() in bli_pthread API. Details: - Disabled the _self() and _equal() extensions to the bli_pthread API introduced in d479654. These functions were disabled after I realized that they aren't actually needed yet. Thanks to Devin Matthews for helping me reason through the appropriate consumer code that will appear in BLIS (eventually) in a future commit. (Also, I could never get the Windows branch to link properly in clang builds in AppVeyor. See the comment I left in the code, and #485, for more info.)	2021-03-12 19:47:39 -06:00

165 Commits