mirror of
https://github.com/amd/blis.git
synced 2026-04-19 23:28:52 +00:00
Data Race and Barrier Fixes
* Replaced omp barrier with bli_thread_barrier and added similar fix fo… (#248) * Replaced omp barrier with bli_thread_barrier and added similar fix for pack compute routine ## **The Sequence is Critical:** 1. **All threads execute `func(...)`** - may access shared resources including communicators 2. **Barrier ** - ensures ALL threads finish their work 3. **Then each thread calls `bli_l3_sup_thrinfo_free()`** - only chief actually frees gl_comm 4. **Safe cleanup** - no use-after-free because all threads are done using gl_comm ## **What Would Happen Without the Barrier:** ```c // Thread execution timeline WITHOUT barrier: Time 1: Thread 0 finishes func() early → immediately calls bli_l3_sup_thrinfo_free() Time 2: Thread 0 (chief) frees gl_comm Time 3: Thread 1,2,3... still in func() → try to use freed gl_comm → CRASH! ``` ## **With the Barrier (Current Safe Code):** ```c // Thread execution timeline WITH barrier: Time 1: All threads finish func() at different times Time 2: ALL threads reach barrier → wait for slowest thread Time 3: ALL threads proceed past barrier together Time 4: Chief frees gl_comm → safe because no one using it anymore ``` * Fixed Data Race in Native code-path (#251) ```c _Pragma( "omp parallel num_threads(n_threads)" ) { // ... thread work ... // Free the current thread's thrinfo_t structure. bli_l3_thrinfo_free( rntm_p, thread ); // Line 183 } // *** MISSING BARRIER HERE! *** // Check the array_t back into the small block allocator... bli_sba_checkin_array( array ); // Line 200 ``` ```c // DANGEROUS execution timeline: Thread 0 (chief): completes func() calls bli_l3_cntl_free() calls bli_l3_thrinfo_free() → frees gl_comm ✓ exits OpenMP parallel region calls bli_sba_checkin_array(array) → frees array ✗ Thread 1,2,3 (still executing): still in func() or bli_l3_cntl_free() trying to access freed gl_comm → CRASH! trying to access freed array pools → CRASH! ``` This is **exactly the same issue** that PR #702 fixed in other files! The function needs a barrier before threads exit the parallel region to ensure: 1. **All threads complete their work** before any cleanup starts 2. **Global communicator isn't freed** while other threads are using it 3. **Array pools aren't freed** while other threads are accessing them
This commit is contained in:
@@ -112,7 +112,12 @@ void bli_l3_compute_thread_decorator
|
||||
thread
|
||||
);
|
||||
|
||||
// Free the current thread's thrinfo_t structure.
|
||||
// NOTE: The barrier here is very important as it prevents memory being
|
||||
// released by the chief of some thread sub-group before its peers are done
|
||||
// using it. See PR #702 for more info [1].
|
||||
// [1] https://github.com/flame/blis/pull/702
|
||||
bli_thread_barrier( thread );
|
||||
// Free the current thread's thrinfo_t structure
|
||||
bli_l3_sup_thrinfo_free( rntm_p, thread );
|
||||
}
|
||||
|
||||
|
||||
@@ -179,6 +179,13 @@ void bli_l3_thread_decorator
|
||||
#ifdef PRINT_THRINFO
|
||||
threads[tid] = thread;
|
||||
#else
|
||||
|
||||
// NOTE: The barrier here is very important as it prevents memory being
|
||||
// released by the chief of some thread sub-group before its peers are done
|
||||
// using it. See PR #702 for more info [1].
|
||||
// [1] https://github.com/flame/blis/pull/702
|
||||
bli_thread_barrier( thread );
|
||||
|
||||
// Free the current thread's thrinfo_t structure.
|
||||
bli_l3_thrinfo_free( rntm_p, thread );
|
||||
#endif
|
||||
|
||||
@@ -121,8 +121,11 @@ err_t bli_l3_sup_thread_decorator
|
||||
thread
|
||||
);
|
||||
|
||||
/* Synchronize all threads */
|
||||
#pragma omp barrier
|
||||
// NOTE: The barrier here is very important as it prevents memory being
|
||||
// released by the chief of some thread sub-group before its peers are done
|
||||
// using it. See PR #702 for more info [1].
|
||||
// [1] https://github.com/flame/blis/pull/702
|
||||
bli_thread_barrier( thread );
|
||||
// Free the current thread's thrinfo_t structure.
|
||||
bli_l3_sup_thrinfo_free( rntm_p, thread );
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user