From 5992d2652bb7abd89f526dd2988162a29bcf32b3 Mon Sep 17 00:00:00 2001
From: Kawrakow
Date: Wed, 24 Jul 2024 07:57:47 +0200
Subject: [PATCH] ggml: thread synchronization on Arm

For x86, slaren was generous enough to add _mm_pause() to the busy
spin-wait loop in ggml_barrier(), but everything else just busy spins,
loading an atomic int on every iteration and thus forcing cache sync
between the cores. This results in a massive drop in performance on my
M2-Max laptop when using 8 threads. The closest approximation to
_mm_pause() on Arm seems to be

    __asm__ __volatile__("isb\n");

After adding this to the busy spin loop, performance for 8 threads
recovers to the expected levels.
---
 ggml.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ggml.c b/ggml.c
index 9a83059d..0f68cea3 100644
--- a/ggml.c
+++ b/ggml.c
@@ -19142,6 +19142,8 @@ static void ggml_barrier(struct ggml_compute_state * state) {
             }
         #if defined(__SSE3__)
             _mm_pause();
+        #elif defined __ARM_NEON
+            __asm__ __volatile__("isb\n");
         #endif
         }
         sched_yield();
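
As a rough illustration of the spin-wait hint described in the commit message, the C sketch below wraps the per-architecture hint in a helper and calls it from a toy spinning barrier. The names cpu_relax, spin_barrier, and spin_barrier_wait are hypothetical and not part of ggml. The barrier is a simplified generation-counter design, not the actual ggml_barrier() implementation, and the x86 guard used here is an assumption that is broader than the __SSE3__ check in the patch.

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__SSE2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: tell the CPU we are inside a spin-wait loop.
 * The Arm branch mirrors the instruction added by the patch. */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__SSE2__)
    _mm_pause();
#elif defined(__aarch64__) || defined(__ARM_NEON)
    __asm__ __volatile__("isb\n");
#else
    /* No known hint for this target: fall back to a plain busy spin. */
#endif
}

/* Hypothetical toy barrier, not the ggml data structure. */
typedef struct {
    atomic_int n_arrived;   /* threads that have reached the barrier   */
    atomic_int generation;  /* bumped each time the barrier releases   */
    int        n_threads;   /* number of participating threads         */
} spin_barrier;

static void spin_barrier_wait(spin_barrier *b) {
    assert(b != NULL && b->n_threads > 0);
    if (b->n_threads == 1) {
        return; /* nothing to wait for */
    }

    /* Remember the current round before announcing our arrival. */
    int gen = atomic_load(&b->generation);

    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        /* Last thread to arrive: reset the counter, then release everyone. */
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->generation, 1);
        return;
    }

    /* Everyone else spins until the round changes. Each iteration still
     * loads the shared atomic, but the hint slows the loop down so the
     * cores hammer the shared cache line less often. */
    while (atomic_load(&b->generation) == gen) {
        cpu_relax();
    }
}
```

Each worker thread would call spin_barrier_wait() on a shared spin_barrier whose n_threads matches the number of workers. Whether "isb" is the best hint on a given Arm core is something to measure; the commit message reports only the author's result on an M2-Max.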