Better CPU prompt processing performance for SWA models (#696)

* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
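
For readers unfamiliar with the idea, here is a minimal sketch of what per-row "mask bounds" could look like for a sliding-window attention (SWA) mask. It is an illustration under stated assumptions, not the code from this commit: the helpers `make_swa_mask` and `compute_mask_bounds` are hypothetical, the mask is assumed to be row-major with `0.0f` for visible entries and `-INFINITY` for masked ones, and bounds are assumed to be half-open `[first, last)` ranges of KV indices. The commit message says the bounds are computed while the mask is being created; the sketch scans a finished mask only to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from the commit): build an n_q x n_kv causal
// sliding-window mask, row-major. 0.0f = attend, -INFINITY = masked.
// Assumption: query row i sits at absolute position (n_kv - n_q + i).
static std::vector<float> make_swa_mask(int n_q, int n_kv, int window) {
    std::vector<float> mask((size_t)n_q * n_kv, -INFINITY);
    for (int i = 0; i < n_q; ++i) {
        const int pos = n_kv - n_q + i;
        for (int j = 0; j < n_kv; ++j) {
            // Visible if not in the future and within the last `window` positions.
            if (j <= pos && pos - j < window) {
                mask[(size_t)i * n_kv + j] = 0.0f;
            }
        }
    }
    return mask;
}

// Hypothetical helper (not from the commit): for each query row, find the
// half-open range [first, last) of KV indices that are not masked out.
// A fully masked row yields an empty range with first == last.
static std::vector<std::pair<int32_t, int32_t>> compute_mask_bounds(
        const std::vector<float> & mask, int n_q, int n_kv) {
    std::vector<std::pair<int32_t, int32_t>> bounds(n_q, {0, 0});
    for (int i = 0; i < n_q; ++i) {
        const float * row = mask.data() + (size_t)i * n_kv;
        int first = 0;
        while (first < n_kv && std::isinf(row[first])) ++first;
        int last = n_kv;
        while (last > first && std::isinf(row[last - 1])) --last;
        bounds[i] = {first, last};
    }
    return bounds;
}
```

With bounds like these, an attention kernel only needs to visit KV columns in `[first, last)` for each query row instead of the full KV length, which is plausibly where a prompt-processing speedup for SWA models would come from.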
2026-04-30 19:31:48 +00:00 · 2025-08-17 10:30:27 +03:00
parent 4bf5c8184b
commit 93a4f6089f
5 changed files with 140 additions and 30 deletions
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2043,6 +2043,10 @@ extern "C" {
            struct ggml_tensor * a,
            struct ggml_tensor * sinks);

+    GGML_API void ggml_flash_attn_ext_add_bounds(
+            struct ggml_tensor * a,
+            struct ggml_tensor * bounds);
+
    // TODO: needs to be adapted to ggml_flash_attn_ext
    GGML_API struct ggml_tensor * ggml_flash_attn_back(
           struct ggml_context * ctx,