Reduce size of compute buffers (#237)

* This reduces compute buffer size for MLA

* This should accomplish the same for standard attention

* Much better

* Better concat for contiguous tensors

If all the op does is concatenate the second tensor onto the first,
why would we want a loop?
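A minimal sketch of the idea behind the concat change, not the actual ggml kernel
(the function name and signature here are illustrative): when both inputs are
contiguous and the concatenation axis makes the result a plain back-to-back layout
in memory, two memcpy calls replace the per-element loop.

#include <string.h>
#include <stddef.h>

/* Copy src0's bytes into dst, then append src1's bytes right after.
 * Valid whenever both sources are contiguous and the result is simply
 * the first tensor's data followed by the second's. */
static void concat_contiguous(void *dst,
                              const void *src0, size_t nbytes0,
                              const void *src1, size_t nbytes1) {
    memcpy(dst, src0, nbytes0);
    memcpy((char *)dst + nbytes0, src1, nbytes1);
}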

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author: Kawrakow
Date: 2025-03-01 08:25:27 +02:00
Committed by: GitHub
Parent commit: 472b4c37c1
This commit: e787c00141
7 changed files with 236 additions and 79 deletions


@@ -384,6 +384,7 @@ extern "C" {
     bool offload_kqv;       // whether to offload the KQV ops (including the KV cache) to GPU
     bool flash_attn;        // whether to use flash attention [EXPERIMENTAL]
     int  mla_attn;          // whether to use MLA attention [EXPERIMENTAL]
+    int  attn_max_batch;    // maximum batch size for attention computations [EXPERIMENTAL]
     bool fused_moe_up_gate; // whether to use fused MoE up/down op [EXPERIMENTAL]
     // Abort callback