Better PP performance with split mode "graph" and 3+ GPUs (#1069)

* This should do the trick for PP
* Command line option to set max. extra VRAM that the scheduler can use
* Fix bug and cleanup
* Looks like with this change it is working with tensor overrides
* Nah, it is not working
* OK, this seems to be working
* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-05-01 03:41:53 +00:00 · 2025-12-17 07:40:25 +01:00
parent 75de0528c3
commit 5585ac2aa8
6 changed files with 100 additions and 45 deletions
--- a/common/common.h
+++ b/common/common.h
@@ -167,6 +167,7 @@ struct gpt_params {
    float   yarn_beta_slow        =  -1.0f; // YaRN high correction dim
    int32_t yarn_orig_ctx         =     0; // YaRN original context length
    float   defrag_thold          = -1.0f; // KV cache defragmentation threshold
+    int32_t max_extra_alloc_MiB   = 256;   // additional VRAM per GPU the scheduler may allocate for more efficient compute graph evaluation

    ggml_backend_sched_eval_callback cb_eval = nullptr;
    void * cb_eval_user_data                 = nullptr;