Better PP performance with split mode "graph" and 3+ GPUs (#1069)

* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is contained in:
Kawrakow
2025-12-17 07:40:25 +01:00
committed by GitHub
parent 75de0528c3
commit 5585ac2aa8
6 changed files with 100 additions and 45 deletions

View File

@@ -212,6 +212,7 @@ extern "C" {
GGML_API void ggml_backend_sched_set_op_offload(ggml_backend_sched_t sched, enum ggml_op op, bool on_or_off);
GGML_API void ggml_backend_sched_set_only_active_experts(ggml_backend_sched_t sched, bool on_or_off);
GGML_API void ggml_backend_sched_set_split_mode_graph(ggml_backend_sched_t sched, bool on_or_off);
GGML_API void ggml_backend_sched_set_max_extra_alloc(ggml_backend_sched_t sched, int extra_alloc_MiB);
//
// Utils