Split mode "graph" for Cohere2 (#1061)

* This works and TG is decent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow
2025-12-13 20:30:08 +01:00
committed by GitHub
parent 5645be6cfc
commit f90d1fdd06
10 changed files with 211 additions and 107 deletions

@@ -1729,6 +1729,7 @@ static bool is_model_split_supported(const llama_model & model) {
         LLM_ARCH_QWEN3MOE,
         LLM_ARCH_GLM4_MOE,
         LLM_ARCH_MISTRAL3,
+        LLM_ARCH_COHERE2,
     };
     auto it = k_supported.find(model.arch);
     return it != k_supported.end();
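
The hunk above extends an allow-list that is checked with `find`. A minimal standalone sketch of that pattern follows. The `Arch` enum, the `is_split_supported` name, and the choice of `std::set` are assumptions made for illustration, since the hunk shows neither the container type nor the rest of the function.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the project's architecture enum. Only the
// entries visible in the hunk are mirrored, plus one unlisted value.
enum class Arch { Qwen3Moe, Glm4Moe, Mistral3, Cohere2, Unlisted };

// Returns true when the architecture is on the allow-list.
// std::set is an assumption; the real container type is not shown.
static bool is_split_supported(Arch arch) {
    static const std::set<Arch> k_supported = {
        Arch::Qwen3Moe,
        Arch::Glm4Moe,
        Arch::Mistral3,
        Arch::Cohere2,  // the kind of one-line addition this commit makes
    };
    auto it = k_supported.find(arch);
    return it != k_supported.end();
}

// Small self-check: listed architectures pass, unlisted ones do not.
static void self_check() {
    assert(is_split_supported(Arch::Cohere2));
    assert(!is_split_supported(Arch::Unlisted));
}
```

With this shape, enabling another architecture is a single added entry in the list, and callers of the check need no changes.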