Enable faster prompt processing with mainline llama.cpp GGUFs (#409)

* Enable MLA-3 in crippled GGUFs: WIP

* Enable MLA-3 in crippled GGUFs: seems to work

* Add newly created tensors to model.tensors_by_name

Otherwise they are not repacked at run time (see the sketch below).
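
A minimal C++ sketch of why this matters, not the actual ik_llama.cpp code: later passes such as run-time repacking walk the model.tensors_by_name list, so any tensor created on the fly during loading has to be appended to that list or the pass never sees it. The struct and helper names below are hypothetical; only the tensors_by_name member mirrors the commit message.

#include <string>
#include <utility>
#include <vector>

struct ggml_tensor;  // opaque here; defined in ggml.h

// Mirrors the shape of the model's tensors_by_name list: (name, tensor) pairs
// that run-time repacking and similar passes iterate over.
struct model_sketch {
    std::vector<std::pair<std::string, ggml_tensor *>> tensors_by_name;
};

// Hypothetical helper: every tensor built at load time (e.g. attention tensors
// recreated for MLA from a mainline GGUF) must be registered here as well,
// otherwise the repacking pass skips it.
static void register_created_tensor(model_sketch & model, const std::string & name, ggml_tensor * t) {
    model.tensors_by_name.emplace_back(name, t);
}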

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author:  Kawrakow (committed via GitHub)
Date:    2025-05-12 07:49:51 +03:00
Parent:  0c02e16a39
Commit:  2e585d4508
Changes: 3 changed files with 294 additions and 140 deletions


@@ -325,6 +325,7 @@ extern "C" {
struct llama_model_params {
    int32_t n_gpu_layers; // number of layers to store in VRAM
    int32_t mla;          // MLA implementation to use (only applicable to DeepSeek models at this point)
    enum llama_split_mode split_mode; // how to split the model across multiple GPUs
    // main_gpu interpretation depends on split_mode:
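
For context, a hedged usage sketch of the field shown in the hunk above: selecting an MLA implementation when loading a DeepSeek GGUF. The loader function names are the long-standing llama.cpp API, and the value 3 is an assumption matching the "MLA-3" wording in the commit message, not something taken from this diff.

#include "llama.h"

// Load a DeepSeek GGUF with an explicit MLA implementation selected.
static llama_model * load_deepseek(const char * path) {
    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = 99;  // offload everything; adjust to the hardware
    params.mla          = 3;   // assumed: 3 selects the MLA-3 path named in the commit message
    return llama_load_model_from_file(path, params);
}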