Merge Q, K, V (#878)

* POC: merge Q, K, V into a single, contiguous tensor

Done just for Qwen3-MoE, where I see a 4% uplift in TG (token generation).
The PP (prompt processing) gain is sub-percent, if any.
Still, it seems worth doing in general given the TG gain.
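
A rough sketch of the idea (hedged: ggml-style pseudo-graph, not the actual patch; `wqkv`, `n_embd_head`, `n_head`, `n_head_kv`, `cur` are assumed names): the three projection weights live back to back in one contiguous tensor, a single matmul produces the concatenated Q/K/V activations, and Q, K, V become cheap views into that result, which trims per-op overhead during token generation.

```cpp
// Hedged sketch of a fused QKV projection with ggml-style ops.
// Assumes wqkv holds Wq, Wk, Wv concatenated along the output dimension.
const int64_t n_embd_q = n_embd_head * n_head;     // rows of the Q projection
const int64_t n_embd_k = n_embd_head * n_head_kv;  // rows of the K projection
const int64_t n_embd_v = n_embd_head * n_head_kv;  // rows of the V projection

// One matmul instead of three: result is [n_embd_q + n_embd_k + n_embd_v, n_tokens]
struct ggml_tensor * qkv = ggml_mul_mat(ctx, model.layers[il].wqkv, cur);

// Q, K, V are views into the contiguous result; only the byte offset differs
struct ggml_tensor * Qcur = ggml_view_2d(ctx, qkv, n_embd_q, n_tokens, qkv->nb[1],
        0);
struct ggml_tensor * Kcur = ggml_view_2d(ctx, qkv, n_embd_k, n_tokens, qkv->nb[1],
        n_embd_q * ggml_element_size(qkv));
struct ggml_tensor * Vcur = ggml_view_2d(ctx, qkv, n_embd_v, n_tokens, qkv->nb[1],
        (n_embd_q + n_embd_k) * ggml_element_size(qkv));
```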

* WIP

* merge_qkv: it works for gpt-oss

...but we see a smaller TG gain (~1.5%)

* WIP

* Don't ignore the return value of create_tensors()

Otherwise, when Q, K, V get merged and we are running on the CPU,
we get a crash because the backend still tries to use mmap,
which no longer works.
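
Roughly, the contract (a hedged sketch; only `use_mmap_buffer` and `cth->create_tensors()` appear in the actual diff):

```cpp
// Hedged sketch: the caller must honor the return value of create_tensors().
// When merge_qkv writes Q, K, V into a freshly built contiguous tensor, the
// data no longer lives byte-for-byte in the GGUF file, so mmap-backed
// buffers cannot be used for those weights.
bool use_mmap_buffer = cth->create_tensors();  // false if tensors were rewritten

if (!use_mmap_buffer) {
    // fall back to a regular backend buffer and copy the (merged) data into it
}
```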

* merge_qkv: bias can be absent, optional, or mandatory
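
For illustration only (the enum and its values are hypothetical, not taken from the patch), the three cases look like this:

```cpp
// Hypothetical illustration: per-architecture handling of Q/K/V bias when
// deciding whether to concatenate bias vectors alongside the merged weights.
enum class qkv_bias : int {
    absent,     // architecture never has Q/K/V biases; nothing to merge
    optional,   // bias tensors may be missing; merge them only if present
    mandatory,  // bias tensors must exist; fail model loading if missing
};
```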

* merge_qkv: glm4.5moe

* merge_qkv: add command line argument to enable
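
For illustration (the flag name is an assumption; the diff only shows the new `merge_qkv` field), the option ultimately just flips the new field in `llama_model_params` before loading:

```cpp
// Hedged sketch: a CLI flag such as --merge-qkv (name assumed) would map to
// the merge_qkv field added to llama_model_params in this change.
llama_model_params mparams = llama_model_default_params();
mparams.merge_qkv = true;  // opt in to fusing Q, K, V into one tensor at load time

llama_model * model = llama_load_model_from_file("model.gguf", mparams);
```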

* merge_qkv: fix tensor dimensions

* merge_qkv: llama-4

* merge_qkv: qwen3 (dense)

* merge_qkv: simplify build_qwen3moe

* cohere2 - simplify graph building

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author:    Kawrakow
Date:      2025-10-30 10:49:48 +02:00
Committer: GitHub
Parent:    92517e74ad
Commit:    56fc5454ff
10 changed files with 260 additions and 119 deletions


@@ -1684,7 +1684,7 @@ static bool llm_load_tensors(
         throw std::runtime_error("model has expert layers but no expert layers are used");
     }
-    cth->create_tensors();
+    use_mmap_buffer = cth->create_tensors();
     ml.done_getting_tensors();
@@ -1896,7 +1896,7 @@ static bool llm_load_tensors(
 static int llama_model_load(const std::string & fname, llama_model & model, llama_model_params & params) {
     try {
         llama_model_loader ml(fname, params.use_mmap, params.check_tensors,
-                              params.repack_tensors, params.use_thp, params.kv_overrides, params.tensor_buft_overrides);
+                              params.repack_tensors, params.use_thp, params.merge_qkv, params.kv_overrides, params.tensor_buft_overrides);
     model.hparams.vocab_only = params.vocab_only;
@@ -3788,6 +3788,7 @@ struct llama_model_params llama_model_default_params() {
         /*.repack_tensors  =*/ false,
         /*.use_thp         =*/ false,
         /*.validate_quants =*/ false,
+        /*.merge_qkv       =*/ false,
     };
 #ifdef GGML_USE_METAL