mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-12 23:10:01 +00:00
Merge Q, K, V (#878)
* POC: merge Q, K, V into a single, contiguous tensor. Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems to make sense to do this in general given the TG performance gain.
* WIP
* merge_qkv: it works for gpt-oss, but we see a smaller TG gain (~1.5%)
* WIP
* Don't ignore the return value of create_tensors(), else, when Q, K, V get merged and we are running on the CPU, we get a crash because the backend tries to use mmap, which no longer works.
* merge_qkv: bias can be required, optional, or mandatory
* merge_qkv: glm4.5moe
* merge_qkv: add command line argument to enable
* merge_qkv: fix tensor dimensions
* merge_qkv: llama-4
* merge_qkv: qwen3 (dense)
* merge_qkv: simplify build_qwen3moe
* cohere2 - simplify graph building
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@@ -382,6 +382,7 @@ extern "C" {
         bool repack_tensors;   // repack if available
         bool use_thp;          // use transparent huge pages (linux only)
         bool validate_quants;  // if true, check for NaNs while loading the model
+        bool merge_qkv;        // if true, merge separate Q, K, V tensors into a single, contiguous tensor
     };

     // NOTE: changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations