RoPE cache (#887)

* Introducing rope cache

When computing RoPE, the rotation angles are exactly the same in every layer:
they depend only on the token positions (and on other constant,
model-dependent parameters). So, I wonder, why don't we compute the angles
just once and then reuse them for the Q and K RoPE in each layer?

This commit does that as a POC on the CPU and uses it in the
Qwen3-MoE compute graph; a minimal sketch of the idea follows below.
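
A minimal, self-contained sketch of the idea (not this PR's actual code; the
RopeCache struct and its members are hypothetical, and only a NeoX-style
rotation is shown): the cos/sin table depends only on the token positions and
on constant model parameters, so it can be built once per batch and then
shared by the Q and K RoPE of every layer.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Hypothetical cache: precomputed {cos, sin} per (position, frequency pair).
    struct RopeCache {
        int head_dim = 0;              // rotary dimension; head_dim/2 frequency pairs
        std::vector<float> cos_sin;    // layout: [n_pos][head_dim/2][2]

        // Build once per batch: the angles depend only on positions and constants.
        void build(const std::vector<int32_t> & pos, int head_dim_, float freq_base) {
            head_dim = head_dim_;
            const int n_pairs = head_dim/2;
            cos_sin.resize(pos.size()*n_pairs*2);
            for (size_t p = 0; p < pos.size(); ++p) {
                for (int i = 0; i < n_pairs; ++i) {
                    const float theta = pos[p]*std::pow(freq_base, -2.0f*i/head_dim);
                    cos_sin[(p*n_pairs + i)*2 + 0] = std::cos(theta);
                    cos_sin[(p*n_pairs + i)*2 + 1] = std::sin(theta);
                }
            }
        }

        // NeoX-style rotation of one head at position index p; the same table is
        // reused for Q and K in every layer instead of recomputing cos/sin each time.
        void apply(float * x, size_t p) const {
            const int n_pairs = head_dim/2;
            for (int i = 0; i < n_pairs; ++i) {
                const float c  = cos_sin[(p*n_pairs + i)*2 + 0];
                const float s  = cos_sin[(p*n_pairs + i)*2 + 1];
                const float x0 = x[i];
                const float x1 = x[i + n_pairs];
                x[i]           = x0*c - x1*s;
                x[i + n_pairs] = x0*s + x1*c;
            }
        }
    };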

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Authored by Kawrakow, committed by GitHub on 2025-11-03 18:42:20 +02:00
parent 846e736e85, commit fb0d5a995c
12 changed files with 1002 additions and 72 deletions


@@ -427,6 +427,7 @@ extern "C" {
bool grouped_expert_routing; // whether to use grouped expert routing (BailingMoeV2 arch)
bool fused_up_gate; // whether to use fused up/gate op [EXPERIMENTAL]
bool fused_mmad; // whether to use fused mul+multi_add op [EXPERIMENTAL]
bool rope_cache; // whether to use RoPE cache [EXPERIMENTAL]
int min_experts;
float thresh_experts;
bool only_active_experts;
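
Hedged usage sketch (assumptions: the hunk above extends llama_context_params,
whose name is not visible in the diff, and llama_context_default_params()
exists as in upstream llama.cpp):

    // Enable the experimental RoPE cache when creating a context.
    llama_context_params cparams = llama_context_default_params();
    cparams.rope_cache = true;   // new [EXPERIMENTAL] flag added by this commit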