RoPE cache (#887)

* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same; they depend only on the token positions
(and other constant, model-dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse them for the Q and K RoPE in each layer?

This commit does that as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Commit: fb0d5a995c (parent 846e736e85)
Author: Kawrakow, committed by GitHub
Date: 2025-11-03 18:42:20 +02:00
12 changed files with 1002 additions and 72 deletions


@@ -639,6 +639,8 @@ extern "C" {
         GGML_OP_SOFT_MAX_BACK,
         GGML_OP_ROPE,
         GGML_OP_ROPE_BACK,
+        GGML_OP_ROPE_CACHE,
+        GGML_OP_ROPE_FAST,
         GGML_OP_CLAMP,
         GGML_OP_CONV_TRANSPOSE_1D,
         GGML_OP_IM2COL,
@@ -2020,6 +2022,26 @@ extern "C" {
            float                 beta_fast,
            float                 beta_slow);

+    GGML_API struct ggml_tensor * ggml_rope_cache(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * b,
+            struct ggml_tensor  * c,
+            int                   ne0,
+            int                   n_dims,
+            int                   mode,
+            int                   n_ctx_orig,
+            float                 freq_base,
+            float                 freq_scale,
+            float                 ext_factor,
+            float                 attn_factor,
+            float                 beta_fast,
+            float                 beta_slow);
+
+    GGML_API struct ggml_tensor * ggml_rope_fast(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * a,
+            struct ggml_tensor  * b);
+
     // clamp
     // in-place, returns view(a)
     GGML_API struct ggml_tensor * ggml_clamp(
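Note that `ggml_rope_cache` takes only the positions tensor (`b`) and the optional frequency-factors tensor (`c`), with no activation input — consistent with the commit message: the angles are computed once from positions and constant parameters, and `ggml_rope_fast(ctx, a, b)` then applies the cached angles (`b`) to an activation tensor (`a`). A hypothetical graph-building sketch, not taken from this diff (tensor names, parameter values, and the mode constant are illustrative assumptions):

```c
// Sketch only: assumes ctx, inp_pos, freq_factors, and the usual RoPE
// hyperparameters are already set up as in an ordinary compute graph.

// Build the angle cache once per graph...
struct ggml_tensor * rope = ggml_rope_cache(ctx,
        inp_pos,       // token positions (tensor b)
        freq_factors,  // optional per-dimension frequency factors (tensor c)
        head_dim,      // ne0
        n_rot,         // n_dims
        rope_mode,     // must be neox or norm, per the commit message
        n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);

// ...then reuse it for both Q and K in every layer, instead of
// recomputing the angles with ggml_rope_ext each time.
for (int il = 0; il < n_layer; ++il) {
    cur_q[il] = ggml_rope_fast(ctx, cur_q[il], rope);
    cur_k[il] = ggml_rope_fast(ctx, cur_k[il], rope);
}
```

The fused `rms+rms+rope+rope` bullets above suggest the backend can additionally collapse these per-layer applications (together with the preceding Q/K norms) into single kernels, but that fusion happens below this graph-level API.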