
🔀 #66 - CUDA non-contiguous RoPE

Author ikawrakow
State Closed
Created 2024-09-28
Updated 2024-09-28

Description

In this way we can avoid the Q, K, V copies that are made after the multiplication with the fused QKV tensor in, e.g., Phi-3.5-mini (see #65 for details). This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080), and a 2-3% gain on Metal (M2-Max GPU).
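To make "non-contiguous RoPE" concrete, here is a minimal illustrative CPU sketch (the PR itself changes the CUDA and Metal kernels; all names below are hypothetical): the rotation locates each row through explicit byte strides instead of assuming a densely packed layout, so it can run directly on a view into a larger tensor.

```cpp
// Illustrative CPU sketch only -- the PR's actual change is in the CUDA/Metal
// RoPE kernels. All names here are hypothetical.
#include <cmath>
#include <cstddef>

// Apply the standard pairwise RoPE rotation to an f32 tensor addressed via
// ggml-style byte strides. Because each row is located by explicit strides,
// the data may be a non-contiguous view (e.g. the Q slice of a fused QKV tensor).
void rope_noncontig_f32(char * data, int n_dims, int n_head, int n_tokens,
                        size_t nb1,        // byte stride between heads
                        size_t nb2,        // byte stride between tokens
                        const int * pos, float freq_base) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int h = 0; h < n_head; ++h) {
            float * x = (float *)(data + (size_t)t*nb2 + (size_t)h*nb1);
            for (int i = 0; i < n_dims; i += 2) {
                const float theta = pos[t] * std::pow(freq_base, -float(i)/n_dims);
                const float c = std::cos(theta), s = std::sin(theta);
                const float x0 = x[i], x1 = x[i + 1];
                x[i]     = x0*c - x1*s;
                x[i + 1] = x0*s + x1*c;
            }
        }
    }
}
```

A contiguous tensor is just the special case nb1 == n_dims*sizeof(float), nb2 == n_head*nb1; a view into a fused QKV result simply passes larger strides, so no copy is needed before the rotation.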

Here is the combined effect of this PR and PR #65 on CUDA (RTX-4080) and Metal (M2-Max 30-core GPU) for Phi-3.5-mini:

| model | backend | ngl | threads | test | t/s (llama.cpp) | t/s (this PR) | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| phi3 3B F16 | Metal | 100 | 4 | pp512 | 1003.22 ± 1.31 | 1063.84 ± 0.63 | 1.060 |
| phi3 3B F16 | Metal | 100 | 4 | tg128 | 39.32 ± 0.07 | 41.70 ± 0.06 | 1.061 |
| phi3 3B F16 | CUDA | 100 | 1 | pp512 | 11280.47 ± 26.75 | 13770.42 ± 84.46 | 1.221 |
| phi3 3B F16 | CUDA | 100 | 1 | tg128 | 79.84 ± 0.03 | 81.50 ± 0.02 | 1.021 |

💬 Conversation

👤 ikawrakow commented on 2024-09-28 at 12:42:05:

So, I see that there are a lot of models that can potentially benefit from this PR as the pattern

```
qkv = ggml_mul_mat(...);
Q   = ggml_cont(..., qkv, ...);
K   = ggml_cont(..., qkv, ...);
V   = ggml_cont(..., qkv, ...);
```

is quite common in llama.cpp. But replacing the copies that make Q, K, and V contiguous with appropriate views requires testing (it is easy to screw things up), and I don't feel like fetching N models and trying them all at this point. So, for now, only Phi-3(.5) benefits.
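For illustration, the view-based replacement described above might look roughly like the sketch below. It uses `ggml_view_3d` and `ggml_rope_ext` as they appear in mainline llama.cpp's Phi-3 build code, with llama.cpp-style variable names; it is not code from this PR, and the exact strides and offsets are model-dependent, which is exactly why each model needs testing.

```cpp
// Sketch only: llama.cpp-style names; strides/offsets are model-dependent.
// One matmul produces fused rows laid out as [ Q | K | V ].
struct ggml_tensor * qkv = ggml_mul_mat(ctx0, model.layers[il].wqkv, cur);

// Non-contiguous views into the fused result -- no data is copied.
struct ggml_tensor * Qcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head,    n_tokens,
        n_embd_head*sizeof(float), qkv->nb[1], 0);
struct ggml_tensor * Kcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens,
        n_embd_head*sizeof(float), qkv->nb[1], n_embd*sizeof(float));
struct ggml_tensor * Vcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens,
        n_embd_head*sizeof(float), qkv->nb[1], (n_embd + n_embd_gqa)*sizeof(float));

// With non-contiguous RoPE, Q and K views are rotated directly; previously each
// had to be wrapped in ggml_cont(ctx0, ...) first. (V is not roped.)
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, nullptr, n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
```

The views cost nothing at graph-build time; the measured gain comes from the `ggml_cont` copy kernels that no longer run.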