🔀 #66 - CUDA non-contiguous RoPE
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-09-28 |
| Updated | 2024-09-28 |
Description
By letting the CUDA RoPE kernel operate on non-contiguous tensors, we can avoid the Q, K and V copies that are made after the multiplication with the fused QKV tensor in, e.g., Phi-3.5-mini (see #65 for details). This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080). There is also a 2-3% gain on Metal (M2-Max GPU).
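Conceptually, the kernel-side change amounts to addressing the data through the tensor's byte strides instead of assuming densely packed rows. Below is a minimal sketch of that idea for a NeoX-style rotation; it is illustrative only, not the kernel from this PR, and all names, the launch geometry, and the F32 data type are assumptions:

```cuda
// Sketch: NeoX-style RoPE that walks the data via byte strides (nb1 between
// heads, nb2 between tokens), so it also works on a non-contiguous view
// into a fused QKV tensor. Illustrative only -- not the kernel in this PR.
static __global__ void rope_neox_strided(
        const char * __restrict__ src, char * __restrict__ dst,
        const int  * __restrict__ pos,            // position id per token
        const int ne0,                            // head dimension (fully rotated here)
        const size_t s_nb1, const size_t s_nb2,   // src strides: head, token
        const size_t d_nb1, const size_t d_nb2,   // dst strides: head, token
        const float theta_scale) {                // e.g. powf(freq_base, -2.0f/ne0)
    const int i    = blockIdx.x*blockDim.x + threadIdx.x;  // rotation pair index
    const int head = blockIdx.y;
    const int tok  = blockIdx.z;
    if (2*i >= ne0) return;

    // Stride-based addressing: no contiguity assumption on rows.
    const float * x = (const float *)(src + head*s_nb1 + tok*s_nb2);
    float       * y = (float       *)(dst + head*d_nb1 + tok*d_nb2);

    const float theta = pos[tok]*powf(theta_scale, i);
    const float c = cosf(theta), s = sinf(theta);

    const float x0 = x[i], x1 = x[i + ne0/2];
    y[i]         = x0*c - x1*s;
    y[i + ne0/2] = x0*s + x1*c;
}
```

Once the kernel takes explicit strides, RoPE can consume a plain view of the fused QKV tensor, and each `ggml_cont` copy, an extra kernel launch plus a full read/write pass over the activations, disappears.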
Here is the combined effect of this PR and PR #65 on CUDA (RTX-4080) and Metal (M2-Max 30-core GPU) for Phi-3.5-mini:
| model | backend | ngl | threads | test | t/s (llama.cpp) | t/s (this PR + #65) | Speedup |
|---|---|---|---|---|---|---|---|
| phi3 3B F16 | Metal | 100 | 4 | pp512 | 1003.22 ± 1.31 | 1063.84 ± 0.63 | 1.060 |
| phi3 3B F16 | Metal | 100 | 4 | tg128 | 39.32 ± 0.07 | 41.70 ± 0.06 | 1.061 |
| phi3 3B F16 | CUDA | 100 | 1 | pp512 | 11280.47 ± 26.75 | 13770.42 ± 84.46 | 1.221 |
| phi3 3B F16 | CUDA | 100 | 1 | tg128 | 79.84 ± 0.03 | 81.50 ± 0.02 | 1.021 |
💬 Conversation
👤 ikawrakow commented on 2024-09-28 at 12:42:05:
So, I see that there are a lot of models that can potentially benefit from this PR, as the pattern
```c
qkv = ggml_mul_mat(...);          // fused QKV projection
Q   = ggml_cont(..., qkv, ...);   // copy to make the Q slice contiguous
K   = ggml_cont(..., qkv, ...);   // copy to make the K slice contiguous
V   = ggml_cont(..., qkv, ...);   // copy to make the V slice contiguous
```
is quite common in llama.cpp. But replacing the copies that make Q, K and V contiguous with appropriate views requires testing (it is easy to screw things up), and I don't feel like fetching N models and trying them at this point. So, for now, only Phi-3(.5) benefits.
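For illustration, here is a sketch of what the view-based replacement could look like with the ggml API, loosely modeled on the usual Phi-3-style build code. Names such as `wqkv`, `inp_pos`, `n_rot`, and the `GGML_ROPE_TYPE_NEOX` mode are assumptions following llama.cpp conventions, not the exact code in this PR:

```c
// Sketch: plain views into the fused QKV projection instead of ggml_cont
// copies. Shapes follow the usual llama.cpp conventions for a Phi-3-style
// model: qkv is [n_embd + 2*n_embd_gqa, n_tokens], activations are F32.
struct ggml_tensor * qkv = ggml_mul_mat(ctx, model.layers[il].wqkv, cur);

struct ggml_tensor * Qcur = ggml_view_3d(ctx, qkv,
        n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float),  // nb1: stride between heads
        qkv->nb[1],                 // nb2: stride between tokens -> non-contiguous
        0);                         // Q sits at the start of each fused row
struct ggml_tensor * Kcur = ggml_view_3d(ctx, qkv,
        n_embd_head, n_head_kv, n_tokens,
        n_embd_head*sizeof(float),
        qkv->nb[1],
        n_embd*sizeof(float));      // K follows the Q block
struct ggml_tensor * Vcur = ggml_view_2d(ctx, qkv,
        n_embd_gqa, n_tokens,
        qkv->nb[1],
        (n_embd + n_embd_gqa)*sizeof(float));  // V gets no RoPE anyway

// With non-contiguous RoPE support, these no longer need a ggml_cont first:
Qcur = ggml_rope(ctx, Qcur, inp_pos, n_rot, GGML_ROPE_TYPE_NEOX);
Kcur = ggml_rope(ctx, Kcur, inp_pos, n_rot, GGML_ROPE_TYPE_NEOX);
```

Each removed `ggml_cont` is one fewer kernel launch and one fewer full pass over the activations, which is where the savings measured in the table above come from.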