
🔀 #264 - Make Q8_0 KV cache work with FlashMLA-2 on CUDA

Author ikawrakow
State Closed
Created 2025-03-18
Updated 2025-03-18

Description

For DeepSeek-V3/R1 this reduces KV cache size by ~2 GiB for a context of 65k tokens.
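As a rough sanity check (the cache layout assumed here is not stated in the PR: 512 latent plus 64 RoPE dimensions per token per layer over DeepSeek-V3/R1's 61 layers, with fp16 at 2 bytes/value and Q8_0 at 34/32 ≈ 1.0625 bytes/value):

$$(512+64)\cdot 61\cdot 65536\cdot(2-1.0625)\ \mathrm{bytes}\approx 2.0\ \mathrm{GiB}$$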

Using

-amb 512 -mla 2 -fa -ctk q8_0

one should now be able to use a 65k-token context on a single 24 GB GPU, with the GPU handling all attention calculations and all tensors other than the MoE experts offloaded to it. See PR #260 for the meaning and effect of the -amb command line option.
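For example, a hypothetical full invocation (the binary name, model path, and -c 65536 context setting are illustrative and not part of this PR):

./llama-server -m /path/to/DeepSeek-R1.gguf -c 65536 -amb 512 -mla 2 -fa -ctk q8_0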

There is still an issue with one or more of the GGML_OP_REPEAT, GGML_OP_CONCAT, GGML_OP_CPY operations on CUDA, which are required to implement the entire attention computation using quantized tensors, so this PR takes the pragmatic approach of computing the attention operations with fp16 on CUDA. The downside is that fp16 will also be used on the CPU if the code was built with CUDA enabled, which is slower than using Q8_0 directly, and the performance gap grows with context length.
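A minimal sketch of the resulting behaviour (this is not the PR's code; the function and type names are hypothetical): because the choice is made at build time, a CUDA-enabled build falls back to fp16 for the attention ops even for the parts that end up running on the CPU.

```cpp
// Illustrative sketch only, not the actual implementation in this PR.
#include <cstdio>

enum class CacheType { F16, Q8_0 };

// With this PR, the attention ops of FlashMLA-2 are computed in fp16 on CUDA
// builds, because the quantized REPEAT/CONCAT/CPY paths do not work there yet.
// A CPU-only build can keep operating on the Q8_0 cache directly.
CacheType attention_compute_type(CacheType cache_type, bool built_with_cuda) {
    if (cache_type == CacheType::Q8_0 && built_with_cuda) {
        return CacheType::F16;  // pragmatic fallback described above
    }
    return cache_type;          // otherwise use the cache type as-is
}

int main() {
    std::printf("CUDA build: %s\n",
        attention_compute_type(CacheType::Q8_0, true)  == CacheType::F16 ? "fp16" : "q8_0");
    std::printf("CPU build : %s\n",
        attention_compute_type(CacheType::Q8_0, false) == CacheType::F16 ? "fp16" : "q8_0");
    return 0;
}
```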