🗣️ #586 - Slow KV cache rm operation
| Author | jneloexpirements |
|---|---|
| Created | 2025-07-05 |
| Updated | 2025-07-05 |
Description
Is this related to #451? I am running DeepSeek-R1-V3-0324-IQ4_K_R4 (ubergarm's Q4 quant). Token generation is decent: I have seen 12 t/s at zero context, dropping to around 66% of that as the context grows.
I use an Intel Xeon QYFS, 512 GB of DDR5-4800 RAM, and an RTX PRO 6000. I run the command below; for real use I change it from sweep-bench to server and add host/port.
```bash
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
--alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
--ctx-size 98304 \
-ctk q8_0 \
-mla 3 -fa \
-amb 8192 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--n-gpu-layers 63 \
-ot "blk\.[3-9]\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 8192 -b 8192 \
--parallel 1 \
--threads 57
```
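For the real (server) use case I launch the same configuration roughly like this; the binary path follows the sweep-bench path above, and the host/port values are placeholders rather than my actual setup:

```bash
# Same flags as the sweep-bench run above, served over HTTP.
# 127.0.0.1:8080 are placeholder values.
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
--model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
--alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
--ctx-size 98304 \
-ctk q8_0 \
-mla 3 -fa \
-amb 8192 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--n-gpu-layers 63 \
-ot "blk\.[3-9]\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 8192 -b 8192 \
--parallel 1 \
--threads 57 \
--host 127.0.0.1 --port 8080
```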
The sweep-bench command above puts VRAM usage at 90376 out of 97887 MiB.
```
....................................................................................................
llama_new_context_with_model: n_ctx = 98304
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 8192
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3499.90 MiB
llama_new_context_with_model: KV self size = 3499.88 MiB, c^KV (q8_0): 3499.88 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument
llama_new_context_with_model: CUDA0 compute buffer size = 20496.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 3296.09 MiB
llama_new_context_with_model: graph nodes = 4219
llama_new_context_with_model: graph splits = 104
```
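For what it's worth, regarding the `ggml_cuda_host_malloc` warning above: I have not dug into it, but assuming it is about page-locked host memory, I could check the locked-memory limit and try a run with pinned host buffers disabled. `GGML_CUDA_NO_PINNED` is the upstream llama.cpp switch; I am assuming it also applies to this fork:

```bash
# Locked-memory limit for this shell; may or may not matter for CUDA
# host pinning on this system, but it is cheap to check.
ulimit -l

# Assumption: the upstream llama.cpp GGML_CUDA_NO_PINNED env var also works
# here and makes ggml fall back to ordinary pageable host buffers.
# (Then re-run the sweep-bench command above in the same shell.)
export GGML_CUDA_NO_PINNED=1
```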
The raw PP numbers from sweep-bench look reasonable and not irregularly slow (in this example and in past runs):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 65.721 | 124.65 | 173.995 | 11.77 |
| 8192 | 2048 | 8192 | 69.385 | 118.07 | 190.416 | 10.76 |
| 8192 | 2048 | 16384 | 73.025 | 112.18 | 199.023 | 10.29 |
| 8192 | 2048 | 24576 | 76.688 | 106.82 | 204.607 | 10.01 |
| 8192 | 2048 | 32768 | 79.945 | 102.47 | 208.366 | 9.83 |
I can tolerate the TG but...
In real use, however, my workloads are RAG-heavy (feeding it long documents, then chatting about them for a while, plus web search), and I like to flip-flop between conversations. When I do, I have to wait 2-5 minutes for the KV cache removal.
```
INFO [ update_slots] kv cache rm [p0, end) | tid="125357154684928" timestamp=1751624758 id_slot=0 id_task=12104 p0=8410
INFO [ print_timings] prompt eval time = 128443.90 ms / 10172 tokens ( 12.63 ms per token, 79.19 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_prompt_processing=128443.905 n_prompt_tokens_processed=10172 t_token=12.627202615021627 n_tokens_second=79.19410422783393
INFO [ print_timings] generation eval time = 10688.65 ms / 122 runs ( 87.61 ms per token, 11.41 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_token_generation=10688.646 n_decoded=122 t_token=87.6118524590164 n_tokens_second=11.413980779230597
```
The KV removal took around 3 minutes, which is IMO too slow. Even though it is 8192 here, I have tried 4096, 2048, and other values; the KV removal is just as slow.
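For scale, a rough back-of-envelope, under my own unconfirmed assumption that most of the wait is re-processing the discarded prefix at the PP speeds measured above rather than the rm call itself:

```bash
# wait ≈ tokens_to_reprocess / S_PP, using the S_PP from the server log
# (~79 t/s) and roughly from the sweep-bench table (~110 t/s).
for n_tokens in 10172 32768 98304; do
  for s_pp in 79 110; do
    echo "$n_tokens tokens @ $s_pp t/s ≈ $(( n_tokens / s_pp )) s"
  done
done
```

At ~110 t/s, re-processing a few tens of thousands of tokens already lands in the multi-minute range I am seeing, which is part of why I am asking the last question below.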
- Does `ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument` have anything to do with that? How do I fix it?
- Is 60-120 S_PP (t/s) with 4096/8192 batches expected for systems that offload the dense layers to GPU and the experts to CPU?
- Is the KV removal operation tied to PP, or is it a separate thing?
Any help is appreciated so I can mitigate these before-generation slowdowns.