
🗣️ #586 - Slow KV cache rm operation

Author jneloexpirements
Created 2025-07-05
Updated 2025-07-05

Description

Is this related to #451? I am running DeepSeek-R1-V3-0324-IQ4_K_R4 (ubergarm's Q4 quant), and token generation is decent (I have seen 12 t/s at zero context, dropping to around 66% of that as the context fills).

I use an Intel Xeon QYFS, 512 GB of DDR5-4800 RAM, and an RTX PRO 6000. I run the command below; for real use I switch it from sweep-bench to server with --host/--port (roughly as sketched right after it).

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
    --model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 98304 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 8192 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.[3-9]\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 8192 -b 8192 \
    --parallel 1 \
    --threads 57
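
For reference, the server invocation is roughly the same command with the binary swapped and --host/--port added (the host/port values here are placeholders):

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 98304 -ctk q8_0 -mla 3 -fa -amb 8192 -fmoe \
    --temp 0.3 --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.[3-9]\.ffn_.*=CUDA0" -ot exps=CPU \
    -ub 8192 -b 8192 --parallel 1 --threads 57 \
    --host 127.0.0.1 --port 8080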

The sweep-bench command above puts VRAM usage at 90376 out of 97887 MiB.

....................................................................................................
llama_new_context_with_model: n_ctx      = 98304
llama_new_context_with_model: n_batch    = 8192
llama_new_context_with_model: n_ubatch   = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 8192
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =  3499.90 MiB
llama_new_context_with_model: KV self size  = 3499.88 MiB, c^KV (q8_0): 3499.88 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument
llama_new_context_with_model:      CUDA0 compute buffer size = 20496.03 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  3296.09 MiB
llama_new_context_with_model: graph nodes  = 4219
llama_new_context_with_model: graph splits = 104

The raw PP from sweep-bench seems fine and not irregularly slow (in this example and in past runs):

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 8192 | 2048 |     0 | 65.721 |   124.65 | 173.995 |    11.77 |
| 8192 | 2048 |  8192 | 69.385 |   118.07 | 190.416 |    10.76 |
| 8192 | 2048 | 16384 | 73.025 |   112.18 | 199.023 |    10.29 |
| 8192 | 2048 | 24576 | 76.688 |   106.82 | 204.607 |    10.01 |
| 8192 | 2048 | 32768 | 79.945 |   102.47 | 208.366 |     9.83 |

I can tolerate the TG but...

In real use cases, however, which are RAG-heavy (feeding it long documents, then chatting about them for a while, plus web search), and where I like to flip-flop between conversations, I have to wait 2-5 minutes for the KV cache removal.

INFO [            update_slots] kv cache rm [p0, end) | tid="125357154684928" timestamp=1751624758 id_slot=0 id_task=12104 p0=8410
INFO [           print_timings] prompt eval time     =  128443.90 ms / 10172 tokens (   12.63 ms per token,    79.19 tokens per second) |  timestamp=1751624830 id_slot=0 id_task=12104 t_prompt_processing=128443.905 n_prompt_tokens_processed=10172 t_token=12.627202615021627 n_tokens_second=79.19410422783393
INFO [           print_timings] generation eval time =   10688.65 ms /   122 runs   (   87.61 ms per token,    11.41 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_token_generation=10688.646 n_decoded=122 t_token=87.6118524590164 n_tokens_second=11.413980779230597

The time it took for the KV removal was around 3 minutes, which is IMO too slow. Even though -ub/-b is 8192 here, I have tried 4096, 2048, and other values (see the variants below), and the KV removal is just as slow.
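
For completeness, these are the kind of batch-size variants I swapped in for the "-ub 8192 -b 8192" pair above (illustrative, not an exhaustive list):

-ub 4096 -b 4096
-ub 2048 -b 2048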

  1. Does the "ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument" message have anything to do with this? How can I fix it? (A couple of checks I am considering are sketched after this list.)
  2. Is 60-120 t/s S_PP with a 4096/8192 batch expected for systems that offload the dense layers to the GPU and the experts to the CPU?
  3. Is the KV cache removal operation tied to PP, or is it a separate thing?
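
For question 1, in case it matters, these are the checks I am considering, assuming a Linux host and assuming ik_llama.cpp keeps upstream llama.cpp's GGML_CUDA_NO_PINNED environment variable (I have not verified that in this fork):

# check the locked-memory limit, in case it is what's capping pinned allocations
ulimit -l

# re-run the same command with pinned host memory disabled, to see whether the
# allocation failure (and possibly the slow KV cache rm) behaves differently
export GGML_CUDA_NO_PINNED=1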

Any help is appreciated so that I can mitigate these before-generation slowdowns.