
🗣️ #586 - Slow KV cache rm operation

Author jneloexpirements
Created 2025-07-05
Updated 2025-07-05

Description

Is this related to #451? I am running DeepSeek-R1-V3-0324-IQ4_K_R4 (ubergarm's Q4 quant), and token generation is decent (I have seen 12 t/s at zero context, dropping to around 66% of that as the context fills).

I use an Intel Xeon QYFS, 512 GB of DDR5-4800 RAM, and an RTX PRO 6000. I run the command below; for real use I switch it from sweep-bench to server with --host/--port (roughly as sketched right after it).

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
    --model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 98304 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 8192 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.[3-9]\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 8192 -b 8192 \
    --parallel 1 \
    --threads 57
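
For reference, the server invocation is roughly the same command with the binary swapped and --host/--port added (the host/port values here are placeholders):

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 98304 -ctk q8_0 -mla 3 -fa -amb 8192 -fmoe \
    --temp 0.3 --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.[3-9]\.ffn_.*=CUDA0" -ot exps=CPU \
    -ub 8192 -b 8192 --parallel 1 --threads 57 \
    --host 127.0.0.1 --port 8080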

The sweep-bench command above puts VRAM usage at 90376 out of 97887 MiB.

....................................................................................................
llama_new_context_with_model: n_ctx      = 98304
llama_new_context_with_model: n_batch    = 8192
llama_new_context_with_model: n_ubatch   = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 8192
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =  3499.90 MiB
llama_new_context_with_model: KV self size  = 3499.88 MiB, c^KV (q8_0): 3499.88 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument
llama_new_context_with_model:      CUDA0 compute buffer size = 20496.03 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  3296.09 MiB
llama_new_context_with_model: graph nodes  = 4219
llama_new_context_with_model: graph splits = 104

The raw PP from sweep-bench seems fine and not irregularly slow (in this example and in past runs):

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 8192 | 2048 |     0 | 65.721 |   124.65 | 173.995 |    11.77 |
| 8192 | 2048 |  8192 | 69.385 |   118.07 | 190.416 |    10.76 |
| 8192 | 2048 | 16384 | 73.025 |   112.18 | 199.023 |    10.29 |
| 8192 | 2048 | 24576 | 76.688 |   106.82 | 204.607 |    10.01 |
| 8192 | 2048 | 32768 | 79.945 |   102.47 | 208.366 |     9.83 |

I can tolerate the TG but...

In real use cases, however, which are RAG-heavy (feeding it long documents, then chatting about them for a while, plus web search), and where I like to flip-flop between conversations, I have to wait 2-5 minutes for the KV cache removal.

INFO [            update_slots] kv cache rm [p0, end) | tid="125357154684928" timestamp=1751624758 id_slot=0 id_task=12104 p0=8410
INFO [           print_timings] prompt eval time     =  128443.90 ms / 10172 tokens (   12.63 ms per token,    79.19 tokens per second) |  timestamp=1751624830 id_slot=0 id_task=12104 t_prompt_processing=128443.905 n_prompt_tokens_processed=10172 t_token=12.627202615021627 n_tokens_second=79.19410422783393
INFO [           print_timings] generation eval time =   10688.65 ms /   122 runs   (   87.61 ms per token,    11.41 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_token_generation=10688.646 n_decoded=122 t_token=87.6118524590164 n_tokens_second=11.413980779230597

The time it took for the KV removal was around 3 minutes, which is IMO too slow. Even though -ub/-b is 8192 here, I have tried 4096, 2048, and other values (see the variants below), and the KV removal is just as slow.
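
For completeness, these are the kind of batch-size variants I swapped in for the "-ub 8192 -b 8192" pair above (illustrative, not an exhaustive list):

-ub 4096 -b 4096
-ub 2048 -b 2048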

  1. Does the "ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument" message have anything to do with this? How can I fix it? (A couple of checks I am considering are sketched after this list.)
  2. Is 60-120 t/s S_PP with a 4096/8192 batch expected for systems that offload the dense layers to the GPU and the experts to the CPU?
  3. Is the KV cache removal operation tied to PP, or is it a separate thing?
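
For question 1, in case it matters, these are the checks I am considering, assuming a Linux host and assuming ik_llama.cpp keeps upstream llama.cpp's GGML_CUDA_NO_PINNED environment variable (I have not verified that in this fork):

# check the locked-memory limit, in case it is what's capping pinned allocations
ulimit -l

# re-run the same command with pinned host memory disabled, to see whether the
# allocation failure (and possibly the slow KV cache rm) behaves differently
export GGML_CUDA_NO_PINNED=1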

Any help is appreciated so that I can mitigate these before-generation slowdowns.