### 🐛 [#455](https://github.com/ikawrakow/ik_llama.cpp/issues/455) - Bug: KV cache is never reused in OpenAI-compatible Chat Completion API

| **Author** | `luzamm` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-24 |
| **Updated** | 2025-05-28 |

---

#### Description

### What happened?

I use the OpenAI-compatible Chat Completion API from both Open WebUI and SillyTavern, and the whole prompt is **always** re-evaluated from position p0=0 when I simply regenerate the last message. The log below shows one generation and two retries, so the answer was generated 3 times in total. Ideally the last 2 retries should have used the KV cache, since nothing in the prompt changed, but the cache was not used.

model: unsloth/DeepSeek-V3-0324-GGUF-UD
system prompt: You are a helpful assistant.
message1: Introduce AMD.
message2: Just tell me who is the CEO?

I regenerated message2's reply.

The Text Completion API and llama-server's built-in web server appear to work correctly; the cache was used there. I also tried llama.cpp, and it works correctly with both the Chat Completion and the Text Completion API.

llama.cpp info (**not** ik_llama.cpp):

```shell
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_CUDA_FA_ALL_QUANTS=ON
root@pve:~/llm/llama.cpp# ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 5474 (259469c4)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
```

### Name and Version

ik_llama.cpp build command:

```shell
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
```

version: 3712 (c7ecd4e2)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
root@pve:~/llm/ik_llama.cpp# ./build/bin/llama-server --alias unsloth/DeepSeek-R1-Q4_K_XL --model /mnt/pve/PE8110/llm/models/DeepSeek-V3-0324-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf -rtr --ctx-size 32768 -ctk q8_0 -mla 3 -fa -amb 512 -fmoe --n-gpu-layers 999 --override-tensor exps=CPU --parallel 1 --threads 60 --host 0.0.0.0 --port 5001
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
INFO [ main] build info | tid="137281198051328" timestamp=1748126804 build=3712 commit="c7ecd4e2"
INFO [ main] system info | tid="137281198051328" timestamp=1748126804 n_threads=60 n_threads_batch=-1 total_threads=128 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 7 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /mnt/pve/PE8110/llm/models/DeepSeek-V3-0324-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-V3-0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = Deepseek-V3-0324
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 256x20B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = DeepSeek V3 0324
llama_model_loader: - kv 11: general.base_model.0.version str = V3-0324
llama_model_loader: - kv 12: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 14: general.tags arr[str,4] = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv 15: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 16: deepseek2.block_count u32 = 61
llama_model_loader: - kv 17: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 18: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 19: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 20: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 21: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 22: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 23: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 24: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 25: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 27: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 28: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 30: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 32: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 33: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 34: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 36: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 37: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 38: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 39: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 40: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 41: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 42: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 43: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 45: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 49: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 50: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 51: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 53: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 54: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 55: general.quantization_version u32 = 2
llama_model_loader: - kv 56: general.file_type u32 = 15
llama_model_loader: - kv 57: quantize.imatrix.file str = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv 58: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv 59: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 60: quantize.imatrix.chunks_count i32 = 60
llama_model_loader: - kv 61: split.no u16 = 0
llama_model_loader: - kv 62: split.tensors.count i32 = 1086
llama_model_loader: - kv 63: split.count u16 = 8
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q4_K: 485 tensors
llama_model_loader: - type q5_K: 95 tensors
llama_model_loader: - type q6_K: 23 tensors
==========================================================================
Detected incompatible DeepSeek model.
Will try to fix, but there are no guarantees
*** Your prompt processing speed will be crippled ***
Consider making your own ik_llama.cpp compatible model
or ask the model provider to make one for you,
==========================================================================
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 671.026 B
llm_load_print_meta: model size = 357.623 GiB (4.578 BPW)
llm_load_print_meta: repeating layers = 356.429 GiB (4.575 BPW, 669.173 B parameters)
llm_load_print_meta: general.name = Deepseek-V3-0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 2 '<|▁pad▁|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 0.89 MiB
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 355712.00 MiB
llm_load_tensors: CUDA_Host buffer size = 497.11 MiB
llm_load_tensors: CUDA0 buffer size = 9996.68 MiB
....................................................................................................
============
llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
============
Repacked 174 tensors
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3425.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 176.01 MiB
llama_new_context_with_model: graph nodes = 8245
llama_new_context_with_model: graph splits = 118
INFO [ init] initializing slots | tid="137281198051328" timestamp=1748127054 n_slots=1
INFO [ init] new slot | tid="137281198051328" timestamp=1748127054 id_slot=0 n_ctx_slot=32768
INFO [ main] model loaded | tid="137281198051328" timestamp=1748127054
INFO [ main] chat template | tid="137281198051328" timestamp=1748127054 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
INFO [ main] HTTP server listening | tid="137281198051328" timestamp=1748127054 n_threads_http="127" port="5001" hostname="0.0.0.0"
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127054
INFO [ log_server_request] request | tid="136894792617984" timestamp=1748127109 remote_addr="192.168.123.99" remote_port=39142 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="136894775832576" timestamp=1748127145 remote_addr="192.168.123.99" remote_port=33258 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="136894801010688" timestamp=1748127169 remote_addr="192.168.123.99" remote_port=57604 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="137279920132096" timestamp=1748127207 remote_addr="192.168.123.99" remote_port=39902 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="137281198051328" timestamp=1748127207 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="137281198051328" timestamp=1748127207 id_slot=0 id_task=0 p0=0
INFO [ print_timings] prompt eval time = 1170.90 ms / 13 tokens ( 90.07 ms per token, 11.10 tokens per second) | tid="137281198051328" timestamp=1748127268 id_slot=0 id_task=0 t_prompt_processing=1170.897 n_prompt_tokens_processed=13 t_token=90.06899999999999 n_tokens_second=11.10259911845363
INFO [ print_timings] generation eval time = 59250.24 ms / 514 runs ( 115.27 ms per token, 8.68 tokens per second) | tid="137281198051328" timestamp=1748127268 id_slot=0 id_task=0 t_token_generation=59250.237 n_decoded=514 t_token=115.27283463035019 n_tokens_second=8.675070784948927
INFO [ print_timings] total time = 60421.13 ms | tid="137281198051328" timestamp=1748127268 id_slot=0 id_task=0 t_prompt_processing=1170.897 t_token_generation=59250.237 t_total=60421.134
INFO [ update_slots] slot released | tid="137281198051328" timestamp=1748127268 id_slot=0 id_task=0 n_ctx=32768 n_past=526 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127268
INFO [ log_server_request] request | tid="137279819341824" timestamp=1748127268 remote_addr="192.168.123.99" remote_port=39910 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127268
INFO [ log_server_request] request | tid="137279737688064" timestamp=1748127286 remote_addr="192.168.123.99" remote_port=43354 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="137281198051328" timestamp=1748127286 id_slot=0 id_task=516
INFO [ update_slots] kv cache rm [p0, end) | tid="137281198051328" timestamp=1748127286 id_slot=0 id_task=516 p0=0
INFO [ print_timings] prompt eval time = 6383.32 ms / 536 tokens ( 11.91 ms per token, 83.97 tokens per second) | tid="137281198051328" timestamp=1748127305 id_slot=0 id_task=516 t_prompt_processing=6383.325 n_prompt_tokens_processed=536 t_token=11.90918843283582 n_tokens_second=83.96877802712537
INFO [ print_timings] generation eval time = 12977.77 ms / 113 runs ( 114.85 ms per token, 8.71 tokens per second) | tid="137281198051328" timestamp=1748127305 id_slot=0 id_task=516 t_token_generation=12977.773 n_decoded=113 t_token=114.84754867256636 n_tokens_second=8.707194986381717
INFO [ print_timings] total time = 19361.10 ms | tid="137281198051328" timestamp=1748127305 id_slot=0 id_task=516 t_prompt_processing=6383.325 t_token_generation=12977.773 t_total=19361.097999999998
INFO [ update_slots] slot released | tid="137281198051328" timestamp=1748127305 id_slot=0 id_task=516 n_ctx=32768 n_past=648 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127305
INFO [ log_server_request] request | tid="137279729295360" timestamp=1748127305 remote_addr="192.168.123.99" remote_port=43366 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127305
INFO [ log_server_request] request | tid="137279720902656" timestamp=1748127309 remote_addr="192.168.123.99" remote_port=51502 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="137281198051328" timestamp=1748127309 id_slot=0 id_task=631
INFO [ update_slots] kv cache rm [p0, end) | tid="137281198051328" timestamp=1748127309 id_slot=0 id_task=631 p0=0
INFO [ print_timings] prompt eval time = 6326.97 ms / 536 tokens ( 11.80 ms per token, 84.72 tokens per second) | tid="137281198051328" timestamp=1748127329 id_slot=0 id_task=631 t_prompt_processing=6326.966 n_prompt_tokens_processed=536 t_token=11.80404104477612 n_tokens_second=84.71675049304832
INFO [ print_timings] generation eval time = 12948.27 ms / 113 runs ( 114.59 ms per token, 8.73 tokens per second) | tid="137281198051328" timestamp=1748127329 id_slot=0 id_task=631 t_token_generation=12948.269 n_decoded=113 t_token=114.58645132743364 n_tokens_second=8.727035250812289
INFO [ print_timings] total time = 19275.24 ms | tid="137281198051328" timestamp=1748127329 id_slot=0 id_task=631 t_prompt_processing=6326.966 t_token_generation=12948.269 t_total=19275.235
INFO [ update_slots] slot released | tid="137281198051328" timestamp=1748127329 id_slot=0 id_task=631 n_ctx=32768 n_past=648 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127329
INFO [ log_server_request] request | tid="137279712509952" timestamp=1748127329 remote_addr="192.168.123.99" remote_port=51508 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127329
INFO [ log_server_request] request | tid="137279704117248" timestamp=1748127337 remote_addr="192.168.123.99" remote_port=55810 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="137281198051328" timestamp=1748127337 id_slot=0 id_task=746
INFO [ update_slots] kv cache rm [p0, end) | tid="137281198051328" timestamp=1748127337 id_slot=0 id_task=746 p0=0
INFO [ print_timings] prompt eval time = 6375.81 ms / 536 tokens ( 11.90 ms per token, 84.07 tokens per second) | tid="137281198051328" timestamp=1748127356 id_slot=0 id_task=746 t_prompt_processing=6375.806 n_prompt_tokens_processed=536 t_token=11.895160447761194 n_tokens_second=84.06780256488356
INFO [ print_timings] generation eval time = 12939.86 ms / 113 runs ( 114.51 ms per token, 8.73 tokens per second) | tid="137281198051328" timestamp=1748127356 id_slot=0 id_task=746 t_token_generation=12939.857 n_decoded=113 t_token=114.51200884955752 n_tokens_second=8.73270856084422
INFO [ print_timings] total time = 19315.66 ms | tid="137281198051328" timestamp=1748127356 id_slot=0 id_task=746 t_prompt_processing=6375.806 t_token_generation=12939.857 t_total=19315.663
INFO [ update_slots] slot released | tid="137281198051328" timestamp=1748127356 id_slot=0 id_task=746 n_ctx=32768 n_past=648 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127356
INFO [ log_server_request] request | tid="137279695724544" timestamp=1748127356 remote_addr="192.168.123.99" remote_port=55822 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="137281198051328" timestamp=1748127356
```

---

#### 💬 Conversation

👤 **saood06** commented the **2025-05-24** at **23:39:01**:
Are you passing `cache_prompt: true` in your request? I know llama.cpp now defaults it to on, but we do not do that here (it would be trivial to change), so as it stands the cache will not be reused unless you pass that field.

Edit: Just want to add that I use the server and I can get the KV cache reused between prompts that share a prefix, so it does work for me with that field passed in my requests.
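For reference, a minimal request that passes the field looks something like this (a sketch, assuming the server from the log above is listening on localhost:5001; `cache_prompt` is a llama.cpp-family extension, not part of the OpenAI spec):

```python
# Minimal Chat Completion request that opts into prompt caching.
import json
import urllib.request

payload = {
    "model": "unsloth/DeepSeek-R1-Q4_K_XL",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce AMD."},
    ],
    "cache_prompt": True,  # without this, ik_llama.cpp re-evaluates from p0=0
}
req = urllib.request.Request(
    "http://localhost:5001/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```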
---

👤 **ikawrakow** commented the **2025-05-25** at **04:32:30**:
@saood06 Maybe we should change the default?

---

👤 **saood06** commented the **2025-05-25** at **04:49:04**:
> [@saood06](https://github.com/saood06) Maybe we should change the default?

I agree. It's a trivial change, and with the caching implementation we have here there is almost no reason to turn it off.

I've been tinkering with an alternative caching mechanism, as I don't fully like the new one mainline has with chunking: I'm fairly certain there are quality losses, especially if it is done excessively with small chunks. My alternative is more involved and has other benefits, but it's still nowhere close to being done or even draft-PR ready.

---

👤 **luzamm** commented the **2025-05-25** at **08:55:07**:
After passing `cache_prompt: true`, it worked well. But many WebUIs do not pass this field, and there is no easy place to add it. Would it be better to turn it on by default?

---

👤 **saood06** commented the **2025-05-25** at **09:17:43**:
> After passing `cache_prompt: true`, it worked well.

I am glad to hear that.

> But many WebUIs do not pass this field, and there is no easy place to add it. Would it be better to turn it on by default?

Yes, I will do that. I looked into it enough to deem it trivial; I just haven't gotten around to it yet, but I will. I'll mark this closed once the default is set.

---

👤 **Ph0rk0z** commented the **2025-05-25** at **16:28:04**:
It never reprocesses my cache because I use text completion with SillyTavern. What happens when you reach the context limit? I know that mainline has some mechanism for that. Does it just reprocess the context with every message past the limit?

---

👤 **saood06** commented the **2025-05-28** at **01:00:43**:
@luzamm Sorry for the delay, but the PR that changes the default has been made, and I have linked it to this issue so the issue closes automatically once it gets merged.

@Ph0rk0z

> It never reprocesses my cache because I use text completion with SillyTavern. What happens when you reach the context limit? I know that mainline has some mechanism for that. Does it just reprocess the context with every message past the limit?

There is a feature called context shifting that shifts the entire context window (by half, I think?) while keeping the system prompt (if used). This feature does not work for all models, and in my own personal experience it leads to a noticeable and often severe degradation in output quality, though for some of my use cases it was fine. I have not used context shifting in a long time, but as far as I can tell the implementation here is the same as the one I have experienced.
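Roughly, the idea is something like this (an illustrative sketch of the general scheme, not the actual server code; `n_keep` covering the system prompt is an assumption):

```python
# Rough illustration of context shifting: once the context is full, keep the
# first n_keep tokens (e.g. the system prompt) and drop roughly half of the
# rest; everything after the dropped span slides left and generation continues.
def context_shift(tokens: list[int], n_ctx: int, n_keep: int) -> list[int]:
    if len(tokens) < n_ctx:
        return tokens  # still room; no shift needed
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2  # the "by half" behavior described above
    # In the real server the surviving tokens' KV-cache entries are shifted
    # in place rather than re-evaluated; here we only model the token list.
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

# Example: 8-token window, 2 tokens of system prompt kept.
print(context_shift(list(range(8)), n_ctx=8, n_keep=2))  # [0, 1, 5, 6, 7]
```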
---

👤 **Ph0rk0z** commented the **2025-05-28** at **15:12:09**:

> I have not used context shifting in a long time, but as far as I can tell the implementation here is the same as the one I have experienced.

I thought it didn't work here because this was forked before the implementation in mainline. There is no `--cache-reuse` flag, and I see nothing about context shift. I've only ever tried the implementation in ooba.

---

👤 **saood06** commented the **2025-05-28** at **22:04:21**:
> I thought it didn't work here because this was forked before the implementation in mainline. There is no `--cache-reuse` flag, and I see nothing about context shift. I've only ever tried the implementation in ooba.

You are talking about two different things. Context shifting (which allows for an "infinite" amount of chatting) is supported, see the code [here](https://github.com/ikawrakow/ik_llama.cpp/blob/ccd6d9cdf6851f7042c48d682daf47bc0e2eca27/examples/server/server.cpp#L1946), but there is no documentation for it.

I do not plan to port over the `--cache-reuse` flag from mainline, which allows you to reuse chunks of the prompt beyond the shared prefix, since it results in quality losses (although when used reasonably those losses may be acceptable or even imperceptible). I am working on an alternative with different tradeoffs: it will actually be better in some situations but worse in others, since it won't chunk the cache. A toy sketch of the difference is below.
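Something like the following toy comparison (hypothetical illustration, not code from either repo):

```python
# Toy contrast between the two reuse strategies discussed above.
# Prefix reuse (what cache_prompt gives you here) skips only the longest
# common prefix, so it is exact; chunked reuse (mainline's --cache-reuse)
# can also keep matching spans after the first divergence, recomputing
# fewer tokens at the cost of an approximation, since the reused entries
# were originally computed at different positions.
def prefix_reuse(cached: list[int], prompt: list[int]) -> int:
    """Number of prompt tokens whose KV entries can be reused exactly."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

cached = [10, 11, 12, 13, 14, 15]
same   = [10, 11, 12, 13, 14, 15]   # unchanged regeneration, as in this issue
edited = [10, 11, 99, 13, 14, 15]   # one token edited mid-prompt

print(prefix_reuse(cached, same))    # 6 -> nothing is re-evaluated
print(prefix_reuse(cached, edited))  # 2 -> everything after token 2 is redone;
                                     # chunked reuse could also salvage 13, 14, 15
```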