### 🐛 [#575](https://github.com/ikawrakow/ik_llama.cpp/issues/575) - Bug: llama-server crash with sampling order

| **Author** | `mcm007` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-06 |

---

#### Description

### What happened?

The OpenAI endpoint crashes when the sampler order is specified with `--samplers "min_p;temperature"` or `--sampling-seq "mt"` after [Commit 3f111ad](https://github.com/ikawrakow/ik_llama.cpp/commit/3f111ad7bbb2d4f721332f9b2b344e48b3bbf9aa) ([add dry sampler #513](https://github.com/ikawrakow/ik_llama.cpp/pull/513)).

The behavior was observed with [aider](https://aider.chat/) but can be reproduced with curl:

```
curl -k ik_llamacpp:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "Qwen_Qwen3-0.6B-Q6_K.gguf",
    "messages": [
      { "role": "user", "content": "Hello!" }
    ]
  }'
```

The webui works correctly. The same result occurs with other models and with fa, mla, and moe.

### Name and Version

```
version: 3760 (3f111ad7)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```

### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
# Build
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Run
llama-server --host 0.0.0.0 --port 8080 --ctx-size 4096 --verbose --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf

# Log
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 448.00 MiB
llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.16 MiB
llama_new_context_with_model: CPU compute buffer size = 300.75 MiB
llama_new_context_with_model: graph nodes = 873
llama_new_context_with_model: graph splits = 1
INFO [ init] initializing slots | tid="139998054885568" timestamp=1751531864 n_slots=1
INFO [ init] new slot | tid="139998054885568" timestamp=1751531864 id_slot=0 n_ctx_slot=4096
INFO [ main] model loaded | tid="139998054885568" timestamp=1751531864
INFO [ main] chat template | tid="139998054885568" timestamp=1751531864 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="139998054885568" timestamp=1751531864 n_threads_http="3" port="8080" hostname="0.0.0.0"
VERB [ start_loop] new task may arrive | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] update_multitasks | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531864
INFO [ update_slots] all slots are idle | tid="139998054885568" timestamp=1751531864
VERB [ kv_cache_clear] clearing KV cache | tid="139998054885568" timestamp=1751531864
VERB [ get_new_id] new task id | tid="139996550641216" timestamp=1751531864 new_id=0
VERB [ add_waiting_task_id] waiting for task id | tid="139996550641216" timestamp=1751531864 id_task=0
VERB [ start_loop] wait for new task | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] new task may arrive | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] callback_new_task | tid="139998054885568" timestamp=1751531864 id_task=0
INFO [ process_single_task] slot data | tid="139998054885568" timestamp=1751531864 id_task=0 n_idle_slots=1 n_processing_slots=0
VERB [ process_single_task] slot data | tid="139998054885568" timestamp=1751531864 id_task=0 n_idle_slots=1 n_processing_slots=0 slots=[{"n_ctx":4096,"n_predict":-1,"model":"/models1/Qwen_Qwen3-0.6B-Q6_K.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["min_p","temperature"],"id":0,"id_task":-1,"state":0,"prompt":null,"next_token":{"has_next_token":true,"n_remain":-1,"n_decoded":0,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}}]
VERB [ send] send new result | tid="139998054885568" timestamp=1751531864 id_task=0
VERB [ send] queue_results.push_back | tid="139998054885568" timestamp=1751531864 id_task=0
VERB [ start_loop] update_multitasks | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531864
INFO [ update_slots] all slots are idle | tid="139998054885568" timestamp=1751531864
VERB [ start_loop] wait for new task | tid="139998054885568" timestamp=1751531864
VERB [ remove_waiting_task_id] remove waiting for task id | tid="139996550641216" timestamp=1751531864 id_task=0
INFO [ log_server_request] request | tid="139996550641216" timestamp=1751531864 remote_addr="127.0.0.1" remote_port=40444 status=200 method="GET" path="/health" params={}
VERB [ log_server_request] request | tid="139996550641216" timestamp=1751531864 request="" response="{\"status\":\"ok\",\"slots_idle\":1,\"slots_processing\":0}"
VERB [ format_chat] formatted_chat | tid="139996472276544" timestamp=1751531867 text="<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
VERB [ get_new_id] new task id | tid="139996472276544" timestamp=1751531867 new_id=1
VERB [ add_waiting_task_id] waiting for task id | tid="139996472276544" timestamp=1751531867 id_task=1
VERB [ start_loop] new task may arrive | tid="139998054885568" timestamp=1751531867
VERB [ start_loop] callback_new_task | tid="139998054885568" timestamp=1751531867 id_task=1
VERB [ get_available_slot] selected slot by lru | tid="139998054885568" timestamp=1751531867 id_slot=0 t_last=-1
INFO [ launch_slot_with_task] slot is processing task | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1
VERB [ start_loop] update_multitasks | tid="139998054885568" timestamp=1751531867
VERB [ start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531867
VERB [ update_slots] posting NEXT_RESPONSE | tid="139998054885568" timestamp=1751531867
VERB [ post] new task id | tid="139998054885568" timestamp=1751531867 new_id=2
VERB [ update_slots] tokenizing prompt | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1
VERB [ update_slots] prompt tokenized | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1 n_ctx=4096 n_keep=0 n_prompt_tokens=10 prompt_tokens="<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
```
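For reference, the crash described above should be reproducible by adding the sampler order from the description to the run command in the log and then issuing the curl request shown earlier. A sketch, assuming the same model path and port:

```shell
# Same run command as in the log, plus the sampler order from the description.
# Any OpenAI-compatible request (e.g. the curl call above) should then trigger the crash.
llama-server --host 0.0.0.0 --port 8080 --ctx-size 4096 --verbose \
    --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf \
    --samplers "min_p;temperature"
```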
---

#### 💬 Conversation

👤 **ikawrakow** commented on **2025-07-03** at **09:24:06**:

Is this one example of many where it crashes, or is this the only sampler combination for which it crashes?

---

👤 **mcm007** commented on **2025-07-03** at **09:59:07**:
After some tests, it seems that it crashes when `dry` is not specified.

Failing:

- `--samplers "top_k"`
- `--samplers "top_k;tfs_z"`
- `--samplers "top_k;tfs_z;typical_p;top_p;min_p;temperature"`
- `--samplers "top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"`
- `--sampling-seq "mt"`

Working:

- `--samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"`
- `--samplers "dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"`
- `--samplers "dry"`
- `--samplers "dry;min_p;temperature"`
- `--samplers "min_p;temperature;dry"`
- `--sampling-seq "mtd"`
- `--sampling-seq "dt"`
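The pattern above (only orders containing `dry` survive) suggests the crash is tied to how the sampler chain is set up from the user-supplied order. A rough way to bisect this from the shell, offered only as a sketch rather than anything used in this thread (the hostname, model path, and 10-second startup wait are assumptions carried over from the report):

```shell
# Hypothetical quick check: start the server with each sampler order,
# send one chat request, then stop it. Per the report, orders without
# "dry" are the ones that crash the server.
for order in "min_p;temperature" "dry;min_p;temperature"; do
    llama-server --host 0.0.0.0 --port 8080 --ctx-size 4096 \
        --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf --samplers "$order" &
    pid=$!
    sleep 10   # crude wait for the model to load
    curl -s ik_llamacpp:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model":"Qwen_Qwen3-0.6B-Q6_K.gguf","messages":[{"role":"user","content":"Hello!"}]}' \
        > /dev/null \
        || echo "request failed with --samplers \"$order\""
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
done
```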
---

👤 **ikawrakow** commented on **2025-07-03** at **12:45:33**:

Thanks for the bug report. #578 should fix it.

---

👤 **mcm007** commented on **2025-07-03** at **20:17:21**:
Sorry, it has the same behavior/crash 🙄

```
version: 3785 (3e024de1)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```

Please consider this low priority; it can even be ignored. Vulkan and all the other improvements are really appreciated.

---

👤 **ikawrakow** commented on **2025-07-05** at **13:12:19**:
This is strange. I tested `llama-cli` with `--sampling-seq mt`, and it works fine after this PR.
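A minimal check along these lines (model path taken from the report; prompt and token count chosen arbitrarily) would presumably be:

```shell
# Hypothetical llama-cli sanity check with the sampler order from the report
llama-cli -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf --sampling-seq mt -p "Hello!" -n 32
```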
---

👤 **mcm007** commented on **2025-07-05** at **18:17:15**:

Indeed, just tested: `llama-cli` is working after this PR.

From what I see, `llama-server` is still crashing for both API endpoints, `/completion` and `/v1/chat/completions`:

```
curl -k ik_llamacpp:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "n_predict": 50
  }'
```

```
curl -k ik_llamacpp:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3-0.6B-Q6_K.gguf",
    "messages": [
      { "role": "user", "content": "Hello!" }
    ]
  }'
```

---

👤 **firecoperana** commented on **2025-07-06** at **00:54:04**:
https://github.com/ikawrakow/ik_llama.cpp/pull/588 should fix the server crash.

---

👤 **mcm007** commented on **2025-07-06** at **06:30:29**:
It works OK, thank you both!