
🐛 #575 - Bug: llama-server crash with sampling order

Author mcm007
State Closed
Created 2025-07-03
Updated 2025-07-06

Description

What happened?

The OpenAI endpoint crashes when the sampler order is specified with --samplers "min_p;temperature" or --sampling-seq "mt", starting with commit 3f111ad (add dry sampler, #513).

The behavior was observed with aider, but it can also be reproduced with curl:

curl -k  ik_llamacpp:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
    "model": "Qwen_Qwen3-0.6B-Q6_K.gguf",
    "messages": [
    {
        "role": "user",
        "content": "Hello!"
    }
    ]
    }'

The WebUI works correctly.

The same result occurs with other models, and with the fa, mla, and moe options.
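
For reproduction, the server presumably needs to be started with one of the failing sampler flags, along these lines (a sketch; the arguments other than the sampler flag mirror the log below, and the sampler flag itself is an assumption based on this report):

# Start the server with the failing sampler order (sketch)
llama-server --host 0.0.0.0 --port 8080 --ctx-size 4096 \
    --samplers "min_p;temperature" \
    --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf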

Name and Version

version: 3760 (3f111ad7)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

# Build
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Run
llama-server --host 0.0.0.0 --port 8080 --ctx-size 4096 --verbose --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf

# Log
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   448.00 MiB
llama_new_context_with_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.16 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 873
llama_new_context_with_model: graph splits = 1
INFO [                    init] initializing slots | tid="139998054885568" timestamp=1751531864 n_slots=1
INFO [                    init] new slot | tid="139998054885568" timestamp=1751531864 id_slot=0 n_ctx_slot=4096
INFO [                    main] model loaded | tid="139998054885568" timestamp=1751531864
INFO [                    main] chat template | tid="139998054885568" timestamp=1751531864 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="139998054885568" timestamp=1751531864 n_threads_http="3" port="8080" hostname="0.0.0.0"
VERB [              start_loop] new task may arrive | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] update_multitasks | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531864
INFO [            update_slots] all slots are idle | tid="139998054885568" timestamp=1751531864
VERB [          kv_cache_clear] clearing KV cache | tid="139998054885568" timestamp=1751531864
VERB [              get_new_id] new task id | tid="139996550641216" timestamp=1751531864 new_id=0
VERB [     add_waiting_task_id] waiting for task id | tid="139996550641216" timestamp=1751531864 id_task=0
VERB [              start_loop] wait for new task | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] new task may arrive | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] callback_new_task | tid="139998054885568" timestamp=1751531864 id_task=0
INFO [     process_single_task] slot data | tid="139998054885568" timestamp=1751531864 id_task=0 n_idle_slots=1 n_processing_slots=0
VERB [     process_single_task] slot data | tid="139998054885568" timestamp=1751531864 id_task=0 n_idle_slots=1 n_processing_slots=0 slots=[{"n_ctx":4096,"n_predict":-1,"model":"/models1/Qwen_Qwen3-0.6B-Q6_K.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["min_p","temperature"],"id":0,"id_task":-1,"state":0,"prompt":null,"next_token":{"has_next_token":true,"n_remain":-1,"n_decoded":0,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}}]
VERB [                    send] send new result | tid="139998054885568" timestamp=1751531864 id_task=0
VERB [                    send] queue_results.push_back | tid="139998054885568" timestamp=1751531864 id_task=0
VERB [              start_loop] update_multitasks | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531864
INFO [            update_slots] all slots are idle | tid="139998054885568" timestamp=1751531864
VERB [              start_loop] wait for new task | tid="139998054885568" timestamp=1751531864
VERB [  remove_waiting_task_id] remove waiting for task id | tid="139996550641216" timestamp=1751531864 id_task=0
INFO [      log_server_request] request | tid="139996550641216" timestamp=1751531864 remote_addr="127.0.0.1" remote_port=40444 status=200 method="GET" path="/health" params={}
VERB [      log_server_request] request | tid="139996550641216" timestamp=1751531864 request="" response="{\"status\":\"ok\",\"slots_idle\":1,\"slots_processing\":0}"
VERB [             format_chat] formatted_chat | tid="139996472276544" timestamp=1751531867 text="<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
VERB [              get_new_id] new task id | tid="139996472276544" timestamp=1751531867 new_id=1
VERB [     add_waiting_task_id] waiting for task id | tid="139996472276544" timestamp=1751531867 id_task=1
VERB [              start_loop] new task may arrive | tid="139998054885568" timestamp=1751531867
VERB [              start_loop] callback_new_task | tid="139998054885568" timestamp=1751531867 id_task=1
VERB [      get_available_slot] selected slot by lru | tid="139998054885568" timestamp=1751531867 id_slot=0 t_last=-1
INFO [   launch_slot_with_task] slot is processing task | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1
VERB [              start_loop] update_multitasks | tid="139998054885568" timestamp=1751531867
VERB [              start_loop] callback_update_slots | tid="139998054885568" timestamp=1751531867
VERB [            update_slots] posting NEXT_RESPONSE | tid="139998054885568" timestamp=1751531867
VERB [                    post] new task id | tid="139998054885568" timestamp=1751531867 new_id=2
VERB [            update_slots] tokenizing prompt | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1
VERB [            update_slots] prompt tokenized | tid="139998054885568" timestamp=1751531867 id_slot=0 id_task=1 n_ctx=4096 n_keep=0 n_prompt_tokens=10 prompt_tokens="<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"

💬 Conversation

👤 ikawrakow commented the 2025-07-03 at 09:24:06:

Is this one example of many where it crashes, or is this the only sampler combination for which it crashes?


👤 mcm007 commented the 2025-07-03 at 09:59:07:

After some tests, it seems that it crashes whenever dry is not specified:

Failing:
--samplers "top_k"
--samplers "top_k;tfs_z"
--samplers "top_k;tfs_z;typical_p;top_p;min_p;temperature"
--samplers "top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"
--sampling-seq "mt"

Working:
--samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"
--samplers "dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature"
--samplers "dry"
--samplers "dry;min_p;temperature"
--samplers "min_p;temperature;dry"
--sampling-seq "mtd"
--sampling-seq "dt"
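
A small script along these lines can sweep such combinations automatically (a sketch, not part of the original report; it assumes llama-server is on the PATH, reuses the model path from the log, and only checks whether /completion answers):

#!/bin/bash
# Try each sampler order and report whether the server answered a request (sketch)
for samplers in "top_k" "min_p;temperature" "dry;min_p;temperature"; do
    llama-server --port 8080 --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf \
        --samplers "$samplers" > /dev/null 2>&1 &
    pid=$!
    sleep 10   # crude wait; adjust for model load time
    if curl -s localhost:8080/completion -H "Content-Type: application/json" \
            -d '{"prompt": "Hello", "n_predict": 8}' > /dev/null; then
        echo "OK     --samplers \"$samplers\""
    else
        echo "CRASH? --samplers \"$samplers\""
    fi
    kill "$pid" 2> /dev/null
    wait "$pid" 2> /dev/null
done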


👤 ikawrakow commented the 2025-07-03 at 12:45:33:

Thanks for the bug report. #578 should fix it.


👤 mcm007 commented the 2025-07-03 at 20:17:21:

Sorry, it still shows the same behavior/crash 🙄

version: 3785 (3e024de1)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Please consider this low priority; it can even be ignored. Vulkan and all the other improvements are really appreciated.


👤 ikawrakow commented the 2025-07-05 at 13:12:19:

This is strange. I tested llama-cli with --sampling-seq mt, and it works fine after this PR.
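
Such a check is presumably along these lines (a sketch; the model path, prompt, and token count are placeholders, not taken from the log):

llama-cli --model /models1/Qwen_Qwen3-0.6B-Q6_K.gguf --sampling-seq mt -p "Hello" -n 32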


👤 mcm007 commented the 2025-07-05 at 18:17:15:

Indeed, I just tested it, and llama-cli works after this PR.

From what I see, llama-server still crashes for both API endpoints, /completion and /v1/chat/completions:

curl -k ik_llamacpp:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "Once upon a time",
  "n_predict": 50
  }'
curl -k ik_llamacpp:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen_Qwen3-0.6B-Q6_K.gguf",
    "messages": [
    {
        "role": "user",
        "content": "Hello!"
    }
    ]
    }'

👤 firecoperana commented the 2025-07-06 at 00:54:04:

https://github.com/ikawrakow/ik_llama.cpp/pull/588 should fix the server crash.


👤 mcm007 commented the 2025-07-06 at 06:30:29:

It works OK, thank you both!