🐛 #503 - Bug: server/cli fails with segmentation fault

Author OneOfOne
State Closed
Created 2025-06-07
Updated 2025-06-28

Description

What happened?

Trying to run: ./build/bin/llama-cli --model /nas/llm/unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf --alias qwen3-32b-q4_k_xl.gguf --ctx-size 16768 -ctk q8_0 -ctv q8_0 -fa -amb 512 --parallel 1 --n-gpu-layers 65 --threads 12 --override-tensor exps=CPU --port 12345 -p 'whats your name'

Name and Version

Vulkan0: AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
version: 3732 (9e567e38)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

```shell
 ./build/bin/llama-cli --model /nas/llm/unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf --alias qwen3-32b-q4_k_xl.gguf --ctx-size 16768 -ctk q8_0 -ctv q8_0 -fa -amb 512 --parallel 1 --n-gpu-layers 65 --threads 16 --override-tensor exps=CPU  --port 12345 -p "what's your name?"
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
Log start
main: build = 3732 (9e567e38)
main: built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
main: seed  = 1749325503
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from /nas/llm/unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3-32B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   7:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   8:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                          general.file_type u32              = 15
llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-32B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-32B.txt
llama_model_loader: - kv  30:             quantize.imatrix.entries_count i32              = 448
llama_model_loader: - kv  31:              quantize.imatrix.chunks_count i32              = 685
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type q4_K:  293 tensors
llama_model_loader: - type q5_K:   35 tensors
llama_model_loader: - type q6_K:   94 tensors
llama_model_loader: - type iq4_xs:   28 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 25600
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 32.762 B
llm_load_print_meta: model size       = 18.641 GiB (4.888 BPW) 
llm_load_print_meta: repeating layers = 17.639 GiB (4.855 BPW, 31.206 B parameters)
llm_load_print_meta: general.name     = Qwen3-32B
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.63 MiB
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: AMD Radeon RX 7900 XTX (RADV NAVI31) buffer size = 18671.19 MiB
llm_load_tensors:        CPU buffer size =   417.30 MiB
................................................................................................
llama_new_context_with_model: n_ctx      = 16896
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon RX 7900 XTX (RADV NAVI31) KV buffer size =  2244.00 MiB
llama_new_context_with_model: KV self size  = 2244.00 MiB, K (q8_0): 1122.00 MiB, V (q8_0): 1122.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: AMD Radeon RX 7900 XTX (RADV NAVI31) compute buffer size =   306.75 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =   209.42 MiB
llama_new_context_with_model: graph nodes  = 1734
llama_new_context_with_model: graph splits = 779
[1]    3804384 segmentation fault (core dumped)
```

💬 Conversation

👤 OneOfOne commented on 2025-06-07 at 20:05:07:

This only happens with the Vulkan backend; I haven't figured out how to use ROCm, or whether it's even supported.


👤 OneOfOne commented on 2025-06-07 at 20:36:11:

Narrowed it down to -ctv / -ctk: removing them lets the model load. However, even with full offloading to the GPU it's extremely slow: 2 t/s vs 35 t/s in LM Studio (Vulkan backend).
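
For reference, a sketch of the command with the quantized-KV-cache flags dropped, i.e. the invocation from the report above minus -ctk q8_0 / -ctv q8_0, everything else unchanged:

```shell
# Original reproduction command without the q8_0 KV-cache quantization flags
./build/bin/llama-cli \
  --model /nas/llm/unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --alias qwen3-32b-q4_k_xl.gguf \
  --ctx-size 16768 -fa -amb 512 --parallel 1 \
  --n-gpu-layers 65 --threads 16 \
  --override-tensor exps=CPU --port 12345 \
  -p "what's your name?"
```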


👤 Ph0rk0z commented on 2025-06-07 at 22:35:28:

Since it's not a large MoE but a dense model, I'm not sure there is a reason to use IK for it instead of mainline.


👤 OneOfOne commented on 2025-06-08 at 02:12:36:

I wanted to play with some of the GGUFs optimized for ik_llama, so I figured I'd give it a try. That doesn't explain why those options don't work, or why it's extremely slow with full GPU offload.


👤 saood06 commented on 2025-06-08 at 04:56:55:

> Since it's not a large MoE but a dense model, I'm not sure there is a reason to use IK for it instead of mainline.

That is not true at all. See this (https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828) for a list of reasons on top of the new quant types, and there are many examples of performance gains over mainline; for batched performance, see the graph in #171.

Going back to the actual issue: Vulkan and ROCm may not be functioning well in this repo, as they receive very little testing (this is the first I'm hearing of someone trying to use them) and, as far as I'm aware, have no development here.


👤 ikawrakow commented on 2025-06-08 at 05:04:08:

Yes, mainline is a much better place for Vulkan users. There has been zero development or updates to the Vulkan back-end since I forked the project. At that time the llama.cpp Vulkan back-end was quite immature; there has been very active Vulkan development in mainline since then, with many performance improvements. ROCm is also never tested, so it is unclear whether it still works.

> I wanted to play with some of the GGUFs optimized for ik_llama

These quantization types are not implemented in the Vulkan back-end, so they run on the CPU. That's why you see the very low performance (and if the tensors are loaded on the GPU, it is even slower than just running CPU-only).
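
For comparison, a minimal sketch of a CPU-only run of the same model (GPU offload disabled via --n-gpu-layers 0; the remaining flags are taken from the original report and are otherwise untested here):

```shell
# CPU-only baseline: no layers offloaded, default f16 KV cache
./build/bin/llama-cli \
  --model /nas/llm/unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --ctx-size 16768 --threads 16 \
  --n-gpu-layers 0 \
  -p "what's your name?"
```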


👤 OneOfOne commented on 2025-06-08 at 16:22:15:

Thanks for the replies and the explanation. I'll close this issue for now until I get an Nvidia card, I guess.


👤 ubergarm commented on 2025-06-28 at 22:48:25:

@OneOfOne

Thanks for giving this a try and reporting your findings. Your experience lines up with my own brief exploration, which I've documented in this discussion if you have any interest: https://github.com/ikawrakow/ik_llama.cpp/discussions/562

Thanks!