Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-02-01 20:19:52 +00:00)
### 🐛 [#539](https://github.com/ikawrakow/ik_llama.cpp/issues/539) - Bug: garbage output
| **Author** | `jagusztinl` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-19 |
| **Updated** | 2025-06-26 |

---

#### Description

### What happened?

Please help: I have tried several models, but there is no meaningful output (the CLI and the server behave the same, with or without `-rtr`):

@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-Q4_0.gguf --prompt "What is the meaning of life?"
|
||
Log start
|
||
main: build = 3751 (8b3002bb)
|
||
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
|
||
main: seed = 1750314253
|
||
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-Q4_0.gguf (version GGUF V3 (latest))
|
||
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
||
llama_model_loader: - kv 0: general.architecture str = gemma3
|
||
llama_model_loader: - kv 1: general.type str = model
|
||
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
|
||
llama_model_loader: - kv 3: general.finetune str = it
|
||
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
|
||
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
|
||
llama_model_loader: - kv 6: general.size_label str = 27B
|
||
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
|
||
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
|
||
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
|
||
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
|
||
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
|
||
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
|
||
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
||
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
|
||
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
|
||
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
|
||
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
|
||
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16
|
||
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
|
||
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
|
||
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
|
||
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
|
||
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
|
||
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
|
||
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
|
||
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
|
||
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
|
||
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
|
||
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
|
||
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
|
||
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
|
||
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
|
||
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
|
||
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
||
llama_model_loader: - kv 35: general.file_type u32 = 2
|
||
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat
|
||
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt
|
||
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434
|
||
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
|
||
llama_model_loader: - type f32: 373 tensors
|
||
llama_model_loader: - type q4_0: 427 tensors
|
||
llama_model_loader: - type q4_1: 7 tensors
|
||
llama_model_loader: - type q6_K: 1 tensors
|
||
llm_load_vocab: special tokens cache size = 6415
|
||
llm_load_vocab: token to piece cache size = 1.9446 MB
|
||
llm_load_print_meta: format = GGUF V3 (latest)
|
||
llm_load_print_meta: arch = gemma3
|
||
llm_load_print_meta: vocab type = SPM
|
||
llm_load_print_meta: n_vocab = 262208
|
||
llm_load_print_meta: n_merges = 0
|
||
llm_load_print_meta: vocab_only = 0
|
||
llm_load_print_meta: n_ctx_train = 131072
|
||
llm_load_print_meta: n_embd = 5376
|
||
llm_load_print_meta: n_layer = 62
|
||
llm_load_print_meta: n_head = 32
|
||
llm_load_print_meta: n_head_kv = 16
|
||
llm_load_print_meta: n_rot = 128
|
||
llm_load_print_meta: n_swa = 1024
|
||
llm_load_print_meta: n_swa_pattern = 6
|
||
llm_load_print_meta: n_embd_head_k = 128
|
||
llm_load_print_meta: n_embd_head_v = 128
|
||
llm_load_print_meta: n_gqa = 2
|
||
llm_load_print_meta: n_embd_k_gqa = 2048
|
||
llm_load_print_meta: n_embd_v_gqa = 2048
|
||
llm_load_print_meta: f_norm_eps = 0.0e+00
|
||
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
|
||
llm_load_print_meta: f_clamp_kqv = 0.0e+00
|
||
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
|
||
llm_load_print_meta: f_logit_scale = 0.0e+00
|
||
llm_load_print_meta: n_ff = 21504
|
||
llm_load_print_meta: n_expert = 0
|
||
llm_load_print_meta: n_expert_used = 0
|
||
llm_load_print_meta: causal attn = 1
|
||
llm_load_print_meta: pooling type = 0
|
||
llm_load_print_meta: rope type = 2
|
||
llm_load_print_meta: rope scaling = linear
|
||
llm_load_print_meta: freq_base_train = 1000000.0
|
||
llm_load_print_meta: freq_scale_train = 0.125
|
||
llm_load_print_meta: n_ctx_orig_yarn = 131072
|
||
llm_load_print_meta: rope_finetuned = unknown
|
||
llm_load_print_meta: ssm_d_conv = 0
|
||
llm_load_print_meta: ssm_d_inner = 0
|
||
llm_load_print_meta: ssm_d_state = 0
|
||
llm_load_print_meta: ssm_dt_rank = 0
|
||
llm_load_print_meta: model type = 27B
|
||
llm_load_print_meta: model ftype = Q4_0
|
||
llm_load_print_meta: model params = 27.009 B
|
||
llm_load_print_meta: model size = 14.539 GiB (4.624 BPW)
|
||
llm_load_print_meta: general.name = Gemma-3-27B-It
|
||
llm_load_print_meta: BOS token = 2 '<bos>'
|
||
llm_load_print_meta: EOS token = 106 '<end_of_turn>'
|
||
llm_load_print_meta: UNK token = 3 '<unk>'
|
||
llm_load_print_meta: PAD token = 0 '<pad>'
|
||
llm_load_print_meta: LF token = 248 '<0x0A>'
|
||
llm_load_print_meta: EOT token = 106 '<end_of_turn>'
|
||
llm_load_print_meta: max token length = 48
|
||
llm_load_tensors: ggml ctx size = 0.35 MiB
|
||
llm_load_tensors: CPU buffer size = 14888.20 MiB
|
||
.........................................................................................
|
||
llama_new_context_with_model: n_ctx = 131072
|
||
llama_new_context_with_model: n_batch = 2048
|
||
llama_new_context_with_model: n_ubatch = 512
|
||
llama_new_context_with_model: flash_attn = 0
|
||
llama_new_context_with_model: mla_attn = 0
|
||
llama_new_context_with_model: attn_max_b = 0
|
||
llama_new_context_with_model: fused_moe = 0
|
||
llama_new_context_with_model: ser = -1, 0
|
||
llama_new_context_with_model: freq_base = 1000000.0
|
||
llama_new_context_with_model: freq_scale = 0.125
|
||
llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB
|
||
llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB
|
||
llama_new_context_with_model: CPU output buffer size = 1.00 MiB
|
||
llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB
|
||
llama_new_context_with_model: graph nodes = 2052
|
||
llama_new_context_with_model: graph splits = 1
|
||
|
||
system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
|
||
sampling:
|
||
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
|
||
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
|
||
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
|
||
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
|
||
sampling order:
|
||
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
|
||
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1
|
||
|
||
|
||
What is the meaning of life?[multimodal][multimodal][multimodal][multimodal][multimodal]

OR

alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen --prompt "What is the meaning of life?"
Qwen2.5-Coder-32B-Instruct-Q4_0.gguf Qwen3-32B-Q4_0.gguf
alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life?"
|
||
Log start
|
||
main: build = 3751 (8b3002bb)
|
||
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
|
||
main: seed = 1750314509
|
||
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest))
|
||
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
||
llama_model_loader: - kv 0: general.architecture str = qwen3
|
||
llama_model_loader: - kv 1: general.type str = model
|
||
llama_model_loader: - kv 2: general.name str = Qwen3-32B
|
||
llama_model_loader: - kv 3: general.basename str = Qwen3-32B
|
||
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
|
||
llama_model_loader: - kv 5: general.size_label str = 32B
|
||
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
|
||
llama_model_loader: - kv 7: qwen3.block_count u32 = 64
|
||
llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
|
||
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120
|
||
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600
|
||
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64
|
||
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
|
||
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
|
||
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
||
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
|
||
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
|
||
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
|
||
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
|
||
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
||
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
||
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
||
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
|
||
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
|
||
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
|
||
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
||
llama_model_loader: - kv 26: general.quantization_version u32 = 2
|
||
llama_model_loader: - kv 27: general.file_type u32 = 2
|
||
llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat
|
||
llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt
|
||
llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448
|
||
llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685
|
||
llama_model_loader: - type f32: 257 tensors
|
||
llama_model_loader: - type q4_0: 441 tensors
|
||
llama_model_loader: - type q4_1: 8 tensors
|
||
llama_model_loader: - type q6_K: 1 tensors
|
||
llm_load_vocab: special tokens cache size = 26
|
||
llm_load_vocab: token to piece cache size = 0.9311 MB
|
||
llm_load_print_meta: format = GGUF V3 (latest)
|
||
llm_load_print_meta: arch = qwen3
|
||
llm_load_print_meta: vocab type = BPE
|
||
llm_load_print_meta: n_vocab = 151936
|
||
llm_load_print_meta: n_merges = 151387
|
||
llm_load_print_meta: vocab_only = 0
|
||
llm_load_print_meta: n_ctx_train = 40960
|
||
llm_load_print_meta: n_embd = 5120
|
||
llm_load_print_meta: n_layer = 64
|
||
llm_load_print_meta: n_head = 64
|
||
llm_load_print_meta: n_head_kv = 8
|
||
llm_load_print_meta: n_rot = 128
|
||
llm_load_print_meta: n_swa = 0
|
||
llm_load_print_meta: n_swa_pattern = 1
|
||
llm_load_print_meta: n_embd_head_k = 128
|
||
llm_load_print_meta: n_embd_head_v = 128
|
||
llm_load_print_meta: n_gqa = 8
|
||
llm_load_print_meta: n_embd_k_gqa = 1024
|
||
llm_load_print_meta: n_embd_v_gqa = 1024
|
||
llm_load_print_meta: f_norm_eps = 0.0e+00
|
||
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
|
||
llm_load_print_meta: f_clamp_kqv = 0.0e+00
|
||
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
|
||
llm_load_print_meta: f_logit_scale = 0.0e+00
|
||
llm_load_print_meta: n_ff = 25600
|
||
llm_load_print_meta: n_expert = 0
|
||
llm_load_print_meta: n_expert_used = 0
|
||
llm_load_print_meta: causal attn = 1
|
||
llm_load_print_meta: pooling type = 0
|
||
llm_load_print_meta: rope type = 2
|
||
llm_load_print_meta: rope scaling = linear
|
||
llm_load_print_meta: freq_base_train = 1000000.0
|
||
llm_load_print_meta: freq_scale_train = 1
|
||
llm_load_print_meta: n_ctx_orig_yarn = 40960
|
||
llm_load_print_meta: rope_finetuned = unknown
|
||
llm_load_print_meta: ssm_d_conv = 0
|
||
llm_load_print_meta: ssm_d_inner = 0
|
||
llm_load_print_meta: ssm_d_state = 0
|
||
llm_load_print_meta: ssm_dt_rank = 0
|
||
llm_load_print_meta: model type = ?B
|
||
llm_load_print_meta: model ftype = Q4_0
|
||
llm_load_print_meta: model params = 32.762 B
|
||
llm_load_print_meta: model size = 17.413 GiB (4.566 BPW)
|
||
llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters)
|
||
llm_load_print_meta: general.name = Qwen3-32B
|
||
llm_load_print_meta: BOS token = 11 ','
|
||
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
|
||
llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
|
||
llm_load_print_meta: LF token = 148848 'ÄĬ'
|
||
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
|
||
llm_load_print_meta: max token length = 256
|
||
llm_load_tensors: ggml ctx size = 0.32 MiB
|
||
llm_load_tensors: CPU buffer size = 17830.96 MiB
|
||
.................................................................................................
|
||
llama_new_context_with_model: n_ctx = 40960
|
||
llama_new_context_with_model: n_batch = 2048
|
||
llama_new_context_with_model: n_ubatch = 512
|
||
llama_new_context_with_model: flash_attn = 0
|
||
llama_new_context_with_model: mla_attn = 0
|
||
llama_new_context_with_model: attn_max_b = 0
|
||
llama_new_context_with_model: fused_moe = 0
|
||
llama_new_context_with_model: ser = -1, 0
|
||
llama_new_context_with_model: freq_base = 1000000.0
|
||
llama_new_context_with_model: freq_scale = 1
|
||
llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB
|
||
llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
|
||
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
|
||
llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB
|
||
llama_new_context_with_model: graph nodes = 1989
|
||
llama_new_context_with_model: graph splits = 1
|
||
|
||
system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
|
||
sampling:
|
||
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
|
||
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
|
||
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
|
||
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
|
||
sampling order:
|
||
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
|
||
generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0
|
||
|
||
|
||
What is the meaning of life?*:F+=@*GB&-4%G0'B$4HF;@E(H(C6;()@:%'8"4<-HC.&$G>)$2)536.).C5346=D=6;C41AD@BD&6D';-.:G1+;=;C!+7;A>!+:8DG466)+9#:<99)3

### Name and Version

version: 3751 (8b3002bb)
built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu

### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
```

---

#### 💬 Conversation

👤 **jagusztinl** commented on **2025-06-19** at **08:40:53**:<br>

I tried IQ4_XS models (Gemma) and they work perfectly, so maybe Q4_0 is the problem. But with IQ4_XS and `-rtr` it is garbage again. What am I missing?

(venv) alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-IQ4_XS.gguf -rtr --prompt "What is the meaning of life? In english please"
|
||
Log start
|
||
main: build = 3751 (8b3002bb)
|
||
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
|
||
main: seed = 1750322313
|
||
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-IQ4_XS.gguf (version GGUF V3 (latest))
|
||
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
||
llama_model_loader: - kv 0: general.architecture str = gemma3
|
||
llama_model_loader: - kv 1: general.type str = model
|
||
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
|
||
llama_model_loader: - kv 3: general.finetune str = it
|
||
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
|
||
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
|
||
llama_model_loader: - kv 6: general.size_label str = 27B
|
||
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
|
||
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
|
||
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
|
||
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
|
||
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
|
||
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
|
||
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
||
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
|
||
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
|
||
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
|
||
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
|
||
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16
|
||
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
|
||
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
|
||
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
|
||
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
|
||
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
|
||
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
|
||
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
|
||
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
|
||
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
|
||
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
|
||
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
|
||
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
|
||
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
|
||
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
|
||
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
|
||
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
||
llama_model_loader: - kv 35: general.file_type u32 = 30
|
||
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat
|
||
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt
|
||
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434
|
||
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
|
||
llama_model_loader: - type f32: 373 tensors
|
||
llama_model_loader: - type q6_K: 1 tensors
|
||
llama_model_loader: - type iq4_xs: 434 tensors
|
||
llm_load_vocab: special tokens cache size = 6415
|
||
llm_load_vocab: token to piece cache size = 1.9446 MB
|
||
llm_load_print_meta: format = GGUF V3 (latest)
|
||
llm_load_print_meta: arch = gemma3
|
||
llm_load_print_meta: vocab type = SPM
|
||
llm_load_print_meta: n_vocab = 262208
|
||
llm_load_print_meta: n_merges = 0
|
||
llm_load_print_meta: vocab_only = 0
|
||
llm_load_print_meta: n_ctx_train = 131072
|
||
llm_load_print_meta: n_embd = 5376
|
||
llm_load_print_meta: n_layer = 62
|
||
llm_load_print_meta: n_head = 32
|
||
llm_load_print_meta: n_head_kv = 16
|
||
llm_load_print_meta: n_rot = 128
|
||
llm_load_print_meta: n_swa = 1024
|
||
llm_load_print_meta: n_swa_pattern = 6
|
||
llm_load_print_meta: n_embd_head_k = 128
|
||
llm_load_print_meta: n_embd_head_v = 128
|
||
llm_load_print_meta: n_gqa = 2
|
||
llm_load_print_meta: n_embd_k_gqa = 2048
|
||
llm_load_print_meta: n_embd_v_gqa = 2048
|
||
llm_load_print_meta: f_norm_eps = 0.0e+00
|
||
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
|
||
llm_load_print_meta: f_clamp_kqv = 0.0e+00
|
||
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
|
||
llm_load_print_meta: f_logit_scale = 0.0e+00
|
||
llm_load_print_meta: n_ff = 21504
|
||
llm_load_print_meta: n_expert = 0
|
||
llm_load_print_meta: n_expert_used = 0
|
||
llm_load_print_meta: causal attn = 1
|
||
llm_load_print_meta: pooling type = 0
|
||
llm_load_print_meta: rope type = 2
|
||
llm_load_print_meta: rope scaling = linear
|
||
llm_load_print_meta: freq_base_train = 1000000.0
|
||
llm_load_print_meta: freq_scale_train = 0.125
|
||
llm_load_print_meta: n_ctx_orig_yarn = 131072
|
||
llm_load_print_meta: rope_finetuned = unknown
|
||
llm_load_print_meta: ssm_d_conv = 0
|
||
llm_load_print_meta: ssm_d_inner = 0
|
||
llm_load_print_meta: ssm_d_state = 0
|
||
llm_load_print_meta: ssm_dt_rank = 0
|
||
llm_load_print_meta: model type = 27B
|
||
llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw
|
||
llm_load_print_meta: model params = 27.009 B
|
||
llm_load_print_meta: model size = 13.747 GiB (4.372 BPW)
|
||
llm_load_print_meta: general.name = Gemma-3-27B-It
|
||
llm_load_print_meta: BOS token = 2 '<bos>'
|
||
llm_load_print_meta: EOS token = 106 '<end_of_turn>'
|
||
llm_load_print_meta: UNK token = 3 '<unk>'
|
||
llm_load_print_meta: PAD token = 0 '<pad>'
|
||
llm_load_print_meta: LF token = 248 '<0x0A>'
|
||
llm_load_print_meta: EOT token = 106 '<end_of_turn>'
|
||
llm_load_print_meta: max token length = 48
|
||
llm_load_tensors: ggml ctx size = 0.35 MiB
|
||
llm_load_tensors: CPU buffer size = 15179.85 MiB
|
||
........................................................................................
|
||
============ Repacked 434 tensors
|
||
llama_new_context_with_model: n_ctx = 131072
|
||
llama_new_context_with_model: n_batch = 2048
|
||
llama_new_context_with_model: n_ubatch = 512
|
||
llama_new_context_with_model: flash_attn = 0
|
||
llama_new_context_with_model: mla_attn = 0
|
||
llama_new_context_with_model: attn_max_b = 0
|
||
llama_new_context_with_model: fused_moe = 0
|
||
llama_new_context_with_model: ser = -1, 0
|
||
llama_new_context_with_model: freq_base = 1000000.0
|
||
llama_new_context_with_model: freq_scale = 0.125
|
||
llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB
|
||
llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB
|
||
llama_new_context_with_model: CPU output buffer size = 1.00 MiB
|
||
llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB
|
||
llama_new_context_with_model: graph nodes = 2052
|
||
llama_new_context_with_model: graph splits = 1
|
||
|
||
system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
|
||
sampling:
|
||
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
|
||
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
|
||
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
|
||
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
|
||
sampling order:
|
||
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
|
||
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1
|
||
|
||
|
||
What is the meaning of life? In english please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please

---

👤 **ikawrakow** commented on **2025-06-19** at **08:53:16**:<br>

Can you try the latest build?

---

👤 **jagusztinl** commented on **2025-06-20** at **08:01:04**:<br>

Same problem, please help:

:~/models$ uname -a
Linux gpt 6.11.0-1015-azure #15~24.04.1-Ubuntu SMP Thu May 1 03:01:44 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

:~/models$ gcc --version
gcc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)

~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life? In english please"
|
||
Log start
|
||
main: build = 3762 (1843ed22)
|
||
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
|
||
main: seed = 1750406253
|
||
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest))
|
||
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
||
llama_model_loader: - kv 0: general.architecture str = qwen3
|
||
llama_model_loader: - kv 1: general.type str = model
|
||
llama_model_loader: - kv 2: general.name str = Qwen3-32B
|
||
llama_model_loader: - kv 3: general.basename str = Qwen3-32B
|
||
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
|
||
llama_model_loader: - kv 5: general.size_label str = 32B
|
||
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
|
||
llama_model_loader: - kv 7: qwen3.block_count u32 = 64
|
||
llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
|
||
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120
|
||
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600
|
||
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64
|
||
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
|
||
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
|
||
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
||
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
|
||
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
|
||
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
|
||
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
|
||
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
||
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
||
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
||
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
|
||
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
|
||
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
|
||
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
||
llama_model_loader: - kv 26: general.quantization_version u32 = 2
|
||
llama_model_loader: - kv 27: general.file_type u32 = 2
|
||
llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat
|
||
llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt
|
||
llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448
|
||
llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685
|
||
llama_model_loader: - type f32: 257 tensors
|
||
llama_model_loader: - type q4_0: 441 tensors
|
||
llama_model_loader: - type q4_1: 8 tensors
|
||
llama_model_loader: - type q6_K: 1 tensors
|
||
llm_load_vocab: special tokens cache size = 26
|
||
llm_load_vocab: token to piece cache size = 0.9311 MB
|
||
llm_load_print_meta: format = GGUF V3 (latest)
|
||
llm_load_print_meta: arch = qwen3
|
||
llm_load_print_meta: vocab type = BPE
|
||
llm_load_print_meta: n_vocab = 151936
|
||
llm_load_print_meta: n_merges = 151387
|
||
llm_load_print_meta: vocab_only = 0
|
||
llm_load_print_meta: n_ctx_train = 40960
|
||
llm_load_print_meta: n_embd = 5120
|
||
llm_load_print_meta: n_layer = 64
|
||
llm_load_print_meta: n_head = 64
|
||
llm_load_print_meta: n_head_kv = 8
|
||
llm_load_print_meta: n_rot = 128
|
||
llm_load_print_meta: n_swa = 0
|
||
llm_load_print_meta: n_swa_pattern = 1
|
||
llm_load_print_meta: n_embd_head_k = 128
|
||
llm_load_print_meta: n_embd_head_v = 128
|
||
llm_load_print_meta: n_gqa = 8
|
||
llm_load_print_meta: n_embd_k_gqa = 1024
|
||
llm_load_print_meta: n_embd_v_gqa = 1024
|
||
llm_load_print_meta: f_norm_eps = 0.0e+00
|
||
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
|
||
llm_load_print_meta: f_clamp_kqv = 0.0e+00
|
||
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
|
||
llm_load_print_meta: f_logit_scale = 0.0e+00
|
||
llm_load_print_meta: n_ff = 25600
|
||
llm_load_print_meta: n_expert = 0
|
||
llm_load_print_meta: n_expert_used = 0
|
||
llm_load_print_meta: causal attn = 1
|
||
llm_load_print_meta: pooling type = 0
|
||
llm_load_print_meta: rope type = 2
|
||
llm_load_print_meta: rope scaling = linear
|
||
llm_load_print_meta: freq_base_train = 1000000.0
|
||
llm_load_print_meta: freq_scale_train = 1
|
||
llm_load_print_meta: n_ctx_orig_yarn = 40960
|
||
llm_load_print_meta: rope_finetuned = unknown
|
||
llm_load_print_meta: ssm_d_conv = 0
|
||
llm_load_print_meta: ssm_d_inner = 0
|
||
llm_load_print_meta: ssm_d_state = 0
|
||
llm_load_print_meta: ssm_dt_rank = 0
|
||
llm_load_print_meta: model type = ?B
|
||
llm_load_print_meta: model ftype = Q4_0
|
||
llm_load_print_meta: model params = 32.762 B
|
||
llm_load_print_meta: model size = 17.413 GiB (4.566 BPW)
|
||
llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters)
|
||
llm_load_print_meta: general.name = Qwen3-32B
|
||
llm_load_print_meta: BOS token = 11 ','
|
||
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
|
||
llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
|
||
llm_load_print_meta: LF token = 148848 'ÄĬ'
|
||
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
|
||
llm_load_print_meta: max token length = 256
|
||
llm_load_tensors: ggml ctx size = 0.32 MiB
|
||
llm_load_tensors: CPU buffer size = 17830.96 MiB
|
||
.................................................................................................
|
||
llama_new_context_with_model: n_ctx = 40960
|
||
llama_new_context_with_model: n_batch = 2048
|
||
llama_new_context_with_model: n_ubatch = 512
|
||
llama_new_context_with_model: flash_attn = 0
|
||
llama_new_context_with_model: mla_attn = 0
|
||
llama_new_context_with_model: attn_max_b = 0
|
||
llama_new_context_with_model: fused_moe = 0
|
||
llama_new_context_with_model: ser = -1, 0
|
||
llama_new_context_with_model: freq_base = 1000000.0
|
||
llama_new_context_with_model: freq_scale = 1
|
||
llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB
|
||
llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
|
||
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
|
||
llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB
|
||
llama_new_context_with_model: graph nodes = 1989
|
||
llama_new_context_with_model: graph splits = 1
|
||
|
||
system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
|
||
sampling:
|
||
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
|
||
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
|
||
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
|
||
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
|
||
sampling order:
|
||
CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature
|
||
generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0
|
||
|
||
|
||
What is the meaning of life? In english please-E4>6'236,(=+G7(@G>H$8,<F*("-D#'6:FC6.!+;1CF(B%D!-1@;8)((2+/5=>$,",E0CC*"B"61(F6<'8-,B9&

---

👤 **jagusztinl** commented on **2025-06-20** at **12:54:53**:<br>

FYI, I had these warnings during compilation:

[ 16%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
|
||
[ 16%] Built target build_info
|
||
In function ‘SHA1Update’,
|
||
inlined from ‘SHA1Final’ at /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5:
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
|
||
219 | SHA1Transform(context->state, &data[i]);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
|
||
54 | void SHA1Transform(
|
||
| ^~~~~~~~~~~~~
|
||
In function ‘SHA1Update’,
|
||
inlined from ‘SHA1Final’ at /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:269:9:
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
|
||
219 | SHA1Transform(context->state, &data[i]);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
|
||
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
|
||
54 | void SHA1Transform(
|
||
| ^~~~~~~~~~~~~
|
||
[ 16%] Built target sha1
|
||
[ 16%] Built target sha256
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_128_128.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_96_96.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_256_256.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_64_64.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_192_128.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_576_512.cpp:5:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:1119:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
534 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
578 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
579 | auto vm = F16::set1(*(const float16_t *)&dl->m);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function ‘void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const’:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
|
||
632 | auto vd = F16::set1(*(const float16_t *)&dl->d);
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.h:3,
|
||
from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:23:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h: At global scope:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
|
||
| ^~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
|
||
| ^~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
|
||
| ^~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
|
||
| ^~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.h:3,
|
||
from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
|
||
| ^~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
|
||
| ^~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
|
||
| ^~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
|
||
| ^~~~~~~~~~~~~~~~
|
||
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.h:3,
|
||
from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:1:
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
|
||
| ^~~~~~~~~~~~~~~~~~~~~~~~
|
||
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
|
||
  840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
      |                               ^~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
      |                                    ^~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
      |                                      ^~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
      |                                    ^~~~~~~~~~~~~~~~
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_ktquants.cpp:1:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
      |                               ^~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
      |                                    ^~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
      |                                      ^~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
      |                                    ^~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp: In function ‘void iqk_gemm_default_floats(int, int, const char*, size_t, DataInfo&, int)’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1039:34: warning: this statement may fall through [-Wimplicit-fallthrough=]
 1039 |             case 1: mm_helper<1>(D, nq, cx, bx, info, k_step);
      |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:13: note: here
 1040 |             case 2: mm_helper<2>(D, nq, cx, bx, info, k_step);
      |             ^~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:34: warning: this statement may fall through [-Wimplicit-fallthrough=]
 1040 |             case 2: mm_helper<2>(D, nq, cx, bx, info, k_step);
      |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1041:13: note: here
 1041 |             default: mm_helper<3>(D, nq, cx, bx, info, k_step);
      |             ^~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_kquants.cpp:3082:24: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
 3082 | IQK_ALWAYS_INLINE void prepare_q4_k_quants(const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
      |                        ^~~~~~~~~~~~~~~~~~~
In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:21:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp: In function ‘void {anonymous}::mul_mat_iq1bn_q8_K64(int, const void*, size_t, const DataInfo&, int) [with int nrc_y = 1]’:
/home/alerant/ik_llama.cpp/ggml/src/./ggml-impl.h:408:42: warning: iteration 2 invokes undefined behavior [-Waggressive-loop-optimizations]
  408 | #define ggml_vdotq_s32(a, b, c) vdotq_s32(a, b, c)
      |                                 ~~~~~~~~~^~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2015:31: note: in expansion of macro ‘ggml_vdotq_s32’
 2015 |                     accd[0] = ggml_vdotq_s32(accd[0], q.val[j], v1.val[j]);
      |                               ^~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2014:35: note: within this loop
 2014 |                 for (int j = 0; j < 4; ++j) {
      |                                   ~~^~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = false; bool is_int = true]’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8067:38: required from here
 8067 |             quantizer.find_best_match( amax/scale_0, xb, weight, best_idx);
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable]
 7585 |     int ncluster = m_clusters.size()/kGroupSize;
      |         ^~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable ‘id’ [-Wunused-variable]
 7586 |     float id = 1/d;
      |           ^~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter]
 7580 | void QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const {
      |                                                                                                ~~~~~~~~~~~~~~^~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter]
 7580 | void QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const {
      |                                                                                                                  ~~~~~~~~~~~~~~^~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘std::pair<float, float> {anonymous}::QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_scale(const float*, const float*, const int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = false; bool is_int = true]’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8068:59: required from here
 8068 |             auto [dp, score_p] = quantizer.find_best_scale(xb, weight, best_idx);
      |                                  ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7514:25: note: parameter passing for argument of type ‘std::pair<float, float>’ when C++17 is enabled changed to match C++14 in GCC 10.1
 7514 | std::pair<float, float> QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_scale(
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = true; bool is_int = true]’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8367:42: required from here
 8367 |                 quantizer.find_best_match(amax/(scale_0 + kStep*itry), xaux, weight, best_idx);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable ‘id’ [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 4; int num_bits = 15; bool is_abs = false; bool is_int = true]’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8642:43: required from here
 8642 |                 quantizer1.find_best_match( amax/(8.f*itry + scale_0), xaux, weight, best_idx);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable ‘id’ [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter]
[ 16%] Built target xxhash
[ 16%] Linking CXX shared library libggml.so
[ 16%] Built target ggml
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 18%] Building CXX object examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/gguf-hash.cpp.o
[ 19%] Building CXX object examples/gguf/CMakeFiles/llama-gguf.dir/gguf.cpp.o
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 21%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
[ 21%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 22%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 22%] Linking CXX executable ../../bin/llama-gguf
[ 22%] Built target llama-gguf
[ 23%] Linking CXX executable ../../bin/llama-gguf-hash
[ 23%] Built target llama-gguf-hash
^Cgmake[2]: *** [src/CMakeFiles/llama.dir/build.make:76: src/CMakeFiles/llama.dir/llama.cpp.o] Interrupt
gmake[1]: *** [CMakeFiles/Makefile2:1647: src/CMakeFiles/llama.dir/all] Interrupt
gmake: *** [Makefile:146: all] Interrupt

alerant@gpt:~/ik_llama.cpp$ cmake --build ./build --config Release -j $(nproc)
[  1%] Built target build_info
[  2%] Built target sha256
[  3%] Built target xxhash
[  3%] Built target sha1
[ 16%] Built target ggml
[ 18%] Built target llama-gguf-hash
[ 19%] Built target llama-gguf
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 20%] Linking CXX shared library libllama.so
[ 23%] Built target llama
[ 24%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 25%] Building CXX object examples/benchmark/CMakeFiles/llama-bench-matmult.dir/benchmark-matmult.cpp.o
[ 26%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 26%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 27%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 28%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
[ 29%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 30%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 31%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 31%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 32%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 33%] Linking C executable ../bin/test-c
[ 33%] Built target test-c
In file included from /usr/include/c++/14/bits/stl_algobase.h:64,
                 from /usr/include/c++/14/bits/specfun.h:43,
                 from /usr/include/c++/14/cmath:3898,
                 from /usr/include/c++/14/random:40,
                 from /home/alerant/ik_llama.cpp/src/../include/llama.h:1326,
                 from /home/alerant/ik_llama.cpp/examples/quantize-stats/../../common/common.h:12,
                 from /home/alerant/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:9:
/usr/include/c++/14/bits/stl_pair.h: In instantiation of ‘constexpr std::pair<typename std::__strip_reference_wrapper<typename std::decay<_Tp>::type>::__type, typename std::__strip_reference_wrapper<typename std::decay<_Tp2>::type>::__type> std::make_pair(_T1&&, _T2&&) [with _T1 = float; _T2 = float; typename __strip_reference_wrapper<typename decay<_Tp>::type>::__type = float; typename decay<_Tp>::type = float; typename __strip_reference_wrapper<typename decay<_Tp2>::type>::__type = float; typename decay<_Tp2>::type = float]’:
/home/alerant/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:392:68: required from here
  392 |     std::vector<std::pair<float, float>> range(ndim, std::make_pair(INFINITY, -INFINITY));
      |                                                      ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/14/bits/stl_pair.h:1132:5: note: parameter passing for argument of type ‘std::pair<float, float>’ when C++17 is enabled changed to match C++14 in GCC 10.1
 1132 |     make_pair(_T1&& __x, _T2&& __y)
      |     ^~~~~~~~~
[ 33%] Linking CXX executable ../../bin/llama-bench-matmult
[ 33%] Built target llama-bench-matmult
[ 33%] Linking CXX executable ../../bin/llama-quantize-stats
[ 33%] Built target llama-quantize-stats
In file included from /home/alerant/ik_llama.cpp/examples/llava/clip.cpp:24:
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h: In function ‘int stbi__parse_png_file(stbi__png*, int, int)’:
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h:5450:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 5450 |                tc[k] = (stbi_uc)(stbi__get16be(s) & 255) *
      |                ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 5451 |                        stbi__depth_scale_table[z->depth]; // non 8-bit images will be larger
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h:5326:28: note: at offset 3 into destination object ‘tc’ of size 3
 5326 |    stbi_uc has_trans = 0, tc[3] = {0};
      |                           ^~
[ 33%] Built target llava

---
👤 **jagusztinl** commented the **2025-06-20** at **14:04:07**:<br>

Fixed: building with `-DGGML_SVE=ON` solved it.

But inference is not faster than the current llama.cpp build on this ARM CPU for any model (PP is better). For example, on the same server:

llama.cpp:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | tg128 | 10.97 ± 0.07 |

ik_llama.cpp:

============ Repacked 611 tensors

| model | size | params | backend | threads | type_k | type_v | fa | mla | amb | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | pp512 | 70.30 ± 0.08 |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | tg128 | 9.59 ± 0.02 |
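For reference, the full configure-and-build sequence implied above would look roughly like the sketch below; the build directory, Release config, and job count are illustrative assumptions, and `-DGGML_SVE=ON` is the only flag actually reported here to change the outcome.

```bash
# Hedged sketch: rebuild ik_llama.cpp from scratch with SVE enabled (paths and job count illustrative)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_SVE=ON
cmake --build build --config Release -j "$(nproc)"
```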
---
👤 **jagusztinl** commented the **2025-06-20** at **14:06:39**:<br>

But not faster for any model than the current llama.cpp build on ARM CPU.

---
👤 **ikawrakow** commented the **2025-06-20** at **15:50:59**:<br>

You never mentioned you are using an ARM CPU. Unlike llama.cpp, nothing is automatically set for you on ARM; it is likely you need to set the arch options manually. `-DGGML_SVE=ON` solving your issues sounds strange to me, as no use is made of SVE anywhere in `ik_llama.cpp`. The only ARM implementation that exists is NEON.

A 60% difference in PP performance is not faster in your book? And that is for the quant receiving the most love in mainline `llama.cpp`, with special-purpose GEMM and GEMV implementations for ARM CPUs.

Also, `PP-512` and `TG-128` are very misleading measures of performance. When is it in real usage that I have zero tokens in the KV cache? Try running with something more significant in the KV cache (8k-18k tokens) and see how that goes. You may also want to try some of the i-quants.

But overall, yes, ARM CPUs are not a big focus of this project. I maintain it in a functional state, but haven't updated the ARM implementation for quite some time. It is missing the massive PP performance gains that I got on `AVX2` during the last 2-3 weeks.

---
👤 **ikawrakow** commented the **2025-06-20** at **15:59:12**:<br>

Oh, what is the CPU you are using?

---
👤 **jagusztinl** commented the **2025-06-21** at **08:39:04**:<br>

Thank you for your answer. A more detailed explanation of the project:

- We are using Azure Cobalt ARM CPUs on spot VMs (64 real cores, 512 GB of 12-channel, very fast RAM) for 0.5 USD/hour (!) instead of expensive GPU setups. The price/performance ratio is unbeatable: our colleagues can use DeepSeek privately for 80 USD/month, continuously and without limits.
- We experimented with llama.cpp as the fastest inference engine, with this setup (optimized for Cobalt and linked with the ARM performance libraries): `cmake -DCMAKE_CXX_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm "`, and the ggml detection results in:
  `Adding CPU backend variant ggml-cpu: -mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme`

The best result with llama.cpp was the following; usable, but we are looking for better performance, which is why we turned to your project:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | tg128 | 10.97 ± 0.07 |

Please advise how we can further optimize DeepSeek inference with your solution.

---
👤 **jagusztinl** commented the **2025-06-21** at **08:47:17**:<br>

About the garbage problem:

If I do not use `-DGGML_SVE=ON` during compilation, SVE is not detected:

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

instead of:

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |

This is the root cause of the garbage output on this server.

---
👤 **ikawrakow** commented the **2025-06-21** at **09:22:44**:<br>

I'm open to working on optimizing this project for SVE, but it is a hobby project of mine without commercial backing, so I develop/test on the CPU platforms I have access to (`AVX2`, `Zen4`, `ARM_NEON` on an M2-Max CPU).

What are you looking to optimize? I read somewhere that the "typical enterprise" workflow (whatever that means) involves processing `N`-token prompts and then generating a response of `N/10` tokens. Or are your customers' prompts really short while they expect long answers, so TG speed is all that matters? What about context? Do your customers never have a longer exchange with the LLM, but always just ask a single short question, get the answer, and close the session?

---
👤 **saood06** commented the **2025-06-21** at **16:16:04**:<br>

So, can you try experimenting with `-DGGML_ARCH_FLAGS=`, added by #347? Some users have had some success with it; see: https://github.com/ikawrakow/ik_llama.cpp/issues/345#issuecomment-2831460138. It looks like you have already done similar experimenting to optimize llama.cpp.
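As a sketch only, passing explicit architecture flags through that option could look like the following; the `-mcpu` string is the one printed by ggml's CPU detection earlier in this thread, and the build directory is an assumption.

```bash
# Hedged sketch: forward an explicit -mcpu string to the compiler via GGML_ARCH_FLAGS (#347)
cmake -B build -DGGML_ARCH_FLAGS="-mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme"
cmake --build build -j "$(nproc)"
```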
---
👤 **jagusztinl** commented the **2025-06-23** at **15:34:50**:<br>

Using this:

`cmake -B ./build -DGGML_LTO=ON -DCMAKE_CXX_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DGGML_ARCH_FLAGS="-mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme"`

ik_llama.cpp is the winner :-)

| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 68.19 ± 0.16 |
| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 11.54 ± 0.07 |

---
👤 **saood06** commented the **2025-06-23** at **20:40:31**:<br>

> ik_llama.cpp is the winner :-)

Glad you found some settings that made it perform well for you.

Why are you using MLA 2 now instead of 3 like you were previously (assuming the headers stayed the same)? Also, two tips: using a high ubatch size can boost PP (assuming you can make use of those larger batch sizes), and you can use [sweep-bench](https://github.com/ikawrakow/ik_llama.cpp/tree/main/examples/sweep-bench) for benchmarking and for seeing how much your performance drops with context (it even comes with its own plotting tool).
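Purely as an illustration, a sweep-bench run with a larger ubatch size might look like the sketch below; the model path is a placeholder, and the option names (`-c`, `-t`, `-ub`, `-fa`) are assumed to follow the usual llama.cpp-style common arguments, so check `--help` on your build first.

```bash
# Hedged sketch: measure PP/TG speed as the KV cache fills, with a larger ubatch size
# (model path and exact options are assumptions; verify with ./build/bin/llama-sweep-bench --help)
./build/bin/llama-sweep-bench -m ./models/model.gguf -c 16384 -t 64 -ub 2048 -fa
```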
> We are using Azure Cobalt ARM CPUs on spot VMs (64 real cores, 512 GB of 12-channel, very fast RAM) for 0.5 USD/hour (!)

I was going to suggest going to the 48-core 384 GB version, since DeepSeek would still fit, but looking at the spot prices the 64-core one is cheaper. (I did find certain regions where it goes down to $0.413.)

By my math that does seem a bit cheaper than most inference providers (even using your cost), but I think your cost advantage goes away as performance drops with growing context.

> our colleagues can use DeepSeek privately for 80 USD/month, continuously and without limits

If your use case allows for it, you may be able to get better performance with batching, so that multiple people can be served by a single instance. The performance of that can be seen with [batched-bench](https://github.com/ikawrakow/ik_llama.cpp/tree/main/examples/batched-bench).
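To make the batching idea concrete, one option is a single `llama-server` instance with several parallel slots; this is only a sketch, the model path is a placeholder, and the slot count, thread count, and context size are assumptions to be tuned (the total context is shared across the slots).

```bash
# Hedged sketch: serve several users from one instance via parallel slots
# (-np sets the number of slots; -c is the total KV cache shared between them)
./build/bin/llama-server -m ./models/model.gguf -c 32768 -np 4 -t 64 --host 0.0.0.0 --port 8080
```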
---
👤 **ikawrakow** commented the **2025-06-26** at **06:49:28**:<br>

No need to keep this open.