
🐛 #539 - Bug: garbage output

Author jagusztinl
State Closed
Created 2025-06-19
Updated 2025-06-26

Description

What happened?

Please help: I have tried several models but there is no meaningful output (CLI and server behave the same, with or without -rtr):

@gpt:/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-Q4_0.gguf --prompt "What is the meaning of life?" Log start main: build = 3751 (8b3002bb) main: built with cc (Ubuntu 14.2.0-4ubuntu224.04) 14.2.0 for aarch64-linux-gnu main: seed = 1750314253 llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-Q4_0.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It llama_model_loader: - kv 3: general.finetune str = it llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It llama_model_loader: - kv 5: general.quantized_by str = Unsloth llama_model_loader: - kv 6: general.size_label str = 27B llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 llama_model_loader: - kv 10: gemma3.block_count u32 = 62 llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128 llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16 llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 llama_model_loader: - kv 21: tokenizer.ggml.model str = llama llama_model_loader: - kv 22: tokenizer.ggml.pre str = default llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 34: general.quantization_version u32 = 2 llama_model_loader: - kv 35: general.file_type u32 = 2 llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434 llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q4_0: 427 tensors llama_model_loader: - type q4_1: 7 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 6415 llm_load_vocab: token to piece cache size = 1.9446 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 262208 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5376 llm_load_print_meta: n_layer = 62 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 1024 llm_load_print_meta: n_swa_pattern = 6 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 21504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 0.125 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 27.009 B llm_load_print_meta: model size = 14.539 GiB (4.624 BPW) llm_load_print_meta: general.name = Gemma-3-27B-It llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 106 '<end_of_turn>' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 248 '<0x0A>' llm_load_print_meta: EOT token = 106 '<end_of_turn>' llm_load_print_meta: max token length = 48 llm_load_tensors: ggml ctx size = 0.35 MiB llm_load_tensors: CPU buffer size = 14888.20 MiB ......................................................................................... 
llama_new_context_with_model: n_ctx = 131072 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 0.125 llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB llama_new_context_with_model: CPU output buffer size = 1.00 MiB llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB llama_new_context_with_model: graph nodes = 2052 llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1

What is the meaning of life?[multimodal][multimodal][multimodal][multimodal][multimodal]

OR

alerant@gpt:/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen --prompt "What is the meaning of life?" Qwen2.5-Coder-32B-Instruct-Q4_0.gguf Qwen3-32B-Q4_0.gguf alerant@gpt:/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life?" Log start main: build = 3751 (8b3002bb) main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu main: seed = 1750314509 llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-32B llama_model_loader: - kv 3: general.basename str = Qwen3-32B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 32B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3.block_count u32 = 64 llama_model_loader: - kv 8: qwen3.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120 llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600 llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64 llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128 llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128 llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - kv 27: general.file_type u32 = 2 llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448 llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685 llama_model_loader: - type f32: 257 tensors llama_model_loader: - type q4_0: 441 tensors llama_model_loader: - type q4_1: 8 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 64 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 25600 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 32.762 B llm_load_print_meta: model size = 17.413 GiB (4.566 BPW) llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters) llm_load_print_meta: general.name = Qwen3-32B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.32 MiB llm_load_tensors: CPU buffer size = 17830.96 MiB ................................................................................................. 
llama_new_context_with_model: n_ctx = 40960 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB llama_new_context_with_model: CPU output buffer size = 0.58 MiB llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB llama_new_context_with_model: graph nodes = 1989 llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0

What is the meaning of life?*:F+=@*GB&-4%G0'B$4HF;@E(H(C6;()@:%'8"4<-HC.&$G>)$2)536.).C5346=D=6;C41AD@BD&6D';-.:G1+;=;C!+7;A>!+:8DG466)+9#:<99)3
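For reference, a minimal deterministic repro along the lines of the commands above makes the runs comparable across builds and quant types. This is only a sketch: -m, -p, --temp, -n and -t are the standard llama-cli options, and the model path is the one already used in this report. With temperature 0 a healthy build should print the same coherent continuation on every run:

../ik_llama.cpp/build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf -p "What is the meaning of life?" --temp 0 -n 64 -t 64

If the Q4_0 file still prints random symbols under these settings while an IQ4_XS file of the same model does not, that points at the quant-specific CPU path rather than at sampling.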

Name and Version

version: 3751 (8b3002bb) built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output



💬 Conversation

👤 jagusztinl commented on 2025-06-19 at 08:40:53:

I tried IQ4_XS models (Gemma) and they work perfectly, so maybe Q4_0 is bad. But with IQ4_XS and -rtr it is garbage again. What am I missing?

(venv) alerant@gpt:/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-IQ4_XS.gguf -rtr --prompt "What is the meaning of life? In english please" Log start main: build = 3751 (8b3002bb) main: built with cc (Ubuntu 14.2.0-4ubuntu224.04) 14.2.0 for aarch64-linux-gnu main: seed = 1750322313 llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-IQ4_XS.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It llama_model_loader: - kv 3: general.finetune str = it llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It llama_model_loader: - kv 5: general.quantized_by str = Unsloth llama_model_loader: - kv 6: general.size_label str = 27B llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 llama_model_loader: - kv 10: gemma3.block_count u32 = 62 llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128 llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16 llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 llama_model_loader: - kv 21: tokenizer.ggml.model str = llama llama_model_loader: - kv 22: tokenizer.ggml.pre str = default llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 34: general.quantization_version u32 = 2 llama_model_loader: - kv 35: general.file_type u32 = 30 llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434 llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q6_K: 1 tensors llama_model_loader: - type iq4_xs: 434 tensors llm_load_vocab: special tokens cache size = 6415 llm_load_vocab: token to piece cache size = 1.9446 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 262208 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5376 llm_load_print_meta: n_layer = 62 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 1024 llm_load_print_meta: n_swa_pattern = 6 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 21504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 0.125 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw llm_load_print_meta: model params = 27.009 B llm_load_print_meta: model size = 13.747 GiB (4.372 BPW) llm_load_print_meta: general.name = Gemma-3-27B-It llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 106 '<end_of_turn>' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 248 '<0x0A>' llm_load_print_meta: EOT token = 106 '<end_of_turn>' llm_load_print_meta: max token length = 48 llm_load_tensors: ggml ctx size = 0.35 MiB llm_load_tensors: CPU buffer size = 15179.85 MiB ........................................................................................ 
============ Repacked 434 tensors llama_new_context_with_model: n_ctx = 131072 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 0.125 llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB llama_new_context_with_model: CPU output buffer size = 1.00 MiB llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB llama_new_context_with_model: graph nodes = 2052 llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1

What is the meaning of life? In english please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please
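Since the IQ4_XS file only misbehaves once -rtr (run-time repacking) is enabled, a simple A/B run of the same prompt with and without the flag should show whether the repacked kernels are where the output diverges. A sketch only, using the flags already appearing in this issue plus the standard --temp and -n options:

../ik_llama.cpp/build/bin/llama-cli -m gemma-3-27b-it-IQ4_XS.gguf -p "What is the meaning of life?" --temp 0 -n 32
../ik_llama.cpp/build/bin/llama-cli -m gemma-3-27b-it-IQ4_XS.gguf -p "What is the meaning of life?" --temp 0 -n 32 -rtr

If the repacked path is working correctly, the two runs should produce essentially the same text; coherent output without -rtr and degenerate or garbled output with it points at the interleaved kernels used after run-time repacking.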


👤 ikawrakow commented on 2025-06-19 at 08:53:16:

Can you try the latest build?


👤 jagusztinl commented on 2025-06-20 at 08:01:04:

Same, please help:

:/models$ uname -a
Linux gpt 6.11.0-1015-azure #15~24.04.1-Ubuntu SMP Thu May 1 03:01:44 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

:/models$ gcc --version
gcc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)
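One thing worth double-checking in this recipe: with the default Makefile/Ninja generators on Linux, --config Release has no effect, since the build type is fixed at configure time. Whether that matters here depends on whether ik_llama.cpp sets a default build type, but setting it explicitly rules it out. A sketch using only the flags above plus the generic -DCMAKE_BUILD_TYPE option:

cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build ./build -j $(nproc)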

/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life? In english please" Log start main: build = 3762 (1843ed22) main: built with cc (Ubuntu 14.2.0-4ubuntu224.04) 14.2.0 for aarch64-linux-gnu main: seed = 1750406253 llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-32B llama_model_loader: - kv 3: general.basename str = Qwen3-32B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 32B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3.block_count u32 = 64 llama_model_loader: - kv 8: qwen3.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120 llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600 llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64 llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128 llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128 llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - kv 27: general.file_type u32 = 2 llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448 llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685 llama_model_loader: - type f32: 257 tensors llama_model_loader: - type q4_0: 441 tensors llama_model_loader: - type q4_1: 8 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 64 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 25600 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 32.762 B llm_load_print_meta: model size = 17.413 GiB (4.566 BPW) llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters) llm_load_print_meta: general.name = Qwen3-32B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.32 MiB llm_load_tensors: CPU buffer size = 17830.96 MiB ................................................................................................. 
llama_new_context_with_model: n_ctx = 40960 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB llama_new_context_with_model: CPU output buffer size = 0.58 MiB llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB llama_new_context_with_model: graph nodes = 1989 llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000 sampling order: CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0

What is the meaning of life? In english please-E4>6'236,(=+G7(@G>H$8,<F*("-D#'6:FC6.!+;1CF(B%D!-1@;8)((2+/5=>$,",E0CC*"B"61(F6<'8-,B9&


👤 jagusztinl commented on 2025-06-20 at 12:54:53:

FYI, I had these warnings during compilation:

[ 16%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o [ 16%] Built target build_info In function SHA1Update, inlined from SHA1Final at /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5: /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: SHA1Transform reading 64 bytes from a region of size 0 [-Wstringop-overread] 219 | SHA1Transform(context->state, &data[i]); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type const unsigned char[64] /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function SHA1Final: /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function SHA1Transform 54 | void SHA1Transform( | ^~~~~~~~~~~~~ In function SHA1Update, inlined from SHA1Final at /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:269:9: /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: SHA1Transform reading 64 bytes from a region of size 0 [-Wstringop-overread] 219 | SHA1Transform(context->state, &data[i]); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type const unsigned char[64] /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function SHA1Final: /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function SHA1Transform 54 | void SHA1Transform( | ^~~~~~~~~~~~~ [ 16%] Built target sha1 [ 16%] Built target sha256 In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_128_128.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_96_96.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: 
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_256_256.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_64_64.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ 
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_192_128.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_576_512.cpp:5: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: 
dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:1119: /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ40::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 534 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperQ41::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 578 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 579 | auto vm = F16::set1((const float16_t )&dl->m); | ^~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h: In member function void {anonymous}::HelperIQ4nl::load(int, int, {anonymous}::F16::Data&, {anonymous}::F16::Data&) const: /home/alerant/ik_llama.cpp/ggml/src/iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 632 | auto vd = F16::set1((const float16_t )&dl->d); | ^~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.h:3, from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:23: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h: At global scope: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 831 | static IQK_ALWAYS_INLINE 
int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.h:3, from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.h:3, from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:1: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: always_inline function might not be inlinable 
unless also declared inline [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_ktquants.cpp:1: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp: In function void iqk_gemm_default_floats(int, int, const char, size_t, DataInfo&, int): /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1039:34: warning: this statement may fall through [-Wimplicit-fallthrough=] 1039 | case 1: mm_helper<1>(D, nq, cx, bx, info, k_step); | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:13: note: here 1040 | case 2: mm_helper<2>(D, nq, cx, bx, info, k_step); | ^~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:34: warning: this statement may fall through [-Wimplicit-fallthrough=] 1040 | case 2: mm_helper<2>(D, nq, cx, bx, info, k_step); | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1041:13: note: here 1041 | default: mm_helper<3>(D, nq, cx, bx, info, k_step); | ^~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_iquants.h:3, from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_iquants.cpp:1: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: 
always_inline function might not be inlinable unless also declared inline [-Wattributes]
  840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes]
  831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes]
  818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes]
  805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes]
  851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_kquants.cpp:3082:24: warning: always_inline function might not be inlinable unless also declared inline [-Wattributes]
 3082 | IQK_ALWAYS_INLINE void prepare_q4_k_quants(const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
[... the same -Wattributes warnings are repeated for every translation unit that includes iqk_common.h ...]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp: In function 'void {anonymous}::mul_mat_iq1bn_q8_K64(int, const void*, size_t, const DataInfo&, int) [with int nrc_y = 1]':
/home/alerant/ik_llama.cpp/ggml/src/./ggml-impl.h:408:42: warning: iteration 2 invokes undefined behavior [-Waggressive-loop-optimizations]
  408 | #define ggml_vdotq_s32(a, b, c) vdotq_s32(a, b, c)
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2015:31: note: in expansion of macro 'ggml_vdotq_s32'
 2015 | accd[0] = ggml_vdotq_s32(accd[0], q.val[j], v1.val[j]);
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2014:35: note: within this loop
 2014 | for (int j = 0; j < 4; ++j) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of 'void {anonymous}::QuantizerIQKT<block_size, group_size, num_bits, is_abs, is_int>::find_best_match(float, const float*, const float*, int*) const' [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = false; bool is_int = true]:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8067:38: required from here
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable 'ncluster' [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable 'id' [-Wunused-variable]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter 'xb' [-Wunused-parameter]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter 'weight' [-Wunused-parameter]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7514:25: note: parameter passing for argument of type 'std::pair<float, float>' when C++17 is enabled changed to match C++14 in GCC 10.1
[... the same unused-variable/parameter warnings are repeated for the other QuantizerIQKT instantiations ...]
[ 16%] Built target xxhash
[ 16%] Linking CXX shared library libggml.so
[ 16%] Built target ggml
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 18%] Building CXX object examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/gguf-hash.cpp.o
[ 19%] Building CXX object examples/gguf/CMakeFiles/llama-gguf.dir/gguf.cpp.o
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 21%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
[ 21%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 22%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 22%] Linking CXX executable ../../bin/llama-gguf
[ 22%] Built target llama-gguf
[ 23%] Linking CXX executable ../../bin/llama-gguf-hash
[ 23%] Built target llama-gguf-hash
^Cgmake[2]: *** [src/CMakeFiles/llama.dir/build.make:76: src/CMakeFiles/llama.dir/llama.cpp.o] Interrupt
gmake[1]: *** [CMakeFiles/Makefile2:1647: src/CMakeFiles/llama.dir/all] Interrupt
gmake: *** [Makefile:146: all] Interrupt

alerant@gpt:/ik_llama.cpp$ cmake --build ./build --config Release -j $(nproc)
[ 1%] Built target build_info
[ 2%] Built target sha256
[ 3%] Built target xxhash
[ 3%] Built target sha1
[ 16%] Built target ggml
[ 18%] Built target llama-gguf-hash
[ 19%] Built target llama-gguf
[ 20%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 20%] Linking CXX shared library libllama.so
[ 23%] Built target llama
[ 24%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 25%] Building CXX object examples/benchmark/CMakeFiles/llama-bench-matmult.dir/benchmark-matmult.cpp.o
[ 26%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 26%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 27%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 28%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
[ 29%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 30%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 31%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 31%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 32%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 33%] Linking C executable ../bin/test-c
[ 33%] Built target test-c
/usr/include/c++/14/bits/stl_pair.h: In instantiation of 'constexpr std::pair<...> std::make_pair(_T1&&, _T2&&)' [with _T1 = float; _T2 = float]:
/home/alerant/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:392:68: required from here
  392 | std::vector<std::pair<float, float>> range(ndim, std::make_pair(INFINITY, -INFINITY));
/usr/include/c++/14/bits/stl_pair.h:1132:5: note: parameter passing for argument of type 'std::pair<float, float>' when C++17 is enabled changed to match C++14 in GCC 10.1
[ 33%] Linking CXX executable ../../bin/llama-bench-matmult
[ 33%] Built target llama-bench-matmult
[ 33%] Linking CXX executable ../../bin/llama-quantize-stats
[ 33%] Built target llama-quantize-stats
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h: In function 'int stbi__parse_png_file(stbi__png*, int, int)':
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h:5450:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 5450 | tc[k] = (stbi_uc)(stbi__get16be(s) & 255) * stbi__depth_scale_table[z->depth]; // non 8-bit images will be larger
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h:5326:28: note: at offset 3 into destination object 'tc' of size 3
 5326 | stbi_uc has_trans = 0, tc[3] = {0};
[ 33%] Built target llava


👤 jagusztinl commented the 2025-06-20 at 14:04:07:

Fixed: build with -DGGML_SVE=ON solved it
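
For anyone hitting the same problem, a minimal sketch of the rebuild, assuming a clean build directory; only `-DGGML_SVE=ON` is the essential part, the rest mirrors the Release build used earlier in this thread:

```bash
# Reconfigure with SVE enabled and rebuild; paths and job count are illustrative.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_SVE=ON
cmake --build build --config Release -j "$(nproc)"
```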

But inference is not faster for any model than the current llama.cpp build on ARM CPU (PP is better):

For example, on the same server:

llama.cpp:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | tg128 | 10.97 ± 0.07 |

ik_llama.cpp:

============ Repacked 611 tensors

| model | size | params | backend | threads | type_k | type_v | fa | mla | amb | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | pp512 | 70.30 ± 0.08 |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | tg128 | 9.59 ± 0.02 |
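
For context, a plausible llama-bench invocation that would produce rows like the ik_llama.cpp table above; the flag spellings for the fork-specific options (-mla, -amb, -rtr, -fmoe) are assumed from the column names, and the model path is a placeholder:

```bash
# Placeholder model path; fa/mla/amb/rtr/fmoe/type_k/type_v map to the table columns above.
./build/bin/llama-bench -m /models/deepseek-671b-q4_k_r4.gguf \
    -t 64 -fa 1 -mla 3 -amb 2048 -rtr 1 -fmoe 1 \
    -ctk q8_0 -ctv q8_0 -p 512 -n 128
```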


👤 ikawrakow commented the 2025-06-20 at 15:50:59:

You never mentioned you are using an ARM CPU. Unlike llama.cpp, nothing is automatically set for you on ARM. It is likely you need to set arch options manually. -DGGML_SVE=ON solving your issues sounds strange to me, as no usage is made of SVE anywhere in ik_llama.cpp. The only ARM implementation that exists is NEON.

A 60% difference in PP performance is not faster in your book? And that is for the quant receiving the most love in mainline llama.cpp, with special-purpose GEMM and GEMV implementations for ARM CPUs.

Also, PP-512 and TG-128 are very misleading measures of performance. When is it in real usage that I have zero tokens in the KV cache? Try running with something more significant in the KV cache (8k-18k tokens) and see how that goes. You may also want to try some of the i-quants.
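
As a hedged example of benchmarking with a non-empty KV cache, assuming this fork's llama-bench still carries mainline's -gp option (prompt tokens, then generated tokens at that depth); the model path and the extra flags are placeholders carried over from the runs above:

```bash
# -gp 8192,128: fill the KV cache with an 8k-token prompt, then measure generation speed
# at that depth, which is closer to real usage than tg128 from an empty cache.
./build/bin/llama-bench -m /models/deepseek-671b-q4_k_r4.gguf \
    -t 64 -fa 1 -mla 3 -gp 8192,128
```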

But overall, yes, ARM CPUs are not a big focus of this project. I maintain it in a functional state, but haven't updated the ARM implementation for quite some time. It is missing the massive PP performance gains that I got on AVX2 during the last 2-3 weeks.


👤 ikawrakow commented the 2025-06-20 at 15:59:12:

Oh, what is the CPU you are using?



👤 jagusztinl commented the 2025-06-21 at 08:39:04:

Thank you for your answer; a bit more detail about the project: we are using Azure Cobalt ARM CPUs on spot VMs (64 real cores, 512 GB of very fast 12-channel RAM) for 0.5 USD/hour (!) instead of expensive GPU setups. The price/performance ratio is unbeatable: our colleagues can use DeepSeek privately for 80 USD/month, continuously and without limits. We experimented with llama.cpp as the fastest inference engine, with this setup (optimized for Cobalt and linked with the ARM performance libraries):

cmake -DGGML_CPU_KLEIDIAI=ON -DCMAKE_CXX_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm "

ggml detection results: Adding CPU backend variant ggml-cpu: -mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme

The best result with llama.cpp was the following. It is usable, but we are looking for better performance, which is why we turned to your project:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | tg128 | 10.97 ± 0.07 |

Please advise how we can further optimize DeepSeek inference with your solution.


👤 jagusztinl commented the 2025-06-21 at 08:47:17:

About the garbage problem: if I do not use -DGGML_SVE=ON during compilation, SVE is not detected:

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

instead of:

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |

This is the root cause of the garbage output on this server.


👤 ikawrakow commented the 2025-06-21 at 09:22:44:

I'm open to working on optimizing this project for SVE, but it is a hobby project of mine without commercial backing, so I develop/test on the CPU platforms I have access to (AVX2, Zen4, ARM_NEON on an M2-Max CPU).

What are you looking to optimize? I read somewhere that the "typical enterprise" workflow (whatever that means) involves processing N token prompts and then generating a response with N/10 tokens. Or are the prompts of your customers really short, but they are looking for long answers, so TG speed is all that matters? What about context? Your customers never have a longer exchange with the LLM but always just ask a single short question, get the answer, and close the session?


👤 saood06 commented the 2025-06-21 at 16:16:04:

So can you try experimenting with -DGGML_ARCH_FLAGS=, added by #347? Some users have had success with it; see https://github.com/ikawrakow/ik_llama.cpp/issues/345#issuecomment-2831460138. It looks like you have already done similar experimenting when optimizing llama.cpp.


👤 jagusztinl commented the 2025-06-23 at 15:34:50:

Using this: cmake -B ./build -DGGML_LTO=ON -DCMAKE_CXX_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DGGML_ARCH_FLAGS="-mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme"

ik_llama.cpp is the winner :-)

| model | size | params | backend | threads | type_k | type_v | fa | mla | amb | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 68.19 ± 0.16 |
| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 11.54 ± 0.07 |


👤 saood06 commented the 2025-06-23 at 20:40:31:

> ik_llama.cpp is winner :-)

Glad you found some settings that made it perform well for you.

Why are you using MLA 2 now instead of 3 like you were previously (assuming the headers stayed the same)? Also, two tips: a high ubatch size can boost PP (assuming you can make use of those larger batches), and you can use sweep-bench for benchmarking to see how much your performance drops with context (it even comes with its own plotting tool).
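
A sketch of such a sweep, assuming the flag names follow the ik_llama.cpp command line shown earlier in the thread; the model path, context size, and ubatch value are illustrative:

```bash
# Sweeps PP/TG throughput at increasing KV-cache depths up to -c; a larger -ub raises the
# prompt batch size, which often helps PP on large MoE models.
./build/bin/llama-sweep-bench -m /models/deepseek-671b-q4_k_r4.gguf \
    -t 64 -fa -mla 3 -amb 2048 -fmoe -rtr \
    -c 16384 -ub 2048
```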

> We are using Azure Cobalt ARM CPUs on spot VMs, (64 real core, 512Gb 12 channel very fast RAM) for 0.5USD/hour (!)

I was going to suggest going to the 48-core 384 GB version since DeepSeek would still fit, but looking at the spot price the 64-core is cheaper. (I did find certain regions where it goes down to $0.413.)

By my math that does seem a bit cheaper than most inference providers (even using your cost), but I think your cost advantage goes away as performance will drop as context climbs.

> our collegues can use DeepSeek privately for 80USD/month continuously without limits

If your use case allows for it, you may be able to get better performance with batching; that way multiple people can be served by a single instance. Batched performance can be measured with batched-bench.
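
As a sketch of the batching idea, one option is to run the server with several parallel slots so that concurrent users share one loaded model; the slot count and context split below are illustrative, and whether it pays off should be verified with batched-bench:

```bash
# Illustrative: 4 parallel slots share one model instance; the 32k total context is split
# across slots, so each concurrent request effectively gets about 8k tokens of context.
./build/bin/llama-server -m /models/deepseek-671b-q4_k_r4.gguf \
    -t 64 -fa -mla 3 -amb 2048 -fmoe \
    -c 32768 -np 4
```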


👤 ikawrakow commented the 2025-06-26 at 06:49:28:

No need to keep this open.