### 🐛 [#539](https://github.com/ikawrakow/ik_llama.cpp/issues/539) - Bug: garbage output

| **Author** | `jagusztinl` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-19 |
| **Updated** | 2025-06-26 |

---

#### Description

### What happened?

Please help: I have tried several models, but there is no meaningful output (llama-cli and llama-server behave the same, with or without `-rtr`):

```
@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-Q4_0.gguf --prompt "What is the meaning of life?"
Log start
main: build = 3751 (8b3002bb)
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
main: seed = 1750314253
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 34: general.quantization_version u32 = 2 llama_model_loader: - kv 35: general.file_type u32 = 2 llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434 llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q4_0: 427 tensors llama_model_loader: - type q4_1: 7 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 6415 llm_load_vocab: token to piece cache size = 1.9446 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 262208 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5376 llm_load_print_meta: n_layer = 62 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 1024 llm_load_print_meta: n_swa_pattern = 6 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 21504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 0.125 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 27.009 B llm_load_print_meta: model size = 14.539 GiB (4.624 BPW) llm_load_print_meta: general.name = Gemma-3-27B-It llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 106 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 248 '<0x0A>' llm_load_print_meta: EOT token = 106 '' llm_load_print_meta: max token length = 48 llm_load_tensors: ggml ctx size = 0.35 MiB llm_load_tensors: CPU buffer size = 14888.20 MiB ......................................................................................... 
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 0.125
llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB
llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.00 MiB
llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB
llama_new_context_with_model: graph nodes = 2052
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
	xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1

What is the meaning of life?[multimodal][multimodal][multimodal][multimodal][multimodal]
```

OR

```
alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen --prompt "What is the meaning of life?"
Qwen2.5-Coder-32B-Instruct-Q4_0.gguf  Qwen3-32B-Q4_0.gguf
alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life?"
Log start
main: build = 3751 (8b3002bb)
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
main: seed = 1750314509
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3-32B llama_model_loader: - kv 3: general.basename str = Qwen3-32B llama_model_loader: - kv 4: general.quantized_by str = Unsloth llama_model_loader: - kv 5: general.size_label str = 32B llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: qwen3.block_count u32 = 64 llama_model_loader: - kv 8: qwen3.context_length u32 = 40960 llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120 llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600 llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64 llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128 llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128 llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654 llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - kv 27: general.file_type u32 = 2 llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448 llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685 llama_model_loader: - type f32: 257 tensors llama_model_loader: - type q4_0: 441 tensors llama_model_loader: - type q4_1: 8 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 64 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 25600 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 32.762 B llm_load_print_meta: model size = 17.413 GiB (4.566 BPW) llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters) llm_load_print_meta: general.name = Qwen3-32B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.32 MiB llm_load_tensors: CPU buffer size = 17830.96 MiB ................................................................................................. 
llama_new_context_with_model: n_ctx = 40960
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB
llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB
llama_new_context_with_model: graph nodes = 1989
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
	xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0

What is the meaning of life?*:F+=@*GB&-4%G0'B$4HF;@E(H(C6;()@:%'8"4<-HC.&$G>)$2)536.).C5346=D=6;C41AD@BD&6D';-.:G1+;=;C!+7;A>!+:8DG466)+9#:<99)3
```

### Name and Version

version: 3751 (8b3002bb)
built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu

### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
```

---

#### 💬 Conversation

👤 **jagusztinl** commented the **2025-06-19** at **08:40:53**:
I tried an IQ4_XS model (Gemma) and it works perfectly, so maybe the Q4_0 file is bad. But with IQ4_XS and `-rtr` I get garbage again. What am I missing?

```
(venv) alerant@gpt:~/models$ ../ik_llama.cpp//build/bin/llama-cli -m gemma-3-27b-it-IQ4_XS.gguf -rtr --prompt "What is the meaning of life? In english please"
Log start
main: build = 3751 (8b3002bb)
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
main: seed = 1750322313
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from gemma-3-27b-it-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 34: general.quantization_version u32 = 2 llama_model_loader: - kv 35: general.file_type u32 = 30 llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-27b-it-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-27b-it.txt llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 434 llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q6_K: 1 tensors llama_model_loader: - type iq4_xs: 434 tensors llm_load_vocab: special tokens cache size = 6415 llm_load_vocab: token to piece cache size = 1.9446 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 262208 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5376 llm_load_print_meta: n_layer = 62 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 1024 llm_load_print_meta: n_swa_pattern = 6 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 21504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 0.125 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw llm_load_print_meta: model params = 27.009 B llm_load_print_meta: model size = 13.747 GiB (4.372 BPW) llm_load_print_meta: general.name = Gemma-3-27B-It llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 106 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 248 '<0x0A>' llm_load_print_meta: EOT token = 106 '' llm_load_print_meta: max token length = 48 llm_load_tensors: ggml ctx size = 0.35 MiB llm_load_tensors: CPU buffer size = 15179.85 MiB ........................................................................................ 
============ Repacked 434 tensors
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 0.125
llama_kv_cache_init: CPU KV buffer size = 63488.00 MiB
llama_new_context_with_model: KV self size = 63488.00 MiB, K (f16): 31744.00 MiB, V (f16): 31744.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.00 MiB
llama_new_context_with_model: CPU compute buffer size = 8743.51 MiB
llama_new_context_with_model: graph nodes = 2052
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
	xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> top_n_sigma -> temperature
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1

What is the meaning of life? In english please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please please
```

---

👤 **ikawrakow** commented the **2025-06-19** at **08:53:16**:
Can you try the latest build?

---

👤 **jagusztinl** commented the **2025-06-20** at **08:01:04**:
Same, please help:

```
:~/models$ uname -a
Linux gpt 6.11.0-1015-azure #15~24.04.1-Ubuntu SMP Thu May 1 03:01:44 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
:~/models$ gcc --version
gcc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)

~/models$ ../ik_llama.cpp//build/bin/llama-cli -m Qwen3-32B-Q4_0.gguf --prompt "What is the meaning of life? In english please"
Log start
main: build = 3762 (1843ed22)
main: built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
main: seed = 1750406253
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from Qwen3-32B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-32B
llama_model_loader: - kv 3: general.basename str = Qwen3-32B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3.block_count u32 = 64
llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - kv 27: general.file_type u32 = 2 llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448 llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685 llama_model_loader: - type f32: 257 tensors llama_model_loader: - type q4_0: 441 tensors llama_model_loader: - type q4_1: 8 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 26 llm_load_vocab: token to piece cache size = 0.9311 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen3 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 40960 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 64 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 25600 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 40960 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 32.762 B llm_load_print_meta: model size = 17.413 GiB (4.566 BPW) llm_load_print_meta: repeating layers = 16.411 GiB (4.517 BPW, 31.206 B parameters) llm_load_print_meta: general.name = Qwen3-32B llm_load_print_meta: BOS token = 11 ',' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151654 '<|vision_pad|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.32 MiB llm_load_tensors: CPU buffer size = 17830.96 MiB ................................................................................................. 
llama_new_context_with_model: n_ctx = 40960
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 10240.00 MiB
llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 5252.01 MiB
llama_new_context_with_model: graph nodes = 1989
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
	xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
sampling order: CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature
generate: n_ctx = 40960, n_batch = 2048, n_predict = -1, n_keep = 0

What is the meaning of life? In english please-E4>6'236,(=+G7(@G>H$8,$,",E0CC*"B"61(F6<'8-,B9&
```

---

👤 **jagusztinl** commented the **2025-06-20** at **12:54:53**:
FYI, I had these warnings during compilation. The strict-aliasing and `always_inline` warnings repeat for every `iqk_fa_*.cpp` and `iqk_gemm_*.cpp` translation unit that pulls in `iqk_fa_templates.h` / `iqk_common.h`; each distinct warning is shown once below:

```
[ 16%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
[ 16%] Built target build_info
In function ‘SHA1Update’,
    inlined from ‘SHA1Final’ at /home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5:
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
  219 |     SHA1Transform(context->state, &data[i]);
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
/home/alerant/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
[ 16%] Built target sha1
[ 16%] Built target sha256

/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:534:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
  534 |     auto vd = F16::set1(*(const float16_t *)&dl->d);
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:578:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
  578 |     auto vd = F16::set1(*(const float16_t *)&dl->d);
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:579:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
  579 |     auto vm = F16::set1(*(const float16_t *)&dl->m);
/home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:632:30: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
  632 |     auto vd = F16::set1(*(const float16_t *)&dl->d);

/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
  851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) {
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_kquants.cpp:3082:24: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes]
 3082 | IQK_ALWAYS_INLINE void prepare_q4_k_quants(const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) {

/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp: In function ‘void iqk_gemm_default_floats(int, int, const char*, size_t, DataInfo&, int)’:
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1039:34: warning: this statement may fall through [-Wimplicit-fallthrough=]
 1039 |     case 1: mm_helper<1>(D, nq, cx, bx, info, k_step);
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:13: note: here
 1040 |     case 2: mm_helper<2>(D, nq, cx, bx, info, k_step);
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1040:34: warning: this statement may fall through [-Wimplicit-fallthrough=]
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_floats.cpp:1041:13: note: here
 1041 |     default: mm_helper<3>(D, nq, cx, bx, info, k_step);

/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp: In function ‘void {anonymous}::mul_mat_iq1bn_q8_K64(int, const void*, size_t, const DataInfo&, int) [with int nrc_y = 1]’:
/home/alerant/ik_llama.cpp/ggml/src/./ggml-impl.h:408:42: warning: iteration 2 invokes undefined behavior [-Waggressive-loop-optimizations]
  408 | #define ggml_vdotq_s32(a, b, c) vdotq_s32(a, b, c)
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2015:31: note: in expansion of macro ‘ggml_vdotq_s32’
 2015 |     accd[0] = ggml_vdotq_s32(accd[0], q.val[j], v1.val[j]);
/home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_1bit.cpp:2014:35: note: within this loop
 2014 |     for (int j = 0; j < 4; ++j) {
```
^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_gemm_floats.h:3, from /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:23: /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h: At global scope: /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_gemm_floats.h:3, from /home/alerant/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:23: /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h: At global scope: /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const 
int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/./iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 805 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = false; bool is_int = true]’: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8067:38: required from here 8067 | quantizer.find_best_match( amax/scale_0, xb, weight, best_idx); | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable] 7585 | int ncluster = m_clusters.size()/kGroupSize; | ^~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable ‘id’ [-Wunused-variable] 7586 | float id = 1/d; | ^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘std::pair {anonymous}::QuantizerIQKT::find_best_scale(const float*, const float*, const int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = false; bool is_int = true]’: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8068:59: required from here 8068 | auto [dp, score_p] = quantizer.find_best_scale(xb, weight, best_idx); | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7514:25: note: parameter passing for argument of type ‘std::pair’ when C++17 is enabled changed to match C++14 in GCC 10.1 7514 | std::pair QuantizerIQKT::find_best_scale( | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 8; int num_bits = 16; bool is_abs = true; bool is_int = true]’: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8367:42: required from here 8367 | quantizer.find_best_match(amax/(scale_0 + kStep*itry), xaux, weight, best_idx); | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable] 7585 | int ncluster = m_clusters.size()/kGroupSize; | ^~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: 
warning: unused variable ‘id’ [-Wunused-variable] 7586 | float id = 1/d; | ^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In instantiation of ‘void {anonymous}::QuantizerIQKT::find_best_match(float, const float*, const float*, int*) const [with int block_size = 32; int group_size = 4; int num_bits = 15; bool is_abs = false; bool is_int = true]’: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8642:43: required from here 8642 | quantizer1.find_best_match( amax/(8.f*itry + scale_0), xaux, weight, best_idx); | ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7585:9: warning: unused variable ‘ncluster’ [-Wunused-variable] 7585 | int ncluster = m_clusters.size()/kGroupSize; | ^~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7586:11: warning: unused variable ‘id’ [-Wunused-variable] 7586 | float id = 1/d; | ^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:110: warning: unused parameter ‘xb’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:7580:128: warning: unused parameter ‘weight’ [-Wunused-parameter] 7580 | void QuantizerIQKT::find_best_match(float d, const float * xb, const float * weight, int * best_idx) const { | ~~~~~~~~~~~~~~^~~~~~ In file included from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_gemm_ktquants.h:3, from /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:17: /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:851:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 851 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants_r8(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x2_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:840:31: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 840 | static IQK_ALWAYS_INLINE void prepare_iq4_nl_quants(const int8x16_t& values, const uint8x16_t& m4, const uint8x16x4_t& bits, int8x16_t * qx) { | ^~~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:831:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 831 | static IQK_ALWAYS_INLINE int32x4_t interleaved_dotq(const int8x16_t * qx, const int8x16_t& y) { | ^~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:818:38: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 818 | static IQK_ALWAYS_INLINE int32x4x2_t interleaved_dotq_b16(const int8x16_t * qx, const int8x16x2_t& y) { | ^~~~~~~~~~~~~~~~~~~~ /home/alerant/ik_llama.cpp/ggml/src/iqk/iqk_common.h:805:36: warning: ‘always_inline’ function might not be inlinable unless also declared ‘inline’ [-Wattributes] 805 | 
[ 16%] Built target xxhash
[ 16%] Linking CXX shared library libggml.so
[ 16%] Built target ggml
[ 22%] Built target llama-gguf
[ 23%] Built target llama-gguf-hash
^Cgmake[2]: *** [src/CMakeFiles/llama.dir/build.make:76: src/CMakeFiles/llama.dir/llama.cpp.o] Interrupt
gmake[1]: *** [CMakeFiles/Makefile2:1647: src/CMakeFiles/llama.dir/all] Interrupt
gmake: *** [Makefile:146: all] Interrupt

alerant@gpt:~/ik_llama.cpp$ cmake --build ./build --config Release -j $(nproc)
[ 1%] Built target build_info
[ 16%] Built target ggml
[ 23%] Built target llama
[ 33%] Built target test-c
/usr/include/c++/14/bits/stl_pair.h:1132:5: note: parameter passing for argument of type ‘std::pair’ when C++17 is enabled changed to match C++14 in GCC 10.1 (std::make_pair at examples/quantize-stats/quantize-stats.cpp:392)
[ 33%] Built target llama-bench-matmult
[ 33%] Built target llama-quantize-stats
/home/alerant/ik_llama.cpp/examples/llava/../../common/stb_image.h:5450:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=] (tc[3] in stbi__parse_png_file)
[ 33%] Built target llava

--- 👤 **jagusztinl** commented the **2025-06-20** at **14:04:07**:
Fixed: building with `-DGGML_SVE=ON` solved it.

However, inference is not faster than the current llama.cpp build on this ARM CPU for any model (PP is better). For example, on the same server:

llama.cpp:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | CPU | 99 | 1 | tg128 | 10.97 ± 0.07 |

ik_llama.cpp (`============ Repacked 611 tensors`):

| model | size | params | backend | threads | type_k | type_v | fa | mla | amb | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | pp512 | 70.30 ± 0.08 |
| deepseek2 671B Q4_K_R4 | 413.14 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 3 | 2048 | 1 | 1 | tg128 | 9.59 ± 0.02 |
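For reference, here is a sketch of the kind of `llama-bench` invocation that produces a table with the columns shown above. The model path and thread count are placeholders, and the exact flag spellings should be checked against `llama-bench --help` of the build being used:

```bash
# Hypothetical example (not taken verbatim from this thread): an ik_llama.cpp
# llama-bench run with q8_0 KV cache, flash attention, MLA, run-time repacking
# and fused MoE, measuring pp512 and tg128 as in the table above.
./build/bin/llama-bench \
  -m /path/to/deepseek-671b-q4_k_r4.gguf \
  -t 64 -ctk q8_0 -ctv q8_0 \
  -fa 1 -mla 3 -amb 2048 -rtr 1 -fmoe 1 \
  -p 512 -n 128
```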
--- 👤 **ikawrakow** commented the **2025-06-20** at **15:50:59**:
You never mentioned you are using an ARM CPU. Unlike llama.cpp, nothing is automatically set for you on ARM. It is likely you need to set the arch options manually. `-DGGML_SVE=ON` solving your issues sounds strange to me, as no use is made of SVE anywhere in `ik_llama.cpp`. The only ARM implementation that exists is NEON.

A 60% difference in PP performance is not faster in your book? And that is for the quant receiving the most love in mainline `llama.cpp`, with special-purpose GEMM and GEMV implementations for ARM CPUs.

Also, `PP-512` and `TG-128` are very misleading measures of performance. When in real usage do I have zero tokens in the KV cache? Try running with something more substantial in the KV cache (8k-18k tokens) and see how that goes. You may also want to try some of the i-quants.

But overall, yes, ARM CPUs are not a big focus of this project. I maintain it in a functional state, but haven't updated the ARM implementation for quite some time. It is missing the massive PP performance gains that I got on `AVX2` during the last 2-3 weeks.

--- 👤 **ikawrakow** commented the **2025-06-20** at **15:59:12**:
Oh, what is the CPU you are using? --- 👤 **jagusztinl** commented the **2025-06-21** at **08:39:04**:
Thank you for your answer. A slightly more detailed explanation of the project:

- We are using Azure Cobalt ARM CPUs on spot VMs (64 real cores, 512 GB of very fast 12-channel RAM) for 0.5 USD/hour (!) instead of expensive GPU setups. The price/performance ratio is unbeatable: our colleagues can use DeepSeek privately for 80 USD/month, continuously and without limits.
- We experimented with llama.cpp as the fastest inference engine, with this setup (optimized for Cobalt and linked with the ARM performance libraries):

cmake -DCMAKE_CXX_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm "

and the ggml detection result:

Adding CPU backend variant ggml-cpu: -mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme

The best result with llama.cpp was the following; it is usable, but we are looking for better performance, which is why we turned to your project:

| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | pp512 | 43.27 ± 0.16 |
| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | tg128 | 10.97 ± 0.07 |

Please advise how we can further optimize DeepSeek inference with your solution.
--- 👤 **jagusztinl** commented the **2025-06-21** at **08:47:17**:
About the garbage problem: if I do not use `-DGGML_SVE=ON` during compilation, SVE is not detected and `system_info` reports

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

instead of

system_info: n_threads = 64 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |

This is the root cause of the garbage output on this server.
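For anyone hitting the same symptom, a minimal sketch of the rebuild described above; the ARMPL link flags and `-mcpu`/`-DGGML_ARCH_FLAGS` tuning from the later comments are optional extras on top of this:

```bash
# Reconfigure with SVE enabled and rebuild (the fix reported in this comment).
cmake -B ./build -DGGML_SVE=ON
cmake --build ./build --config Release -j $(nproc)
```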
--- 👤 **ikawrakow** commented the **2025-06-21** at **09:22:44**:

I'm open to working on optimizing this project for SVE, but it is a hobby project of mine without commercial backing, so I develop and test on the CPU platforms I have access to (`AVX2`, `Zen4`, `ARM_NEON` on an M2-Max CPU).

What are you looking to optimize? I read somewhere that the "typical enterprise" workflow (whatever that means) involves processing `N`-token prompts and then generating a response of `N/10` tokens. Or are your customers' prompts really short while they are looking for long answers, so TG speed is all that matters? What about context? Do your customers never have a longer exchange with the LLM, but always just ask a single short question, get the answer, and close the session?

--- 👤 **saood06** commented the **2025-06-21** at **16:16:04**:
So, can you try experimenting with `-DGGML_ARCH_FLAGS=`, added by #347? Some users have had success with it; see https://github.com/ikawrakow/ik_llama.cpp/issues/345#issuecomment-2831460138. It looks like you have done similar experimenting when optimizing llama.cpp.

--- 👤 **jagusztinl** commented the **2025-06-23** at **15:34:50**:
Using this:

cmake -B ./build -DGGML_LTO=ON -DCMAKE_CXX_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DGGML_ARCH_FLAGS="-mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme"

ik_llama.cpp is the winner :-)

| model | size | params | backend | threads | type_k | type_v | fa | mla | amb | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 68.19 ± 0.16 |
| deepseek2 671B Q4_0 | 354.49 GiB | 672.05 B | CPU | 64 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 11.54 ± 0.07 |

--- 👤 **saood06** commented the **2025-06-23** at **20:40:31**:
> ik_llama.cpp is the winner :-)

Glad you found some settings that made it perform well for you. Why are you using MLA 2 now instead of 3, as you were previously (assuming the headers stayed the same)?

Also, two tips: a high ubatch size can boost PP (assuming you can make use of those larger batch sizes), and you can use [sweep-bench](https://github.com/ikawrakow/ik_llama.cpp/tree/main/examples/sweep-bench) for benchmarking and for seeing how much your performance drops with context (it even comes with its own plotting tool); see the sketch at the end of this comment.

> We are using Azure Cobalt ARM CPUs on spot VMs (64 real cores, 512 GB of very fast 12-channel RAM) for 0.5 USD/hour (!)

I was going to suggest the 48-core, 384 GB version, since DeepSeek would still fit, but looking at the spot price the 64-core one is cheaper. (I did find certain regions where it goes down to $0.413.) By my math that does seem a bit cheaper than most inference providers (even using your cost), but I think your cost advantage goes away as performance drops with growing context.

> our colleagues can use DeepSeek privately for 80 USD/month, continuously and without limits

If your use case allows for it, you may be able to get better performance with batching, so that multiple people can be served by a single instance. Performance of that can be seen with [batched-bench](https://github.com/ikawrakow/ik_llama.cpp/tree/main/examples/batched-bench).
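As a sketch of the suggested sweep, assuming the `llama-sweep-bench` binary from the linked examples folder and the same options used in the tables above (the binary name and flag spellings are assumptions and should be verified against `--help`):

```bash
# Hypothetical sweep-bench run: measures PP/TG throughput as the KV cache fills,
# here up to 16k context, with q8_0 KV cache, flash attention, MLA and fused MoE.
./build/bin/llama-sweep-bench \
  -m /path/to/deepseek-671b-q4_0.gguf \
  -c 16384 -t 64 \
  -ctk q8_0 -ctv q8_0 \
  -fa -mla 2 -amb 2048 -fmoe
```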
--- 👤 **ikawrakow** commented the **2025-06-26** at **06:49:28**:

No need to keep this open.