### 📝 [#271](https://github.com/ikawrakow/ik_llama.cpp/issues/271) - Possible regression computing `wk_b` tensors on the fly after PR [#265](https://github.com/ikawrakow/ik_llama.cpp/issues/265) | **Author** | `ubergarm` | | :--- | :--- | | **State** | ❌ **Closed** | | **Created** | 2025-03-19 | | **Updated** | 2025-03-24 | --- #### Description I was re-running some comparisons between my custom quant and unsloth `UD-Q2_K_XL` quant with the latest PRs. This is the same Thread Ripper Pro 24-core with 256GB RAM and RTX A6000 that I've been using. While the following command works fine on `68a5b604 Make Q8_0 KV cache work with mla=2,fa on CUDA (#264)`, it crashes after `8e549b42 Allow q8_0 cache on the CPU for FlashMLA-2 (#265)`: ```bash $ ./build/bin/llama-server --version version: 3594 (8e549b42) built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu $ CUDA_VISIBLE_DEVICES="0," \ ./build/bin/llama-server \ --alias unsloth/DeepSeek-R1-UD_Q2_K_XL \ --model /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \ -rtr \ --ctx-size 65536 \ -ctk q8_0 \ -mla 2 -fa \ -amb 512 \ -fmoe \ --n-gpu-layers 63 \ --override-tensor exps=CPU \ --parallel 1 \ --threads 24 \ --host 127.0.0.1 \ --port 8080 . . . llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q2_K: 171 tensors llama_model_loader: - type q3_K: 3 tensors llama_model_loader: - type q4_K: 306 tensors llama_model_loader: - type q6_K: 184 tensors . . . Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU llm_load_tensors: offloading 61 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 62/62 layers to GPU llm_load_tensors: CPU buffer size = 205716.00 MiB llm_load_tensors: CUDA_Host buffer size = 497.11 MiB llm_load_tensors: CUDA0 buffer size = 9885.95 MiB .................................................................................................... ============ llm_load_tensors: need to compute 61 wk_b tensors /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10624: /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10624: /home/w/projects/ik_llama.cpp/ggml/src/ggml. c:10624: GGML_ASSERT(dst->type == GGML_TYPE_F32) failed /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10624: GGML_ASSERT(dst->type == GGML_TYPE_F32) failed ``` I'll peep at the PR #265 diff, guessing an ASSERT in the code-path related to `-ctk q8_0 -mla 2` on CPU is messing with computing `wk_b` tensors on the fly even for hybrid CPU+GPU inferencing. --- #### 💬 Conversation 👤 **ikawrakow** commented the **2025-03-20** at **04:35:11**:
Yes, sorry, PR #265 broke it. But PR #269 is supposed to have fixed it. Based on the line number of the assert, the above is without #269. --- 👤 **ubergarm** commented the **2025-03-20** at **14:11:09**:
Ahh I see that PR 269 was to fix it. I should have given you the output from tip of main. It seems like an issue persists after the fix? ```bash $ ./build/bin/llama-server --version version: 3597 (127c6ee6) built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu $ CUDA_VISIBLE_DEVICES="0," \ ./build/bin/llama-server \ --alias unsloth/DeepSeek-R1-UD_Q2_K_XL \ --model /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \ -rtr \ --ctx-size 65536 \ -ctk q8_0 \ -mla 2 -fa \ -amb 512 \ -fmoe \ --n-gpu-layers 63 \ --override-tensor exps=CPU \ --parallel 1 \ --threads 24 \ --host 127.0.0.1 \ --port 8080 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes INFO [ main] build info | tid="128132524249088" timestamp=1742479612 build=3597 commit="127c6ee6" INFO [ main] system info | tid="128132524249088" timestamp=1742479612 n_threads=24 n_threads_batch=-1 total_threads=48 system_info= "AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F 16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: additional 4 GGUFs metadata loaded. llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/De epSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf (version GGUF V3 (latest)) . . . Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU llm_load_tensors: offloading 61 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 62/62 layers to GPU llm_load_tensors: CPU buffer size = 205716.00 MiB llm_load_tensors: CUDA_Host buffer size = 497.11 MiB llm_load_tensors: CUDA0 buffer size = 9885.95 MiB .................................................................................................... ============ llm_load_tensors: need to compute 61 wk_b tensors /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10629: /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10629: GGML_ASSERT(dst->type == GGML_TYPE_F32) fail ed /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10629: GGML_ASSERT(dst->type == GGML_TYPE_F32) failed /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10629: GGML_ASSERT(dst->type == GGML_TYPE_F32) failed ``` --- 👤 **ikawrakow** commented the **2025-03-20** at **14:15:05**:
I guess I'm getting confused myself. Too many options to keep track of. But I did put more effort into making copy/transpose/etc. work with quantized tensors in PR #272. Can you check if that works? Thanks! --- 👤 **ubergarm** commented the **2025-03-20** at **15:51:37**:
Okay, I repacked a quant using the new feature from PR #272 and now it runs successfully testing CPU-only on the 6980P. So there's no need for `-rtr` anymore.

1. The repacked quant loads and successfully computes the `wk_b` tensors from the repacked weights.
2. It allows `mmap()`, so things start up much quicker, with potential huge pages stuff later.
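Roughly, the offline repack step looks something like the sketch below (the `--repack` option name and argument order are assumptions about the PR #272 tooling, not confirmed in this thread, and the input path is just a placeholder):

```bash
# Sketch only: `--repack` and the argument order are assumptions about the
# PR #272 offline-repack tooling; the input path is a placeholder.
./build/bin/llama-quantize --repack \
    /path/to/DeepSeek-R1-Q4_K.gguf \
    /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf \
    q4_k_r4
```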
Full command and output log ```bash $ git rev-parse --short HEAD 9fe6fc37 $ numactl -N 0 -m 0 \ ./build/bin/llama-server \ --alias repack/DeepSeek-R1-Q4_K_R4 \ --model /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf \ --ctx-size 32768 \ -ctk q8_0 \ -mla 2 -fa \ -amb 512 \ -fmoe \ --parallel 1 \ --threads 128 \ --numa numactl \ --host 127.0.0.1 \ --port 8080 INFO [ main] build info | tid="135113007282112" timestamp=1742485327 build=3604 commit="9fe6fc37" INFO [ main] system info | tid="135113007282112" timestamp=1742485327 n_threads=128 n_threads_batch=-1 total_threads=512 system_inf o="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf (version GGU F V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth . . . llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q4_K: 1 tensors llama_model_loader: - type q4_k_r4: 605 tensors llama_model_loader: - type q6_k_r4: 58 tensors . . . llm_load_tensors: CPU buffer size = 385689.62 MiB .................................................................................................... ============ llm_load_tensors: need to compute 61 wk_b tensors Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU . . . llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used llama_new_context_with_model: CPU output buffer size = 0.99 MiB llama_new_context_with_model: CPU compute buffer size = 2048.01 MiB llama_new_context_with_model: graph nodes = 8184 llama_new_context_with_model: graph splits = 1 . . . ```
Great, I'll try to repack the unsloth `Q8_0` and see if that fixes every chunk throwing `nan` on `llama-perplexity` too. --- 👤 **ikawrakow** commented the **2025-03-20** at **16:04:58**:
> Great, I'll try to repack the unsloth Q8_0 and see if that fixes every chunk throwing nan on llama-perplexity too.

Are the NaNs `ik_llama.cpp` specific, or does mainline also produce NaNs with the Unsloth `Q8_0` model? --- 👤 **ubergarm** commented the **2025-03-20** at **17:07:50**:
> Are the NaNs ik_llama.cpp specific, or does mainline also produce NaNs with the Unsloth Q8_0 model?

Yes, they appear to be `ik_llama.cpp` specific: I got mainline `llama.cpp@b1b132ef` to give a full clean `llama-perplexity` run with no NaNs on the same GGUF files:
mainline llama.cpp clean Q8_0 perplexity run ```bash ## llama.cpp mainline $ git rev-parse --short head b1b132ef $ numactl -N 0 -m 0 \ ./build/bin/llama-perplexity \ --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \ -ctk f16 -ctv f16 \ --ctx-size 512 \ --ubatch-size 512 \ -f wiki.test.raw \ --numa numactl \ --threads 80 build: 4905 (b1b132ef) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance llama_model_loader: additional 14 GGUFs metadata loaded. llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 6: deepseek2.block_count u32 = 61 llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 15: general.file_type u32 = 7 llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3 # remove tokenizer as characters mess up my copy/paste clipboard llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: 
- kv 40: tokenizer.ggml.padding_token_id u32 = 128815 llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 44: general.quantization_version u32 = 2 llama_model_loader: - kv 45: split.no u16 = 0 llama_model_loader: - kv 46: split.count u16 = 15 llama_model_loader: - kv 47: split.tensors.count i32 = 1025 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q8_0: 664 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q8_0 print_info: file size = 664.29 GiB (8.50 BPW) load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect load: special tokens cache size = 819 load: token to piece cache size = 0.8223 MB print_info: arch = deepseek2 print_info: vocab_only = 0 print_info: n_ctx_train = 163840 print_info: n_embd = 7168 print_info: n_layer = 61 print_info: n_head = 128 print_info: n_head_kv = 128 print_info: n_rot = 64 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 192 print_info: n_embd_head_v = 128 print_info: n_gqa = 1 print_info: n_embd_k_gqa = 24576 print_info: n_embd_v_gqa = 16384 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 18432 print_info: n_expert = 256 print_info: n_expert_used = 8 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = yarn print_info: freq_base_train = 10000.0 print_info: freq_scale_train = 0.025 print_info: n_ctx_orig_yarn = 4096 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 671B print_info: model params = 671.03 B print_info: general.name = DeepSeek R1 BF16 print_info: n_layer_dense_lead = 3 print_info: n_lora_q = 1536 print_info: n_lora_kv = 512 print_info: n_ff_exp = 2048 print_info: n_expert_shared = 1 print_info: expert_weights_scale = 2.5 print_info: expert_weights_norm = 1 print_info: expert_gating_func = sigmoid print_info: rope_yarn_log_mul = 0.1000 print_info: vocab type = BPE print_info: n_vocab = 129280 print_info: n_merges = 127741 print_info: BOS token = 0 '<|begin▁of▁sentence|>' print_info: EOS token = 1 '<|end▁of▁sentence|>' print_info: EOT token = 1 '<|end▁of▁sentence|>' print_info: PAD token = 128815 '<|PAD▁TOKEN|>' print_info: LF token = 201 'Ċ' print_info: FIM PRE token = 128801 '<|fim▁begin|>' print_info: FIM SUF token = 128800 '<|fim▁hole|>' print_info: FIM MID token = 128802 '<|fim▁end|>' print_info: EOG token = 1 '<|end▁of▁sentence|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... 
(mmap = true) load_tensors: AMX model buffer size = 18214.39 MiB load_tensors: CPU_Mapped model buffer size = 45565.90 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 46661.11 MiB load_tensors: CPU_Mapped model buffer size = 28077.60 MiB .................................................................................................... llama_context: constructing llama_context llama_context: n_seq_max = 4 llama_context: n_ctx = 2048 llama_context: n_ctx_per_seq = 512 llama_context: n_batch = 2048 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 0 llama_context: freq_base = 10000.0 llama_context: freq_scale = 0.025 llama_context: n_ctx_per_seq (512) < n_ctx_train (163840) -- the full capacity of the model will not be utilized llama_context: CPU output buffer size = 1.97 MiB init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0 init: CPU KV buffer size = 9760.00 MiB llama_context: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB llama_context: CPU compute buffer size = 670.01 MiB llama_context: graph nodes = 5025 llama_context: graph splits = 1 common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) system_info: n_threads = 80 (n_threads_batch = 80) / 512 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | perplexity: tokenizing the input .. 
perplexity: tokenization took 724.131 ms perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4 perplexity: 60.35 seconds per pass - ETA 2 hours 21.05 minutes [1]2.5013,[2]3.2882,[3]2.3700,[4]1.9826,[5]1.7891,[6]1.6469,[7]1.5544,[8]1.4883,[9]1.4387,[10]1.3997,[11]1.3842,[12]1.4194,[13]1.4299,[14]1.5576,[15]1.6890,[16]1.7483,[17]1.9110,[18]2.0408,[19]2.0033,[20]1.9911,[21]2.0982,[22]2.0702,[23]2.0430,[24]2.0560,[25]2.0267,[26]2.0035,[27]2.0524,[28]2.0598,[29]2.1085,[30]2.1396,[31]2.1742,[32]2.1918,[33]2.2304,[34]2.2706,[35]2.3192,[36]2.3717,[37]2.4071,[38]2.4526,[39]2.4940,[40]2.5527,[41]2.5950,[42]2.6072,[43]2.6559,[44]2.6723,[45]2.7517,[46]2.8023,[47]2.7573,[48]2.7107,[49]2.6842,[50]2.7039,[51]2.7504,[52]2.7650,[53]2.8143,[54]2.8275,[55]2.8585,[56]2.8898,[57]2.9036,[58]2.9402,[59]2.9512,[60]2.9968,[61]3.0366,[62]3.0894,[63]3.1213,[64]3.1652,[65]3.1751,[66]3.1579,[67]3.1353,[68]3.1665,[69]3.1618,[70]3.1771,[71]3.1956,[72]3.2115,[73]3.2259,[74]3.2494,[75]3.2284,[76]3.1816,[77]3.1389,[78]3.1344,[79]3.1122,[80]3.0929,[81]3.0561,[82]3.0596,[83]3.0282,[84]2.9923,[85]2.9572,[86]2.9321,[87]2.9257,[88]2.8971,[89]2.8805,[90]2.8542,[91]2.8245,[92]2.7997,[93]2.7731,[94]2.7463,[95]2.7224,[96]2.7210,[97]2.7283,[98]2.7132,[99]2.6960,[100]2.6985,[101]2.6899,[102]2.7065,[103]2.7327,[104]2.7513,[105]2.7482,[106]2.7706,[107]2.7948,[108]2.8154,[109]2.8493,[110]2.8832,[111]2.9028,[112]2.8771,[113]2.8641,[114]2.8419,[115]2.8266,[116]2.8114,[117]2.7885,[118]2.7677,[119]2.7465,[120]2.7277,[121]2.7122,[122]2.6947,[123]2.6785,[124]2.6597,[125]2.6422,[126]2.6257,[127]2.6117,[128]2.6027,[129]2.5920,[130]2.5797,[131]2.5724,[132]2.5798,[133]2.5894,[134]2.5959,[135]2.6064,[136]2.6225,[137]2.6379,[138]2.6461,[139]2.6576,[140]2.6586,[141]2.6603,[142]2.6594,[143]2.6599,[144]2.6569,[145]2.6481,[146]2.6467,[147]2.6512,[148]2.6510,[149]2.6527,[150]2.6476,[151]2.6458,[152]2.6429,[153]2.6392,[154]2.6399,[155]2.6443,[156]2.6465,[157]2.6527,[158]2.6615,[159]2.6634,[160]2.6723,[161]2.6806,[162]2.6900,[163]2.6941,[164]2.7141,[165]2.7378,[166]2.7551,[167]2.7673,[168]2.7915,[169]2.8139,[170]2.8354,[171]2.8586,[172]2.8427,[173]2.8264,[174]2.8128,[175]2.7995,[176]2.7872,[177]2.7756,[178]2.7630,[179]2.7493,[180]2.7532,[181]2.7671,[182]2.7822,[183]2.7970,[184]2.8112,[185]2.8216,[186]2.8381,[187]2.8534,[188]2.8675,[189]2.8782,[190]2.8785,[191]2.8859,[192]2.8899,[193]2.8950,[194]2.9146,[195]2.9234,[196]2.9368,[197]2.9468,[198]2.9513,[199]2.9570,[200]2.9566,[201]2.9717,[202]2.9671,[203]2.9724,[204]2.9760,[205]2.9759,[206]2.9785,[207]2.9874,[208]2.9970,[209]3.0063,[210]3.0069,[211]3.0022,[212]3.0021,[213]3.0097,[214]3.0116,[215]3.0174,[216]3.0180,[217]3.0140,[218]3.0142,[219]3.0152,[220]3.0146,[221]3.0148,[222]3.0149,[223]3.0155,[224]3.0205,[225]3.0224,[226]3.0144,[227]3.0122,[228]3.0145,[229]3.0191,[230]3.0256,[231]3.0318,[232]3.0236,[233]3.0158,[234]3.0158,[235]3.0142,[236]3.0230,[237]3.0315,[238]3.0410,[239]3.0508,[240]3.0601,[241]3.0713,[242]3.0857,[243]3.0992,[244]3.1073,[245]3.1183,[246]3.1288,[247]3.1276,[248]3.1235,[249]3.1216,[250]3.1154,[251]3.1133,[252]3.1158,[253]3.1196,[254]3.1267,[255]3.1331,[256]3.1369,[257]3.1393,[258]3.1405,[259]3.1438,[260]3.1459,[261]3.1473,[262]3.1465,[263]3.1522,[264]3.1545,[265]3.1550,[266]3.1568,[267]3.1597,[268]3.1634,[269]3.1665,[270]3.1659,[271]3.1644,[272]3.1577,[273]3.1576,[274]3.1507,[275]3.1399,[276]3.1291,[277]3.1308,[278]3.1410,[279]3.1472,[280]3.1551,[281]3.1625,[282]3.1687,[283]3.1751,[284]3.1818,[285]3.1954,[286]3.1979,[287]3.2013,[288]3.2060,[289]3.2087,[29
0]3.2005,[291]3.1911,[292]3.1892,[293]3.1883,[294]3.1855,[295]3.1829,[296]3.1848,[297]3.1853,[298]3.1902,[299]3.1961,[300]3.1992,[301]3.2030,[302]3.2052,[303]3.2072,[304]3.2067,[305]3.2186,[306]3.2261,[307]3.2370,[308]3.2258,[309]3.2204,[310]3.2109,[311]3.2145,[312]3.2167,[313]3.2230,[314]3.2251,[315]3.2283,[316]3.2297,[317]3.2315,[318]3.2321,[319]3.2324,[320]3.2367,[321]3.2370,[322]3.2390,[323]3.2454,[324]3.2463,[325]3.2516,[326]3.2563,[327]3.2604,[328]3.2634,[329]3.2652,[330]3.2715,[331]3.2752,[332]3.2800,[333]3.2786,[334]3.2787,[335]3.2792,[336]3.2794,[337]3.2805,[338]3.2808,[339]3.2835,[340]3.2871,[341]3.2925,[342]3.3015,[343]3.3108,[344]3.3161,[345]3.3074,[346]3.2997,[347]3.2945,[348]3.2872,[349]3.2835,[350]3.2817,[351]3.2864,[352]3.3013,[353]3.3104,[354]3.3232,[355]3.3318,[356]3.3371,[357]3.3487,[358]3.3583,[359]3.3615,[360]3.3680,[361]3.3772,[362]3.3858,[363]3.3915,[364]3.3981,[365]3.4044,[366]3.4148,[367]3.4234,[368]3.4301,[369]3.4380,[370]3.4465,[371]3.4602,[372]3.4689,[373]3.4722,[374]3.4758,[375]3.4808,[376]3.4936,[377]3.5048,[378]3.5075,[379]3.5069,[380]3.5037,[381]3.5083,[382]3.5139,[383]3.5175,[384]3.5218,[385]3.5257,[386]3.5319,[387]3.5377,[388]3.5411,[389]3.5308,[390]3.5213,[391]3.5107,[392]3.5051,[393]3.4955,[394]3.4865,[395]3.4772,[396]3.4672,[397]3.4584,[398]3.4488,[399]3.4385,[400]3.4296,[401]3.4196,[402]3.4093,[403]3.4007,[404]3.3905,[405]3.3811,[406]3.3711,[407]3.3619,[408]3.3531,[409]3.3446,[410]3.3386,[411]3.3392,[412]3.3345,[413]3.3363,[414]3.3385,[415]3.3353,[416]3.3351,[417]3.3375,[418]3.3317,[419]3.3332,[420]3.3308,[421]3.3298,[422]3.3312,[423]3.3304,[424]3.3346,[425]3.3341,[426]3.3346,[427]3.3335,[428]3.3360,[429]3.3378,[430]3.3406,[431]3.3413,[432]3.3403,[433]3.3366,[434]3.3366,[435]3.3289,[436]3.3226,[437]3.3185,[438]3.3167,[439]3.3134,[440]3.3183,[441]3.3237,[442]3.3311,[443]3.3293,[444]3.3302,[445]3.3315,[446]3.3363,[447]3.3396,[448]3.3421,[449]3.3452,[450]3.3490,[451]3.3520,[452]3.3540,[453]3.3557,[454]3.3543,[455]3.3564,[456]3.3567,[457]3.3594,[458]3.3646,[459]3.3653,[460]3.3654,[461]3.3622,[462]3.3659,[463]3.3732,[464]3.3785,[465]3.3714,[466]3.3696,[467]3.3677,[468]3.3688,[469]3.3658,[470]3.3631,[471]3.3634,[472]3.3640,[473]3.3632,[474]3.3624,[475]3.3635,[476]3.3619,[477]3.3610,[478]3.3617,[479]3.3633,[480]3.3660,[481]3.3620,[482]3.3654,[483]3.3646,[484]3.3682,[485]3.3746,[486]3.3775,[487]3.3812,[488]3.3864,[489]3.3889,[490]3.3935,[491]3.3997,[492]3.4042,[493]3.4040,[494]3.4052,[495]3.4076,[496]3.4095,[497]3.4124,[498]3.4127,[499]3.4122,[500]3.4163,[501]3.4209,[502]3.4200,[503]3.4185,[504]3.4205,[505]3.4239,[506]3.4323,[507]3.4350,[508]3.4385,[509]3.4312,[510]3.4254,[511]3.4188,[512]3.4142,[513]3.4080,[514]3.4065,[515]3.4084,[516]3.4033,[517]3.4032,[518]3.4024,[519]3.4029,[520]3.4073,[521]3.4062,[522]3.4047,[523]3.4105,[524]3.4092,[525]3.4076,[526]3.4028,[527]3.3979,[528]3.3942,[529]3.3913,[530]3.3883,[531]3.3852,[532]3.3797,[533]3.3735,[534]3.3692,[535]3.3700,[536]3.3728,[537]3.3759,[538]3.3785,[539]3.3812,[540]3.3865,[541]3.3898,[542]3.3922,[543]3.3865,[544]3.3822,[545]3.3819,[546]3.3753,[547]3.3688,[548]3.3624,[549]3.3557,[550]3.3497,[551]3.3436,[552]3.3378,[553]3.3319,[554]3.3298,[555]3.3283,[556]3.3311,[557]3.3351,[558]3.3410,[559]3.3455,[560]3.3508,[561]3.3490, Final estimate: PPL = 3.3490 +/- 0.01849 llama_perf_context_print: load time = 226439.86 ms llama_perf_context_print: prompt eval time = 8320298.42 ms / 287232 tokens ( 28.97 ms per token, 34.52 tokens per second) llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, 
inf tokens per second) llama_perf_context_print: total time = 8511632.28 ms / 287233 tokens ```
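For reference, the "exact same command as mainline" case on `ik_llama.cpp` is roughly the sketch below, i.e. the mainline invocation above with none of the ik-specific flags (all options here already appear elsewhere in this issue):

```bash
# Mainline-equivalent run on ik_llama.cpp: no -rtr / -mla / -fa / -fmoe,
# to check whether the NaNs appear even without the ik-specific code paths.
numactl -N 0 -m 0 \
./build/bin/llama-perplexity \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    -ctk f16 -ctv f16 \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --numa numactl \
    --threads 80
```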
I tried a few combinations of SHAs, with and without `-rtr` and `-mla 1`, the exact same command as mainline llama.cpp above, etc., but I'm always getting NaNs with `ik_llama.cpp` so far:
ik_llama.cpp NaNs on same quant ```bash ## ik_llama.cpp@f2fb15de $ numactl -N 0 -m 0 \ ./build/bin/llama-perplexity \ --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \ -rtr \ -ctk f16 -ctv f16 \ -mla 2 -fa \ -amb 2048 \ -fmoe \ --ctx-size 512 \ --ubatch-size 512 \ -f wiki.test.raw \ --numa numactl \ --threads 80 main: build = 3596 (f2fb15de) main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu main: seed = 1742247516 llama_model_loader: additional 14 GGUFs metadata loaded. llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 6: deepseek2.block_count u32 = 61 llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 15: general.file_type u32 = 7 llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3 # comment out tokenzier stuff for my poor clipboard llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 128815 
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 44: general.quantization_version u32 = 2 llama_model_loader: - kv 45: split.no u16 = 0 llama_model_loader: - kv 46: split.count u16 = 15 llama_model_loader: - kv 47: split.tensors.count i32 = 1025 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q8_0: 664 tensors llm_load_vocab: special tokens cache size = 819 llm_load_vocab: token to piece cache size = 0.8223 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 129280 llm_load_print_meta: n_merges = 127741 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_layer = 61 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18432 llm_load_print_meta: n_expert = 256 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 0.025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 671B llm_load_print_meta: model ftype = Q8_0 llm_load_print_meta: model params = 671.026 B llm_load_print_meta: model size = 664.295 GiB (8.504 BPW) llm_load_print_meta: repeating layers = 662.461 GiB (8.504 BPW, 669.173 B parameters) llm_load_print_meta: general.name = DeepSeek R1 BF16 llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>' llm_load_print_meta: LF token = 131 'Ä' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 llm_load_print_meta: n_expert_shared = 1 llm_load_print_meta: expert_weights_scale = 2.5 llm_load_print_meta: expert_weights_norm = 1 llm_load_print_meta: expert_gating_func = sigmoid llm_load_print_meta: rope_yarn_log_mul = 0.1000 llm_load_tensors: ggml ctx size = 0.42 MiB llm_load_tensors: CPU buffer size = 680237.97 MiB .................................................................................................... 
============ llm_load_tensors: need to compute 61 wk_b tensors WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and 
stored in buffer CPU Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and sllama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 2 llama_new_context_with_model: attn_max_b = 2048 llama_new_context_with_model: fused_moe = 1 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 0.025 llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank 
= 512 llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: CPU KV buffer size = 137.25 MiB llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used llama_new_context_with_model: CPU output buffer size = 1.97 MiB llama_new_context_with_model: CPU compute buffer size = 432.01 MiB llama_new_context_with_model: graph nodes = 3365 llama_new_context_with_model: graph splits = 1 system_info: n_threads = 80 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | perplexity: tokenizing the input .. 
perplexity: tokenization took 912.853 ms perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4 perplexity: 21.11 seconds per pass - ETA 49.35 minutes tored in buffer CPU Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU ============ Repacked 663 tensors [1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan, ```
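To narrow down which option is responsible, a short sweep over the ik-specific flags on just a few chunks would probably be the next step. A sketch (assuming this build's `llama-perplexity` supports `--chunks`; all other flags are ones already used above):

```bash
# Sketch: run a few perplexity chunks with each ik-specific option toggled,
# to see which one first produces NaNs. Assumes --chunks is available.
MODEL=/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf
for OPTS in "" "-rtr" "-fmoe" "-mla 1 -fa" "-mla 2 -fa"; do
    echo "=== options: $OPTS ==="
    numactl -N 0 -m 0 ./build/bin/llama-perplexity \
        --model "$MODEL" \
        -ctk f16 -ctv f16 \
        --ctx-size 512 --ubatch-size 512 --chunks 4 \
        -f wiki.test.raw \
        --numa numactl --threads 80 \
        $OPTS   # intentionally unquoted so multi-word options split
done
```

If even the plain run produces NaNs, the MLA/FA/fused-MoE paths are presumably not the culprit.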
Trying one more time with today's updates and an offline repacked quant:
Trying `ik_llama.cpp@9fe6fc37` with offline repacked quant ```bash $ git checkout ik/offline_repack $ git rev-parse --short HEAD 9fe6fc37 $ numactl -N 0 -m 0 \ ./build/bin/llama-perplexity \ --model /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf \ -ctk q8_0 \ -mla 2 -fa \ -amb 512 \ -fmoe \ --ctx-size 512 \ --ubatch-size 512 \ -f wiki.test.raw \ --seed 1337 \ --numa numactl \ --threads 128 main: build = 3604 (9fe6fc37) main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu main: seed = 1337 llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 6: deepseek2.block_count u32 = 61 llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 15: general.file_type u32 = 207 llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3 llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 128815 llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool 
= false llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 44: general.quantization_version u32 = 2 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q8_0: 1 tensors llama_model_loader: - type q8_0_r8: 663 tensors llm_load_vocab: special tokens cache size = 819 llm_load_vocab: token to piece cache size = 0.8223 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 129280 llm_load_print_meta: n_merges = 127741 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_layer = 61 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18432 llm_load_print_meta: n_expert = 256 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 0.025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 671B llm_load_print_meta: model ftype = Q8_0_R8 - 8.5 bpw llm_load_print_meta: model params = 671.026 B llm_load_print_meta: model size = 664.295 GiB (8.504 BPW) llm_load_print_meta: repeating layers = 662.461 GiB (8.504 BPW, 669.173 B parameters) llm_load_print_meta: general.name = DeepSeek R1 BF16 llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>' llm_load_print_meta: LF token = 131 'Ä' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 llm_load_print_meta: n_expert_shared = 1 llm_load_print_meta: expert_weights_scale = 2.5 llm_load_print_meta: expert_weights_norm = 1 llm_load_print_meta: expert_gating_func = sigmoid llm_load_print_meta: rope_yarn_log_mul = 0.1000 llm_load_tensors: ggml ctx size = 0.42 MiB llm_load_tensors: CPU buffer size = 680237.97 MiB .................................................................................................... 
============ llm_load_tensors: need to compute 61 wk_b tensors Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU 
Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Collama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CPU KV buffer size = 72.91 MiB
llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 1.97 MiB
llama_new_context_with_model: CPU compute buffer size = 450.01 MiB
llama_new_context_with_model: graph nodes = 3487
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1752.8 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 15.91 seconds per pass - ETA 37.20 minutes
mputed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,
```
Happy to open a new ticket and copy-paste this over there if that makes it easier to track. Thanks, I'm enjoying all these great features!

---

👤 **ikawrakow** commented the **2025-03-21** at **13:08:46**:
The `Computed ... and stored in buffer CPU` messages are appearing **after** the perplexity calculation has already started. Is this a race (am I missing a synchronization somewhere)? Or is it just I/O buffering, because I use plain `printf` for these messages while the others go through `LLAMA_LOG_INFO`? If it is the former (a race), that would explain the NaNs: the calculation starts before the necessary tensors are ready, and a single NaN is enough to make every batch report NaN, because the value being printed is the cumulative result, not the result of that batch alone.
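To illustrate the last point with a toy sketch (this is not the actual perplexity code, just the arithmetic of a running average with made-up per-chunk values): one NaN anywhere poisons every value printed from that point on.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical per-chunk negative log-likelihoods; chunk 3 produced a NaN
    // (e.g. because a tensor it depends on was not ready yet).
    std::vector<double> chunk_nll = {2.10, 1.95, NAN, 2.02, 1.88};

    double total = 0.0;
    for (size_t i = 0; i < chunk_nll.size(); ++i) {
        total += chunk_nll[i];                   // the NaN poisons the running sum
        double ppl = std::exp(total / (i + 1));  // cumulative PPL, like the [n]x.xxxx output
        printf("[%zu]%.4f,", i + 1, ppl);        // prints: [1]8.1662,[2]7.5761,[3]nan,[4]nan,[5]nan,
    }
    printf("\n");
    return 0;
}
```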
---

👤 **ubergarm** commented the **2025-03-21** at **15:45:57**:

Ohh, I see what you're saying now. It looks like the perplexity calculation has already started but it is still printing out `Computed ...` lines. FWIW I'm redirecting stderr to stdout and piping it all into `tee` to save logs and view output:
trimmed example logs

```bash
$ ./myscripts/perplexity.sh 2>&1 | tee -a logs/perplexity-R1-Q8_0_R8-ik-llama-9fe6fc37.log
. . .
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CPU KV buffer size = 72.91 MiB
llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 1.97 MiB
llama_new_context_with_model: CPU compute buffer size = 450.01 MiB
llama_new_context_with_model: graph nodes = 3487
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 980.309 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 15.59 seconds per pass - ETA 36.45 minutes
mputed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,^C^C
```
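One thing worth keeping in mind (not saying it's the whole story): once stdout goes into a pipe it is fully buffered by stdio while stderr stays unbuffered, so if `LLAMA_LOG_INFO` ends up on stderr here, the relative order of lines in the captured log doesn't have to match the order they were emitted. A tiny standalone repro of that effect (plain stdio, nothing from ik_llama.cpp):

```cpp
#include <cstdio>

int main() {
    // Emit to stdout first, stderr second.
    printf("stdout: computed tensor (emitted FIRST)\n");
    fprintf(stderr, "stderr: perplexity started (emitted SECOND)\n");

    // Run as:  ./a.out 2>&1 | tee log.txt
    // With stdout redirected into a pipe it becomes fully buffered, while stderr
    // is unbuffered, so the stderr line typically lands in log.txt *before* the
    // stdout line, even though it was emitted after it.
    return 0;
}
```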
So we could possibly test this by:

1. flushing stderr/stdout after each `printf`
2. having some synchronization flag, e.g. `isTensorsReady`, that is set after the tensors are computed and stored in the buffer, and then making the perplexity calculation spin-wait on it (rough sketch after this list)

I'm going to set a new quant cooking on the Thread Ripper, then get onto the 6980P and look at this more closely this morning.
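Something along these lines is what I have in mind for the two options (just a sketch, `tensors_ready`, `compute_wk_b_tensors()` and `run_perplexity_chunks()` are made-up names, not the real functions):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical flag set once all wk_b / v_b tensors have been computed and stored.
static std::atomic<bool> tensors_ready{false};

void compute_wk_b_tensors() {            // placeholder for the on-the-fly computation
    // ... compute and store each tensor ...
    printf("Computed blk.N.attn_v_b.weight ...\n");
    fflush(stdout);                      // option 1: flush right after each printf
    tensors_ready.store(true, std::memory_order_release);
}

void run_perplexity_chunks() {           // placeholder for the evaluation side
    while (!tensors_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();       // option 2: spin/yield until the tensors exist
    }
    // ... safe to start evaluating batches now ...
}

int main() {
    std::thread producer(compute_wk_b_tensors);
    std::thread consumer(run_perplexity_chunks);
    producer.join();
    consumer.join();
    return 0;
}
```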
---

👤 **ikawrakow** commented the **2025-03-24** at **17:48:22**:

I think this is solved now, but I'm keeping it open because of the NaNs reported for Unsloth's `Q8_0` model. I guess it would be better to close it and open a new issue about the `Q8_0` NaNs.

---

👤 **ubergarm** commented the **2025-03-24** at **17:58:47**:
Thanks, yes, feel free to close this and I will create a new issue specific to the `Q8_0` NaNs. Getting side-tracked today with the new https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 haha...