### šŸ—£ļø [#393](https://github.com/ikawrakow/ik_llama.cpp/discussions/393) - Creating quantized models | **Author** | `nux` | | :--- | :--- | | **Created** | 2025-05-07 | | **Updated** | 2025-05-29 | --- #### Description Hello, I've been experimenting with the `ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4` model on a system with a 3090 GPU and dual Epyc 9115 CPUs (768GB DDR5 5600). The model runs smoothly with ~11 TPS for short prompts and ~62 TPS eval / 8.65 TPS gen at 17k tokens. It uses ~22.9GB GPU and ~395GB RAM. I'm exploring whether higher-precision quantization (e.g., `q5_k_r4`, `q4_k_r4`, `q3_k_r4`) could improve quality while using more RAM. I modified the quantization script from ubergarm's Hugging Face page, but I'm not confident in my approach. What I did so far: - Created a custom quantization config to use `q8_0` for most layers and lower-precision types for Routed Experts (e.g., `q5_k_r4`, `q4_k_r4`, `q3_k_r4`) in different layer ranges. - Ran the quantization command (see below). quantization script: ```bash #!/usr/bin/env bash custom=" # Token embedding (GPU) token_embd\.weight=q8_0 # output tensors (GPU) output\.weight=q8_0 output_norm\.weight=q8_0 # First 3 dense layers (0-3) (GPU) blk\.[0-2]\..*=q8_0 # All attention, weights, and bias tensors (GPU) blk\.[3-9]\.attn_.*=q8_0 blk\.[1-5][0-9]\.attn_.*=q8_0 blk\.60\.attn_.*=q8_0 blk\.[3-9]\.ffn_norm\.weight=q8_0 blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0 blk\.60\.ffn_norm\.weight=q8_0 blk\.[3-9]\.exp_probs_b\.bias=q8_0 blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0 blk\.60\.exp_probs_b\.bias=q8_0 # Shared Experts (GPU) blk\.[3-9]\.ffn_down_shexp\.weight=q8_0 blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0 blk\.60\.ffn_down_shexp\.weight=q8_0 blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0 blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0 blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0 # Routed Experts - Early layers (3-20) (CPU) blk\.[3-9]\.ffn_down_exps\.weight=q5_k_r4 blk\.1[0-9]\.ffn_down_exps\.weight=q5_k_r4 blk\.[3-9]\.ffn_(gate|up)_exps\.weight=q5_k_r4 blk\.1[0-9]\.ffn_(gate|up)_exps\.weight=q5_k_r4 # Routed Experts - Middle layers (21-40) (CPU) blk\.2[0-9]\.ffn_down_exps\.weight=q4_k_r4 blk\.3[0-9]\.ffn_down_exps\.weight=q4_k_r4 blk\.2[0-9]\.ffn_(gate|up)_exps\.weight=q4_k_r4 blk\.3[0-9]\.ffn_(gate|up)_exps\.weight=q4_k_r4 # Routed Experts - Later layers (41-60) (CPU) blk\.4[0-9]\.ffn_down_exps\.weight=q3_k_r4 blk\.5[0-9]\.ffn_down_exps\.weight=q3_k_r4 blk\.60\.ffn_down_exps\.weight=q3_k_r4 blk\.4[0-9]\.ffn_(gate|up)_exps\.weight=q3_k_r4 blk\.5[0-9]\.ffn_(gate|up)_exps\.weight=q3_k_r4 blk\.60\.ffn_(gate|up)_exps\.weight=q3_k_r4 " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) /home/nux/dev/ik_llama.cpp/build/bin/llama-quantize \ --imatrix /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \ --token-embedding-type q8_0 \ --output-tensor-type q8_0 \ --custom-q "$custom" \ /mnt/amp/models/unsloth/DeepSeek-V3-0324-GGUF/BF16/DeepSeek-V3-0324-BF16-00001-of-00030.gguf \ /mnt/nvme/models/nux/DeepSeek-V3-0324/DeepSeek-V3-0324-OPTIMIZED.gguf \ Q5_K_M \ 24 ``` perplexity: ```bash # ./build/bin/llama-perplexity --model /mnt/nvme/models/nux/DeepSeek-V3-0324/DeepSeek-V3-0324-OPTIMIZED.gguf -ctk q8_0 -mla 2 -fa -amb 512 -fmoe --ctx-size 512 --ubatch-size 512 -f wiki.test.raw --seed 1337 --n-gpu-layers 63 --override-tensor exps=CPU --threads 32 >> /mnt/nvme/models/nux/DeepSeek-V3-0324/perp.txt 2>&1 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 
CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes main: build = 3668 (6c23618c) main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu main: seed = 1337 llama_model_loader: loaded meta data with 50 key-value pairs and 1025 tensors from /mnt/nvme/models/nux/DeepSeek-V3-0324/DeepSeek-V3-0324-OPTIMIZED.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.license str = mit llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 7: deepseek2.block_count u32 = 61 llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 16: general.file_type u32 = 17 llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3 llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: LF token = 131 'Ƅ' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 llm_load_print_meta: n_expert_shared = 1 llm_load_print_meta: expert_weights_scale = 2.5 llm_load_print_meta: expert_weights_norm = 1 
llm_load_print_meta: expert_gating_func = sigmoid llm_load_print_meta: rope_yarn_log_mul = 0.1000 llm_load_tensors: ggml ctx size = 0.85 MiB Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU Tensor 
blk.21.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU Tensor 
blk.39.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU Tensor 
blk.58.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU llm_load_tensors: offloading 61 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 62/62 layers to GPU llm_load_tensors: CPU buffer size = 358335.85 MiB llm_load_tensors: CPU buffer size = 938.98 MiB llm_load_tensors: CUDA0 buffer size = 16706.99 MiB .................................................................................................... ============ llm_load_tensors: need to compute 61 wk_b tensors Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed 
blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.52.attn_v_b.weight as 128 x 512 x 128llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 2 llama_new_context_with_model: attn_max_b = 512 llama_new_context_with_model: fused_moe = 1 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 0.025 llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, 
kv_lora_rank = 512 llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: CUDA0 KV buffer size = 72.94 MiB llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used llama_new_context_with_model: CUDA_Host output buffer size = 1.97 MiB llama_new_context_with_model: CUDA0 compute buffer size = 503.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 162.01 MiB llama_new_context_with_model: graph nodes = 3548 llama_new_context_with_model: graph splits = 118 system_info: n_threads = 32 / 64 | 
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | perplexity: tokenizing the input .. perplexity: tokenization took 697.458 ms perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4 perplexity: 23.76 seconds per pass - ETA 55.53 minutes and stored in buffer CUDA0 Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 [1]2.4727,[2]3.1883,[3]2.3363,[4]1.9646,[5]1.7720,[6]1.6345,[7]1.5490,[8]1.4876,[9]1.4439,[10]1.4068,[11]1.3941,[12]1.4216,[13]1.4365,[14]1.5616,[15]1.6908,[16]1.7475,[17]1.9062,[18]2.0309,[19]1.9925,[20]1.9848,[21]2.0866,[22]2.0605,[23]2.0343,[24]2.0436,[25]2.0133,[26]1.9922,[27]2.0366,[28]2.0458,[29]2.0933,[30]2.1265,[31]2.1579,[32]2.1765,[33]2.2166,[34]2.2604,[35]2.3081,[36]2.3588,[37]2.3937,[38]2.4407,[39]2.4825,[40]2.5397,[41]2.5797,[42]2.5937,[43]2.6436,[44]2.6593,[45]2.7400,[46]2.7900,[47]2.7470,[48]2.7026,[49]2.6785,[50]2.6973,[51]2.7409,[52]2.7529,[53]2.8040,[54]2.8183,[55]2.8495,[56]2.8797,[57]2.8936,[58]2.9262,[59]2.9363,[60]2.9827,[61]3.0216,[62]3.0687,[63]3.1002,[64]3.1411,[65]3.1497,[66]3.1340,[67]3.1110,[68]3.1375,[69]3.1332,[70]3.1463,[71]3.1640,[72]3.1775,[73]3.1921,[74]3.2131,[75]3.1924,[76]3.1473,[77]3.1041,[78]3.0989,[79]3.0800,[80]3.0627,[81]3.0285,[82]3.0319,[83]3.0028,[84]2.9684,[85]2.9355,[86]2.9123,[87]2.9089,[88]2.8820,[89]2.8663,[90]2.8415,[91]2.8139,[92]2.7901,[93]2.7646,[94]2.7414,[95]2.7197,[96]2.7192,[97]2.7244,[98]2.7108,[99]2.6936,[100]2.6959,[101]2.6885,[102]2.7034,[103]2.7290,[104]2.7462,[105]2.7423,[106]2.7650,[107]2.7894,[108]2.8104,[109]2.8429,[110]2.8765,[111]2.8947,[112]2.8690,[113]2.8561,[114]2.8345,[115]2.8196,[116]2.8057,[117]2.7833,[118]2.7622,[119]2.7421,[120]2.7239,[121]2.7077,[122]2.6904,[123]2.6725,[124]2.6545,[125]2.6375,[126]2.6206,[127]2.6076,[128]2.6003,[129]2.5900,[130]2.5776,[131]2.5692,[132]2.5755,[133]2.5854,[134]2.5922,[135]2.6030,[136]2.6172,[137]2.6324,[138]2.6405,[139]2.6515,[140]2.6517,[141]2.6533,[142]2.6521,[143]2.6532,[144]2.6503,[145]2.6423,[146]2.6405,[147]2.6450,[148]2.6447,[149]2.6459,[150]2.6403,[151]2.6382,[152]2.6350,[153]2.6309,[154]2.6307,[155]2.6342,[156]2.6359,[157]2.6413,[158]2.6493,[159]2.6519,[160]2.6605,[161]2.6687,[162]2.6789,[163]2.6847,[164]2.7049,[165]2.7283,[166]2.7454,[167]2.7577,[168]2.7815,[169]2.8036,[170]2.8256,[171]2.8470,[172]2.8316,[173]2.8155,[174]2.8027,[175]2.7913,[176]2.7800,[177]2.7689,[178]2.7561,[179]2.7428,[180]2.7464,[181]2.7605,[182]2.7758,[183]2.7898,[184]2.8028,[185]2.8126,[186]2.8284,[187]2.8436,[188]2.8572,[189]2.8675,[190]2.8683,[191]2.8753,[192]2.8781,[193]2.8831,[194]2.9024,[195]2.9111,[196]2.9239,[197]2.9337,[198]2.9382,[199]2.9439,[200]2.9431,[201]2.9578,[202]2.9531,[203]2.9586,[204]2.9614,[205]2.9613,[206]2.9641,[207]2.9723,[208]2.9814,[209]2.9901,[210]2.9901,[211]
2.9852,[212]2.9856,[213]2.9933,[214]2.9953,[215]3.0008,[216]3.0011,[217]2.9963,[218]2.9963,[219]2.9970,[220]2.9968,[221]2.9971,[222]2.9969,[223]2.9976,[224]3.0023,[225]3.0043,[226]2.9962,[227]2.9937,[228]2.9952,[229]2.9988,[230]3.0051,[231]3.0109,[232]3.0025,[233]2.9956,[234]2.9957,[235]2.9947,[236]3.0036,[237]3.0116,[238]3.0208,[239]3.0302,[240]3.0399,[241]3.0508,[242]3.0647,[243]3.0770,[244]3.0851,[245]3.0963,[246]3.1068,[247]3.1054,[248]3.1011,[249]3.0992,[250]3.0935,[251]3.0914,[252]3.0934,[253]3.0973,[254]3.1043,[255]3.1104,[256]3.1135,[257]3.1165,[258]3.1179,[259]3.1212,[260]3.1237,[261]3.1252,[262]3.1242,[263]3.1294,[264]3.1317,[265]3.1323,[266]3.1339,[267]3.1358,[268]3.1393,[269]3.1422,[270]3.1411,[271]3.1398,[272]3.1335,[273]3.1335,[274]3.1269,[275]3.1164,[276]3.1056,[277]3.1077,[278]3.1175,[279]3.1234,[280]3.1308,[281]3.1380,[282]3.1437,[283]3.1499,[284]3.1560,[285]3.1695,[286]3.1715,[287]3.1744,[288]3.1794,[289]3.1816,[290]3.1739,[291]3.1657,[292]3.1642,[293]3.1634,[294]3.1613,[295]3.1588,[296]3.1604,[297]3.1609,[298]3.1663,[299]3.1721,[300]3.1751,[301]3.1791,[302]3.1807,[303]3.1819,[304]3.1812,[305]3.1926,[306]3.1994,[307]3.2100,[308]3.1990,[309]3.1939,[310]3.1849,[311]3.1880,[312]3.1899,[313]3.1952,[314]3.1972,[315]3.2003,[316]3.2015,[317]3.2034,[318]3.2037,[319]3.2039,[320]3.2080,[321]3.2082,[322]3.2100,[323]3.2165,[324]3.2173,[325]3.2222,[326]3.2267,[327]3.2307,[328]3.2330,[329]3.2346,[330]3.2409,[331]3.2441,[332]3.2479,[333]3.2468,[334]3.2467,[335]3.2473,[336]3.2474,[337]3.2485,[338]3.2487,[339]3.2513,[340]3.2549,[341]3.2603,[342]3.2689,[343]3.2779,[344]3.2827,[345]3.2742,[346]3.2666,[347]3.2621,[348]3.2550,[349]3.2515,[350]3.2501,[351]3.2547,[352]3.2692,[353]3.2779,[354]3.2904,[355]3.2987,[356]3.3044,[357]3.3155,[358]3.3253,[359]3.3282,[360]3.3344,[361]3.3436,[362]3.3520,[363]3.3572,[364]3.3639,[365]3.3698,[366]3.3798,[367]3.3882,[368]3.3948,[369]3.4023,[370]3.4109,[371]3.4240,[372]3.4325,[373]3.4360,[374]3.4391,[375]3.4440,[376]3.4563,[377]3.4673,[378]3.4701,[379]3.4702,[380]3.4668,[381]3.4717,[382]3.4772,[383]3.4805,[384]3.4847,[385]3.4884,[386]3.4943,[387]3.5002,[388]3.5032,[389]3.4930,[390]3.4839,[391]3.4739,[392]3.4686,[393]3.4593,[394]3.4508,[395]3.4419,[396]3.4322,[397]3.4236,[398]3.4143,[399]3.4042,[400]3.3954,[401]3.3858,[402]3.3757,[403]3.3675,[404]3.3577,[405]3.3485,[406]3.3388,[407]3.3295,[408]3.3209,[409]3.3126,[410]3.3069,[411]3.3080,[412]3.3036,[413]3.3058,[414]3.3077,[415]3.3051,[416]3.3051,[417]3.3071,[418]3.3014,[419]3.3025,[420]3.2998,[421]3.2986,[422]3.2989,[423]3.2986,[424]3.3025,[425]3.3023,[426]3.3022,[427]3.3016,[428]3.3042,[429]3.3054,[430]3.3082,[431]3.3092,[432]3.3081,[433]3.3044,[434]3.3048,[435]3.2978,[436]3.2922,[437]3.2882,[438]3.2865,[439]3.2838,[440]3.2884,[441]3.2938,[442]3.3010,[443]3.2988,[444]3.2995,[445]3.3006,[446]3.3049,[447]3.3081,[448]3.3102,[449]3.3131,[450]3.3167,[451]3.3196,[452]3.3217,[453]3.3231,[454]3.3217,[455]3.3241,[456]3.3244,[457]3.3268,[458]3.3318,[459]3.3324,[460]3.3326,[461]3.3294,[462]3.3329,[463]3.3401,[464]3.3446,[465]3.3384,[466]3.3363,[467]3.3345,[468]3.3359,[469]3.3333,[470]3.3305,[471]3.3310,[472]3.3315,[473]3.3308,[474]3.3297,[475]3.3309,[476]3.3295,[477]3.3288,[478]3.3296,[479]3.3313,[480]3.3340,[481]3.3301,[482]3.3336,[483]3.3328,[484]3.3362,[485]3.3424,[486]3.3456,[487]3.3490,[488]3.3543,[489]3.3568,[490]3.3618,[491]3.3677,[492]3.3721,[493]3.3719,[494]3.3731,[495]3.3753,[496]3.3772,[497]3.3801,[498]3.3805,[499]3.3800,[500]3.3840,[501]3.3884,[502]3.3874,[503]3.3860,[504]3.3880,[505]3.3911,[506]3.3992,[507]3.4
022,[508]3.4056,[509]3.3983,[510]3.3933,[511]3.3871,[512]3.3828,[513]3.3771,[514]3.3755,[515]3.3777,[516]3.3727,[517]3.3729,[518]3.3715,[519]3.3721,[520]3.3760,[521]3.3750,[522]3.3736,[523]3.3789,[524]3.3779,[525]3.3764,[526]3.3723,[527]3.3672,[528]3.3640,[529]3.3609,[530]3.3581,[531]3.3550,[532]3.3495,[533]3.3437,[534]3.3394,[535]3.3401,[536]3.3425,[537]3.3456,[538]3.3478,[539]3.3507,[540]3.3558,[541]3.3588,[542]3.3612,[543]3.3559,[544]3.3518,[545]3.3513,[546]3.3449,[547]3.3389,[548]3.3325,[549]3.3262,[550]3.3204,[551]3.3143,[552]3.3088,[553]3.3030,[554]3.3015,[555]3.2999,[556]3.3026,[557]3.3065,[558]3.3124,[559]3.3168,[560]3.3221,[561]3.3202,
llama_print_timings: load time = 10030.15 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3302906.56 ms / 287232 tokens ( 11.50 ms per token, 86.96 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3306566.09 ms / 287233 tokens
Final estimate: PPL = 3.3202 +/- 0.01831
```

Some notes:

- The resulting model is slightly smaller than the original (353GB vs 387GB).
- I initially used `q5_k_r4` instead of `q3_k_r4` in some layers (where ubergarm's recipe uses `iq3_k_r4`). I didn't notice the `iq` vs `q` distinction until writing this up.
- I ran a perplexity test, but I lack proper benchmarks. The model seems comparable to the original for basic prompts.

Some questions:

- Does this quantization strategy make sense?
- Are there obvious issues with the layer ranges or quantization types?
- I meant to make it bigger/higher quality, but the file size is smaller. Why is that?
- I'm considering trying again, modifying ubergarm's `$custom` with `iq5_k_r4` in place of `iq3_k_r4` and `iq4_k_r4` in place of `iq2_k_r4`. Or should I just set them all to `IQ6_K` or something?

Any feedback or suggestions would be appreciated!
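One quick sanity check on the layer ranges: the loader prints a per-type tensor count at startup, and since the perplexity run above was redirected into `perp.txt`, that summary should already be in there. A minimal sketch (the `quantize.log` in the second command is hypothetical, assuming the llama-quantize output gets saved on a future run):

```bash
# Per-type tensor counts printed at model load time; perp.txt captured the loader output.
grep -E '^llama_model_loader: - type ' /mnt/nvme/models/nux/DeepSeek-V3-0324/perp.txt

# For per-tensor detail, llama-quantize logs a "converting to <type>" line for each tensor.
# If that output is saved (quantize.log is a placeholder), the range boundaries around
# blk 19/20 and 39/40 can be checked directly:
grep -E 'blk\.(19|20|39|40|60)\.ffn_(gate|up|down)_exps' quantize.log
```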
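On the "why is it smaller" question, some napkin math: nearly all of the 671B parameters sit in the routed experts (58 MoE layers Ɨ 256 experts Ɨ three 7168Ɨ2048 tensors each, per the loader metadata above), so their average bits-per-weight dominates the file size. A rough sketch, assuming the usual k-quant sizes (q5_K ā‰ˆ 5.5 bpw, q4_K ā‰ˆ 4.5, q3_K ā‰ˆ 3.4375; the `_r4` repacking doesn't change size):

```bash
# Rough GiB estimate for the routed experts under the recipe above:
# 17 layers of q5_k_r4 (blk 3-19), 20 of q4_k_r4 (blk 20-39), 21 of q3_k_r4 (blk 40-60).
awk 'BEGIN {
  per_layer = 3 * 256 * 7168 * 2048;                       # gate+up+down expert weights per MoE layer
  bits      = per_layer * (17*5.5 + 20*4.5 + 21*3.4375);   # bits across the three layer ranges
  printf "routed experts alone: ~%.0f GiB\n", bits / 8 / 2^30;   # roughly 336 GiB
}'
```

That averages out to roughly 4.4 bpw over the experts, and the q8_0 attention/dense/shared tensors add only ~17 GiB more, which lands right around the observed 353GB. So even with q5_K on the early layers, the q3_K tail pulls the expert average below whatever the 387GB IQ4_K_R4 original averages, and that accounts for the whole size difference.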
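For an apples-to-apples quality check, the same perplexity command could be rerun against the original IQ4_K_R4 file with the identical seed and context, so the two PPL numbers are directly comparable (the model path below is a placeholder, not the real shard name):

```bash
# Identical settings to the run above, pointed at the original quant for comparison.
./build/bin/llama-perplexity \
  --model /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/IQ4_K_R4-00001-of-XXXXX.gguf \
  -ctk q8_0 -mla 2 -fa -amb 512 -fmoe \
  --ctx-size 512 --ubatch-size 512 \
  -f wiki.test.raw --seed 1337 \
  --n-gpu-layers 63 --override-tensor exps=CPU --threads 32
```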
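And for the retry idea in the last question, here is a sketch of what the routed-expert section of `$custom` might look like with a single higher-precision tier instead of three. This is only a guess at a sensible split (down-projections one step above gate/up), and it assumes `iq5_k_r4`/`iq4_k_r4` are accepted as `--custom-q` targets; everything else would stay `q8_0` as in the script at the top:

```bash
# Hypothetical single-tier routed-expert rules for a retry (replacing the three
# q5/q4/q3 ranges above): ffn_down_exps gets one step more precision than gate/up.
blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.60\.ffn_down_exps\.weight=iq5_k_r4
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_k_r4
```

Assuming ~5.5/4.5 bpw for those types, the same arithmetic as above gives roughly 4.8 bpw over the experts, i.e. somewhere near 385 GiB total, so this mix should actually come out bigger than the current 353GB file, as intended.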