### 🔀 [#261](https://github.com/ikawrakow/ik_llama.cpp/pull/261) - Compile time option to use bf16 for quants without MMQ kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-03-17 |
| **Updated** | 2025-03-18 |
---
#### Description
The `IQ2_KS, IQ2_K, ..., IQ6_K` quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done via dequantization to `fp16` and `cublasGemmEx` GEMM using `fp16` precision. For the DeepSeek series of MoE models this leads to NaNs.
Ideally I should add MMQ kernels for these quantization types. But for now, the PR provides a quick fix: dequantize to `bf16` and use `bf16` cuBLAS GEMM. This is added as a compile time option enabled via
```
cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options
```
(Or, if like me you prefer `ccmake`: after pulling the PR, run `cmake .. && ccmake .` and set `GGML_CUDA_IQK_FORCE_BF16` to `ON`.)
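For illustration, below is a minimal sketch (not the actual ik_llama.cpp code) of how such a compile-time switch could select the cuBLAS input type on the dequantize-then-GEMM path; the helper name, layout, and calling convention are hypothetical, and handle/buffer management is omitted. The idea is that bf16 shares fp32's exponent range, so values that would overflow fp16 can presumably stay finite.

```cpp
// Minimal illustrative sketch, NOT the actual ik_llama.cpp implementation.
// Shows how a compile-time option like GGML_CUDA_IQK_FORCE_BF16 could pick
// the cuBLAS input element type. Helper name and layout are hypothetical.
#include <cublas_v2.h>

// A: dequantized weights (m x k), B: activations (k x n), C: fp32 output
// (m x n); column-major, no transposition, tight leading dimensions.
static cublasStatus_t mul_mat_dequantized(cublasHandle_t handle,
                                          const void * A, const void * B,
                                          float * C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
#ifdef GGML_CUDA_IQK_FORCE_BF16
    const cudaDataType src_type = CUDA_R_16BF; // weights dequantized to bf16
#else
    const cudaDataType src_type = CUDA_R_16F;  // default: dequantized to fp16
#endif
    // Only the input element type changes between the two builds here;
    // accumulation is done in fp32 in this sketch.
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, src_type, m,
                        B, src_type, k,
                        &beta,
                        C, CUDA_R_32F, m,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

Note that fast bf16 GEMM generally requires an Ampere-or-newer GPU (compute capability 8.0+) for tensor-core support.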
I have tested with DeepSeek-Lite quantized with `IQ4_KSS` and `IQ4_K`. In both cases I get NaNs when running `perplexity` on the main branch. Turning on the `GGML_CUDA_IQK_FORCE_BF16` option provided by this PR results in meaningful PPL values.
@davidsyoung This should solve the issues with the `IQ4_KSS` DeepSeek-R1 model you created.
---
#### 💬 Conversation
👤 **davidsyoung** commented the **2025-03-17** at **23:38:28**:<br>
Awesome! Will re-quant over night and test tomorrow!
---
👤 **saood06** commented the **2025-03-17** at **23:43:23**:<br>
> Awesome! Will re-quant over night and test tomorrow!
In case you still have the old quants, you can just use those with the new code; you don't have to make new quants.
---
👤 **davidsyoung** commented the **2025-03-17** at **23:45:25**:<br>
Unfortunately I don't! My cache drive is limited, so I tend to delete them pretty soon.