### [#261](https://github.com/ikawrakow/ik_llama.cpp/pull/261) - Compile time option to use bf16 for quants without MMQ kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | **Closed** |
| **Created** | 2025-03-17 |
| **Updated** | 2025-03-18 |
---
#### Description
The `IQ2_KS, IQ2_K, ..., IQ6_K` quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done by dequantizing to `fp16` and running a `cublasGemmEx` GEMM in `fp16` precision. For the DeepSeek series of MoE models this leads to NaNs, most likely because intermediate values exceed the `fp16` dynamic range.
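For illustration, the fallback path looks roughly like this (a minimal sketch with my own naming, not the repo's actual code; assumes column-major `fp16` operands):
```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// A: weights dequantized to fp16 (m x k), B: activations (k x n),
// C: result (m x n); all fp16, column-major.
void gemm_fp16(cublasHandle_t handle, const half * A, const half * B,
               half * C, int m, int n, int k) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    // Everything, including accumulation, is fp16: any intermediate value
    // above ~65504 (fp16 max) overflows to Inf, and later operations such
    // as Inf - Inf turn that into NaN.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
}
```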
Ideally I should add MMQ kernels for these quantization types, but for now the PR provides a quick fix: dequantize to `bf16` and use a `bf16` cuBLAS GEMM instead. This is added as a compile-time option enabled via
```
cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options
```
(or, if like me you prefer `ccmake`: after pulling the PR, run `cmake .. && ccmake .` and set `GGML_CUDA_IQK_FORCE_BF16` to `ON`).
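Under the hood, the CMake option becomes a preprocessor define that switches the dequantization target and the cuBLAS GEMM types. Roughly (an illustrative sketch, not the PR's exact code; function names are mine):
```cpp
#include <cublas_v2.h>
#include <cuda_bf16.h>

// bf16 variant of the GEMM: bf16 inputs, fp32 accumulation and output.
void gemm_bf16(cublasHandle_t handle, const __nv_bfloat16 * A,
               const __nv_bfloat16 * B, float * C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // bf16 shares fp32's 8-bit exponent, so values that overflow fp16
    // stay finite here; accumulation is done in fp32.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16BF, m,
                         B, CUDA_R_16BF, k,
                 &beta,  C, CUDA_R_32F,  m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}

// Dispatch gated by the compile-time option:
void mul_mat_no_mmq(/* ... */) {
#ifdef GGML_CUDA_IQK_FORCE_BF16
    // dequantize weights to bf16, convert activations, call gemm_bf16(...)
#else
    // default: dequantize to fp16 and use the fp16 GEMM shown earlier
#endif
}
```
With `fp32` accumulation and `bf16`'s wider exponent range, the overflow that produces the NaNs cannot occur in the GEMM itself.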
I have tested with DeepSeek-Lite quantized with `IQ4_KSS` and `IQ4_K`. In both cases I get NaNs when running `perplexity` on the main branch. Turning on the `GGML_CUDA_IQK_FORCE_BF16` option provided by this PR results in meaningful PPL values.
@davidsyoung This should solve the issues with the `IQ4_KSS` DeepSeek-R1 model you created.
---
#### Conversation
**davidsyoung** commented on **2025-03-17** at **23:38:28**:
Awesome! Will re-quant overnight and test tomorrow!
---
**saood06** commented on **2025-03-17** at **23:43:23**:
> Awesome! Will re-quant overnight and test tomorrow!
In case you still have the old quants, you can just use those with the new code; you don't have to make new quants.
---
**davidsyoung** commented on **2025-03-17** at **23:45:25**:
Unfortunately I don't! My cache drive is limited so I tend to delete pretty soon.