### 🔀 [#261](https://github.com/ikawrakow/ik_llama.cpp/pull/261) - Compile time option to use bf16 for quants without MMQ kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-03-17 |
| **Updated** | 2025-03-18 |
---
#### Description
The `IQ2_KS, IQ2_K, ..., IQ6_K` quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done via dequantization to `fp16` and `cublasGemmEx` GEMM using `fp16` precision. For the DeepSeek series of MoE models this leads to NaNs.
Ideally I should add MMQ kernels for these quantization types. But for now, the PR provides a quick fix: dequantize to `bf16` and use `bf16` cuBLAS GEMM. This is added as a compile time option enabled via
```
cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options
```
(Or, if like me you prefer `ccmake`: after pulling the PR, run `cmake .. && ccmake .` and set `GGML_CUDA_IQK_FORCE_BF16` to `ON`.)
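For illustration, below is a minimal sketch (not the actual ik_llama.cpp code) of how such a compile-time switch could select the cuBLAS input type on the dequantize-then-GEMM path; the helper name, layout, and calling convention are hypothetical, and handle/buffer management is omitted. The idea is that bf16 shares fp32's exponent range, so values that would overflow fp16 can presumably stay finite.

```cpp
// Minimal illustrative sketch, NOT the actual ik_llama.cpp implementation.
// Shows how a compile-time option like GGML_CUDA_IQK_FORCE_BF16 could pick
// the cuBLAS input element type. Helper name and layout are hypothetical.
#include <cublas_v2.h>

// A: dequantized weights (m x k), B: activations (k x n), C: fp32 output
// (m x n); column-major, no transposition, tight leading dimensions.
static cublasStatus_t mul_mat_dequantized(cublasHandle_t handle,
                                          const void * A, const void * B,
                                          float * C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
#ifdef GGML_CUDA_IQK_FORCE_BF16
    const cudaDataType src_type = CUDA_R_16BF; // weights dequantized to bf16
#else
    const cudaDataType src_type = CUDA_R_16F;  // default: dequantized to fp16
#endif
    // Only the input element type changes between the two builds here;
    // accumulation is done in fp32 in this sketch.
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, src_type, m,
                        B, src_type, k,
                        &beta,
                        C, CUDA_R_32F, m,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

Note that fast bf16 GEMM generally requires an Ampere-or-newer GPU (compute capability 8.0+) for tensor-core support.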
I have tested with DeepSeek-Lite quantized with `IQ4_KSS` and `IQ4_K`. In both cases I get NaNs when running `perplexity` on the main branch. Turning on the `GGML_CUDA_IQK_FORCE_BF16` option provided by this PR results in meaningful PPL values.
@davidsyoung This should solve the issues with the `IQ4_KSS` DeepSeek-R1 model you created.
---
#### 💬 Conversation
👤 **davidsyoung** commented the **2025-03-17** at **23:38:28**:<br>
Awesome! Will re-quant over night and test tomorrow!
---
👤 **saood06** commented the **2025-03-17** at **23:43:23**:<br>
> Awesome! Will re-quant over night and test tomorrow!
In case you still have the old quants, you can just use those with the new code; you don't have to make new quants.
---
👤 **davidsyoung** commented the **2025-03-17** at **23:45:25**:<br>
Unfortunately I don't! My cache drive is limited, so I tend to delete them pretty soon.