### 🔀 [#292](https://github.com/ikawrakow/ik_llama.cpp/pull/292) - Use bf16 instead of fp16 block scales for q8_1
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-03-26 |
| **Updated** | 2025-03-27 |
---
#### Description
DeepSeek-V3/R1 gives NaNs when inference is run on a computer with `AVX512_VNNI` and the model is quantized with `Q8_0/Q8_0_R8` (issue #285). The difference from the vanilla `AVX2` path is that with `AVX512_VNNI` the activations are quantized with `Q8_1/Q8_1_X4`, whose block scale and sum are stored as `fp16`.
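For reference, here is a minimal sketch of a `Q8_1`-style block, following upstream llama.cpp conventions (the exact definitions in ik_llama.cpp may differ slightly):

```c
#include <stdint.h>

#define QK8_1 32

typedef uint16_t ggml_half;   // IEEE fp16 bit pattern

// Q8_1-style block: both the scale `d` and the precomputed sum `s`
// are stored in fp16, which is where the range problem comes from.
typedef struct {
    ggml_half d;              // fp16 block scale
    ggml_half s;              // fp16: d * sum(qs[i]), used as the dot-product offset
    int8_t    qs[QK8_1];      // 32 signed 8-bit quants
} block_q8_1;
```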
We had similar issues with `IQ1_S`, which were solved in #194 by switching to a different quantization type for the activations. I created issue #196 because of that.
We also observed NaNs on CUDA for `IQ4_K` and `IQ4_KS`. These quantization types do not have MMQ kernels, so matrix multiplications were done by dequantizing to `fp16` and using cuBLAS GEMM. The NaNs were resolved by dequantizing to `bf16` instead (PR #261).
So, it seems one cannot use `fp16` arithmetic in DeepSeek-V3/R1.
This is further confirmed by #291, where we observe no NaNs when switching `Q8_0/Q8_0_R8` to the vanilla `AVX2` implementation.
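As a hypothetical illustration of the range problem (not code from this PR): the largest finite `fp16` value is 65504, so the per-block sum of large activations can already overflow, while `bf16` keeps the `fp32` exponent range at reduced precision:

```c
#include <stdio.h>

int main(void) {
    const float FP16_MAX = 65504.0f;        // largest finite fp16 value
    const float BF16_MAX = 3.3895314e38f;   // largest finite bf16 value (~fp32 range)

    // Hypothetical activation block: 32 values of magnitude ~2100 each.
    float block_sum = 32 * 2100.0f;         // 67200 > FP16_MAX -> an fp16 sum becomes inf
    printf("block sum = %.0f | fits in fp16: %s | fits in bf16: %s\n",
           block_sum,
           block_sum <= FP16_MAX ? "yes" : "no",
           block_sum <= BF16_MAX ? "yes" : "no");
    return 0;
}
```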
This PR introduces `Q8_2/Q8_2_X4` quantization types that use `bf16` block scale and sum. All quantization types that previously used `Q8_1/Q8_1_X4` to quantize activations for CPU GEMM/GEMV are switched to `Q8_2/Q8_2_X4`.
This should resolve all NaNs on the CPU.
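A minimal sketch of what a `bf16`-scaled block and the `fp32`-to-`bf16` conversion could look like (hypothetical layout and names for illustration; see the actual ik_llama.cpp sources for the real definitions):

```c
#include <stdint.h>
#include <string.h>

#define QK8_2 32

// Q8_2-style block: same quants as Q8_1, but scale and sum stored as bf16.
typedef struct {
    uint16_t d;            // bf16 block scale
    uint16_t s;            // bf16: d * sum(qs[i])
    int8_t   qs[QK8_2];    // 32 signed 8-bit quants
} block_q8_2;

// fp32 -> bf16: keep the upper 16 bits, rounding to nearest even
// (NaN handling omitted for brevity).
static inline uint16_t fp32_to_bf16(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    u += 0x7FFF + ((u >> 16) & 1);
    return (uint16_t)(u >> 16);
}
```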
I wonder why we are not getting NaNs on CUDA for the quantization types that do use `Q8_1`. Or maybe we do, and it is just that nobody has reported it.
Closes #285 and #196
---
#### 💬 Conversation
👤 **ubergarm** commented the **2025-03-26** at **19:37:47**:<br>
I'm mostly AFK until Friday, but will try to rebuild with this PR and test perplexity and imatrix again with a `q8_0` model on the CPU-only Xeon 6980P rig if I get a moment before then. Thanks!
---
👤 **ikawrakow** commented the **2025-03-27** at **04:49:07**:<br>
Thank you for verifying that it works!
---
👤 **saood06** commented the **2025-03-27** at **08:14:07**:<br>
> Closes #285 and #196
This only closed #285; to close multiple issues you need to repeat the keyword for each one, separated by a comma ([source](https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue)).
Closes #196
---
👤 **saood06** commented the **2025-03-27** at **08:23:08**:<br>
> So, it seems one cannot use fp16 arithmetic in DeepSeek-V3/R1.

Is this why the imatrix run in https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussioncomment-12429240 was failing?
---
👤 **ikawrakow** commented the **2025-03-27** at **08:27:17**:<br>
> Is this why the imatrix run in https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussioncomment-12429240 was failing?
With a very high degree of probability, yes. I get NaNs even for DeepSeek-Lite when I use the `fp16` model on the GPU.