
🔀 #261 - Compile time option to use bf16 for quants without MMQ kernels

Author ikawrakow
State Closed
Created 2025-03-17
Updated 2025-03-18

Description

The IQ2_KS, IQ2_K, ..., IQ6_K quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are performed by dequantizing to fp16 and running a cublasGemmEx GEMM in fp16 precision. For the DeepSeek series of MoE models this leads to NaNs.

Ideally I should add MMQ kernels for these quantization types. But for now, the PR provides a quick fix: dequantize to bf16 and use a bf16 cuBLAS GEMM. This is added as a compile-time option enabled via

cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options

(or, if like me you prefer ccmake: after pulling the PR, run cmake .. && ccmake . and set GGML_CUDA_IQK_FORCE_BF16 to ON).
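
For illustration, here is a minimal, hypothetical sketch (not the PR's actual code; the function name, argument layout, and buffer handling are my assumptions) of how such a compile-time switch can route the dequantized weights through a bf16 cublasGemmEx call instead of the fp16 one. bf16 shares fp32's exponent range, so intermediate values that overflow fp16 in the DeepSeek MoE layers stay finite, at the cost of fewer mantissa bits.

```cpp
// Hypothetical sketch (not the actual ik_llama.cpp code) of a compile-time
// switch like GGML_CUDA_IQK_FORCE_BF16 that changes the dequantize-then-GEMM
// fallback from fp16 to bf16. Matrices are assumed column-major, as cuBLAS expects.
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cuda_fp16.h>

// A: m x k dequantized weights, B: k x n activations, C: m x n result.
// With GGML_CUDA_IQK_FORCE_BF16 defined, A/B/C are assumed to point to bf16
// buffers; otherwise they point to fp16 buffers.
static cublasStatus_t iqk_fallback_gemm(cublasHandle_t handle,
                                        const void *A, const void *B, void *C,
                                        int m, int n, int k) {
#ifdef GGML_CUDA_IQK_FORCE_BF16
    // bf16 inputs with fp32 accumulation: same exponent range as fp32, so the
    // large intermediate values seen in DeepSeek MoE layers stay finite.
    const float alpha = 1.0f, beta = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16BF, m,
                                B, CUDA_R_16BF, k,
                        &beta,  C, CUDA_R_16BF, m,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
#else
    // fp16 throughout (the status quo for the IQK quants without MMQ kernels):
    // values beyond roughly 65504 overflow and turn into NaNs downstream.
    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16F, m,
                                B, CUDA_R_16F, k,
                        &beta,  C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
#endif
}
```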

I have tested with DeepSeek-Lite quantized with IQ4_KSS and IQ4_K. In both cases I get NaNs when running perplexity on the main branch. Turning on the GGML_CUDA_IQK_FORCE_BF16 option provided by this PR results in meaningful PPL values.

@davidsyoung This should solve the issues with the IQ4_KSS DeepSeek-R1 model you created.


💬 Conversation

👤 davidsyoung commented on 2025-03-17 at 23:38:28:

Awesome! Will re-quant overnight and test tomorrow!


👤 saood06 commented on 2025-03-17 at 23:43:23:

> Awesome! Will re-quant overnight and test tomorrow!

In case you still have the old quants, you can just use those with the new code; you don't have to make new quants.


👤 davidsyoung commented on 2025-03-17 at 23:45:25:

Unfortunately I don't! My cache drive is limited, so I tend to delete them pretty soon.