
🗣️ #166 - Learning more LLM quantization

Author robinnarsinghranabhat
Created 2025-01-05
Updated 2025-03-13

Description

As a beginner to ML, I wanted to learn which research papers guided the quantization implementation in llama.cpp.

It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?


🗣️ Discussion

👤 ikawrakow replied on 2025-01-05 at 10:37:28:

As a beginner to ML, I wanted to learn which research papers guided the quantization implementation in llama.cpp.

I developed all quantization types in llama.cpp apart from the legacy quants Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 (but these are very simple round-to-nearest block-wise quants). I did not read any research papers, just went ahead and experimented. Rarely reading papers has always been my approach to research. I have found that reading what others have done influences my thinking direction and hence may prevent me from finding a better approach. I only go and read papers if I have not been able to find a meaningful solution to a problem on my own.
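To make "round-to-nearest block-wise" concrete, here is a minimal C++ sketch in the spirit of Q8_0: the weights are split into fixed-size blocks, each block stores a single scale plus one small integer per weight, and each weight is simply rounded to the nearest representable value. The struct and function names are invented for this illustration; the actual ggml/llama.cpp formats pack the data more tightly (e.g. half-precision scales, 4-bit values for Q4_0).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Toy sketch of Q8_0-style round-to-nearest block-wise quantization:
// blocks of 32 weights, one float scale per block, one int8 per weight.
// Names are illustrative only, not llama.cpp's actual API.
struct BlockQ8 {
    float  scale;   // per-block scale d, so that w ≈ scale * q
    int8_t q[32];   // round-to-nearest quantized weights
};

std::vector<BlockQ8> quantize_q8(const std::vector<float>& w) {
    std::vector<BlockQ8> blocks(w.size() / 32);    // assumes w.size() % 32 == 0
    for (size_t b = 0; b < blocks.size(); ++b) {
        const float* x = w.data() + 32 * b;
        float amax = 0.0f;                          // largest |weight| in the block
        for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float scale = amax / 127.0f;          // map [-amax, amax] onto [-127, 127]
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
        blocks[b].scale = scale;
        for (int i = 0; i < 32; ++i)
            blocks[b].q[i] = static_cast<int8_t>(std::lround(x[i] * inv));  // round to nearest
    }
    return blocks;
}
// Dequantization is simply w ≈ blocks[b].scale * blocks[b].q[i].
```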

It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?

llama.cpp does not do any training, so it is always post-training quantization (PTQ). But in general there is also quantization-aware training (QAT), where the model is not actually quantized during training, but the model weights are forced to stay within a specified range in the hope that this will give better PTQ results. The only actually-quantized training approach I'm aware of is Bitnet from Microsoft Research, where a ternary model is trained (weights are -1, 0, 1, plus a per-tensor float scaling factor). More recently, researchers have been using fine-tuning for PTQ, where some corpus of training data is used to guide the quantization (look for, e.g., QuIP#, AQLM, QTIP). This is quite different from the simple quantization approaches used in llama.cpp and also here in this repository; it requires a full-fledged training framework such as PyTorch, powerful GPU(s), and many hours or days of GPU time.
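For illustration, here is a minimal C++ sketch of the ternary weight mapping used by BitNet-style models: a single "absmean" scale per tensor, with every weight rounded and clamped to -1/0/+1. This only shows the forward mapping; in Bitnet the mapping is applied during training (with a straight-through estimator for the gradients), which is not shown here. All type and function names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Toy sketch of ternary ("absmean") quantization: one scale per tensor,
// each weight mapped to {-1, 0, +1}. Names are invented for this example.
struct TernaryTensor {
    float scale;            // per-tensor scale, w ≈ scale * q
    std::vector<int8_t> q;  // ternary values in {-1, 0, +1}
};

TernaryTensor quantize_ternary(const std::vector<float>& w) {
    double sum = 0.0;
    for (float x : w) sum += std::fabs(x);
    const float scale = w.empty() ? 1.0f : static_cast<float>(sum / w.size()); // mean |w|
    const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;

    TernaryTensor t;
    t.scale = scale;
    t.q.reserve(w.size());
    for (float x : w) {
        long r = std::lround(x * inv);          // round to nearest integer
        r = std::max(-1L, std::min(1L, r));     // clamp to the ternary set
        t.q.push_back(static_cast<int8_t>(r));
    }
    return t;
}
```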


👤 robinnarsinghranabhat replied on 2025-01-10 at 21:38:11:

Thank you for this humble response!

Now I understand it's doing inference on quantized weights.

But I get lost trying to understand the llama.cpp codebase. How should I navigate it? I am comfortable with Python and machine learning concepts, and I understand pointers in C, but I have never written complex programs in C/C++.

Do I need to understand fundamental concepts of operating systems, computer architecture, memory management, etc.?

I want to be a programmer like you.

Sorry... lots of questions all over the place :(

👤 arnfaldur replied on 2025-03-13 at 02:10:31:

Trying to understand this codebase isn't attacking the wall where it's lowest. You're probably best off finding some beginner/intermediate C++ courses online; I imagine plenty are available for free. You don't strictly need to understand all of these fundamentals to follow what this project is doing, but you sound like you're in the "don't know what you don't know" phase, and a general computer science course would likely get you the farthest at this point.