Using 12 bits per 8 weights, I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By SIMDifying the search for the best code with AVX2, LLaMA-3.1-8B gets quantized in 130 seconds on a Ryzen-7950X CPU - sluggish, but still acceptable.
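
For illustration, here is a minimal sketch of what an AVX2-accelerated search for the best code could look like, assuming 12 bits per group means a 4096-entry codebook of 8 int8 values each, and a per-group float scale. The names `best_code_avx2` and `codebook` are hypothetical, not the actual ik_llama.cpp implementation; the real codebook layout and scale handling may differ.

```cpp
// Hypothetical sketch: brute-force AVX2 search over a 4096-entry codebook
// (12 bits per group of 8 weights). Assumed layout, not the repo's code.
#include <immintrin.h>
#include <cstdint>
#include <cfloat>

// Return the index of the codebook entry (8 int8 values) that minimizes
// the squared error against the 8 target weights x, given group scale d.
static int best_code_avx2(const float * x, float d, const int8_t codebook[4096][8]) {
    __m256 vx = _mm256_loadu_ps(x);   // the 8 weights to match
    __m256 vd = _mm256_set1_ps(d);    // group scale
    int   best_i   = 0;
    float best_err = FLT_MAX;
    for (int i = 0; i < 4096; ++i) {
        // widen the 8 int8 codes to 8 floats
        __m128i q8 = _mm_loadl_epi64((const __m128i *)codebook[i]);
        __m256  vq = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(q8));
        // squared differences between targets and scaled codes
        __m256 diff = _mm256_sub_ps(vx, _mm256_mul_ps(vd, vq));
        __m256 err  = _mm256_mul_ps(diff, diff);
        // horizontal sum of the 8 squared differences
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(err), _mm256_extractf128_ps(err, 1));
        lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
        lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1));
        float e = _mm_cvtss_f32(lo);
        if (e < best_err) { best_err = e; best_i = i; }
    }
    return best_i;
}
```

With 8 squared differences computed per instruction instead of one, the inner loop over the 4096 candidates is roughly 8x cheaper than a scalar search, which is what makes the per-tensor brute force feasible at whole-model scale.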