Commit Graph

7 Commits

Author SHA1 Message Date
Kawrakow
785cac7ee5 bitnet: put the scale in a separate tensor
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
2024-06-22 12:02:52 +03:00
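For orientation, a minimal sketch of the two-step idea described above: the quantized weights stay unscaled, the scale lives in its own tensor, and it is applied in a second step after the quantized matmul. The function and names below are hypothetical and this is plain C++, not the actual ggml graph with its added ggml_mul_mat operation.

```cpp
#include <cstdint>
#include <vector>

std::vector<float> matmul_then_scale(
        const std::vector<int8_t> & Wq,    // quantized weights, rows x cols
        const std::vector<float>  & scale, // separate scale tensor (one per row here, an assumption)
        const std::vector<int8_t> & xq,    // quantized activations, length cols
        int rows, int cols) {
    std::vector<float> y(rows);
    for (int r = 0; r < rows; ++r) {
        int32_t acc = 0;                          // pure integer dot product, no scale involved
        for (int c = 0; c < cols; ++c) {
            acc += int32_t(Wq[r*cols + c]) * int32_t(xq[c]);
        }
        y[r] = float(acc) * scale[r];             // the extra multiply corresponds to the added op
    }
    return y;
}
```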
Kawrakow
1f9541172f Bitnet(1.75 bpw): higher precision fp8 scale
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (though the previous
version with 4 bits each for the exponent and mantissa was
good enough for any practical purpose).
2024-06-22 12:02:52 +03:00
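As a worked illustration of such an e3m5 format for a positive scale (no sign bit), here is a small codec sketch. The exponent bias, subnormal rule, and rounding are assumptions made for the example, not necessarily the encoding used in the repository.

```cpp
#include <cstdint>
#include <cmath>

constexpr int kExpBits = 3;
constexpr int kManBits = 5;
constexpr int kBias    = 3;   // assumed exponent bias

float fp8_e3m5_to_float(uint8_t v) {
    const int e = v >> kManBits;                 // top 3 bits: exponent
    const int m = v & ((1 << kManBits) - 1);     // low 5 bits: mantissa
    if (e == 0) {                                // subnormal: no implicit leading 1
        return std::ldexp(float(m) / (1 << kManBits), 1 - kBias);
    }
    return std::ldexp(1.0f + float(m) / (1 << kManBits), e - kBias);
}

uint8_t float_to_fp8_e3m5(float x) {
    if (x <= 0.0f) return 0;
    int e;
    const float f = std::frexp(x, &e);           // x = f * 2^e, f in [0.5, 1)
    int exp = e - 1 + kBias;                     // rewrite as 1.m * 2^(e-1)
    if (exp <= 0) return 0;                      // flush-to-zero (assumption)
    int man = int(std::lround((2.0f*f - 1.0f) * (1 << kManBits)));
    if (man == (1 << kManBits)) { man = 0; ++exp; }   // mantissa rounded up
    if (exp >= (1 << kExpBits)) {                // saturate at the largest code
        exp = (1 << kExpBits) - 1;
        man = (1 << kManBits) - 1;
    }
    return uint8_t((exp << kManBits) | man);
}
```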
Kawrakow
39982764d7 Bitnet: 2.25 bpw version
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Kawrakow
318899c8b7 bitnet: add 2 bpw quantization
The scalar dot product already achieves 37 t/s for TG!
2024-06-22 12:02:51 +03:00
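A minimal scalar sketch of a 2-bit dot product against int8 activations. The packing (four 2-bit codes per byte, codes 0,1,2 mapping to -1,0,+1) and the scale handling are illustrative assumptions, not the actual block layout of the repository's kernel.

```cpp
#include <cstdint>
#include <cstddef>

float dot_2bit_q8(const uint8_t * packed,  // n/4 bytes of 2-bit weight codes
                  const int8_t  * q8,      // n quantized activations
                  size_t n, float w_scale, float q8_scale) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        const int code = (packed[i >> 2] >> ((i & 3) * 2)) & 3;  // unpack one 2-bit code
        acc += (code - 1) * int32_t(q8[i]);    // code 3 unused in this sketch
    }
    return float(acc) * w_scale * q8_scale;    // both scales applied once at the end
}
```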
Kawrakow
f9ba085ef7 Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice 2024-06-22 12:02:51 +03:00
Kawrakow
b0967ffa79 bitnet: fix scalar dot product
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 12:02:51 +03:00
Kawrakow
81576cdcac bitnet: python + llama 2024-06-22 12:02:51 +03:00