Commit Graph

3399 Commits

Author SHA1 Message Date
Kawrakow
fad55b735e quantize_stats: print rmse and max error as fraction of <x> (#21)
This allows for a better comparison between different models
or different tensors of the same model where the magnitude of
the model weights may differ.
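
A minimal sketch of why normalizing helps, not the quantize_stats implementation. It assumes `<x>` means the mean absolute value of the original weights (the commit does not spell this out); `relative_errors` is a hypothetical helper name.

```python
import math

def relative_errors(original, dequantized):
    """Return (rmse / <x>, max_abs_error / <x>), taking <x> as mean |x| (an assumption)."""
    if len(original) != len(dequantized) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(original, dequantized)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    max_err = max(abs(d) for d in diffs)
    mean_abs = sum(abs(a) for a in original) / len(original)
    if mean_abs == 0.0:
        raise ValueError("mean magnitude is zero; relative error is undefined")
    return rmse / mean_abs, max_err / mean_abs

# Two toy tensors whose weights differ in magnitude by 100x.
small = relative_errors([0.01, -0.02, 0.03], [0.011, -0.019, 0.031])
large = relative_errors([1.0, -2.0, 3.0], [1.1, -1.9, 3.1])
```

The absolute rmse of the two toy tensors differs by a factor of 100, but the normalized values agree, which is what makes tensors of different scale comparable.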

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:49:28 +03:00
Kawrakow
041d79925c iq2_k: slightly better bpw - accuracy compromise (#20)
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of
  half of attn_v with iq4_k
* Quantizing attn_output with iq3_k results in a larger PPL decrease
  compared to what one expects from the added bpw.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:36:51 +03:00
Kawrakow
a58853bf5e Skip barriers of noops (#19)
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops. I.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.
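
The idea can be sketched with a toy scheduler loop. This is an illustration only, not ggml's actual graph-compute code: `NOOP_OPS` and `count_barriers` are hypothetical names, and ops are plain strings rather than `GGML_OP_*` enum values.

```python
# Ops that only change tensor metadata and perform no computation.
NOOP_OPS = {"NONE", "RESHAPE", "VIEW", "PERMUTE", "TRANSPOSE"}

def count_barriers(ops, skip_noops):
    """Count the thread barriers needed to walk a graph of ops in order."""
    barriers = 0
    for op in ops:
        if skip_noops and op in NOOP_OPS:
            # Nothing was computed, so there is nothing to synchronize on.
            continue
        # (the real work for `op` would be done by all threads here)
        barriers += 1  # wait for every thread before the next op
    return barriers

graph = ["MUL_MAT", "RESHAPE", "PERMUTE", "MUL_MAT", "VIEW", "ADD"]
before = count_barriers(graph, skip_noops=False)
after = count_barriers(graph, skip_noops=True)
```

In this toy graph half of the barriers disappear. For small models, where each op is short, the saved synchronization is a large share of the total time, consistent with the roughly 15% gain (1426 vs 1240 t/s) reported above.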

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-14 10:40:09 +02:00
Kawrakow
25ade24526 Update README.md 2024-08-12 15:16:00 +02:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
5ed6d94cb5 Fix Makefile
I always use cmake, so had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Kawrakow
af2bb96de5 Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Kawrakow
3f67708b91 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Kawrakow
fa668c7dcb iq6_k: Metal
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
2024-08-09 16:00:31 +02:00
Kawrakow
ed462a512a iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Kawrakow
ef32a01c2a iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Kawrakow
0bee1c0c0a iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bit for 4-,5-,6-bit iqX_k
quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
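
A rough worst-case bound shows why packing to 16 bits can overflow. This is a sketch under stated assumptions, not the actual kernel arithmetic: the number of products per accumulated sum (16), the activation range (up to 127 in magnitude), and the example quant magnitudes are all hypothetical values chosen for illustration.

```python
INT16_MAX = 32767

def worst_case_block_sum(max_quant, max_activation, values_per_sum):
    """Largest possible magnitude of a sum of quant * activation products."""
    return max_quant * max_activation * values_per_sum

def fits_in_int16(max_quant, max_activation=127, values_per_sum=16):
    """True if the worst-case sum can be stored in a signed 16-bit integer."""
    return worst_case_block_sum(max_quant, max_activation, values_per_sum) <= INT16_MAX

# Small quant magnitudes leave plenty of headroom ...
low_bit_ok = fits_in_int16(max_quant=3)
# ... but magnitudes near the int8 limit do not fit.
wide_range_ok = fits_in_int16(max_quant=127)
```

Under these assumptions any quant magnitude above 16 can exceed the 16-bit range in the worst case, so the 32-bit sums have to be kept at 32 bits until after the block scales are applied.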
2024-08-09 16:00:31 +02:00
Kawrakow
1593acd09a iq6_k: CUDA dot product
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
2024-08-09 16:00:31 +02:00
Kawrakow
4fda827258 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00
Kawrakow
4b2c94618f iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Kawrakow
81266c22d6 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
58a323f585 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP-512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
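
The estimate in the "AVX2 PP improvement" item above can be reproduced as follows. The sketch assumes, as that item does, that prompt-processing throughput scales inversely with parameter count; `expected_tps` is a hypothetical helper name.

```python
def expected_tps(reference_tps, reference_params_b, target_params_b):
    """Scale a reference throughput by the ratio of parameter counts (in billions)."""
    return reference_tps * reference_params_b / target_params_b

# iq2_bn reference: 636.61 t/s at 3.43B parameters; iq2_tn target: 3.99B parameters.
expected = expected_tps(636.61, 3.43, 3.99)
measured = 490.73
shortfall = 1.0 - measured / expected  # fraction of the expected speed still missing
```

With these inputs the expected value comes out at about 547 t/s (the commit rounds the reference to 636 and gets 546), leaving iq2_tn roughly 10% short of the estimate.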

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Kawrakow
695c7eef49 q2_K: allow it to detect ternary nets and quantize accordingly 2024-08-05 11:39:10 +02:00
Kawrakow
74f2f50abf Update README.md
There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.
2024-08-05 07:35:30 +02:00
Kawrakow
8db9c5b1fd iq3_k, iq5_k: faster quantization
Just use the same trick as iq4_k
2024-08-05 07:18:18 +02:00
Kawrakow
5b06d7999c iq4_k: speedup quantization by a factor of ~2 2024-08-03 18:38:39 +02:00
Kawrakow
2af0d6fbac Add copyright notice 2024-08-01 09:38:06 +02:00
Kawrakow
904fdbcfb7 iq2/3_k: tiny bit faster Metal dot products 2024-08-01 09:38:06 +02:00
Kawrakow
088a8360a1 iq3_k: slightly faster Metal dequantize kernel
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01 09:38:06 +02:00
Kawrakow
606f02ae89 iq3_k: Metal dot product
Quite slow: 43 t/s for a 7B model
2024-08-01 09:38:06 +02:00
Kawrakow
95a6820d79 iq2_k: Metal dot product finally works
It is slow: 45.4 t/s for 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
2024-08-01 09:38:06 +02:00
Kawrakow
033299c9f9 iq3_k: Metal dequantize 2024-08-01 09:38:06 +02:00
Kawrakow
2927d4f841 iq3_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
9c1eea6048 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Kawrakow
a9fa3b1563 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
2024-08-01 09:38:06 +02:00
Kawrakow
a4371b7842 iq3_k: faster CUDA dot product
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01 09:38:06 +02:00
Kawrakow
81f15c0ba8 iq3_k: CUDA dot product
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
2024-08-01 09:38:06 +02:00
Kawrakow
fb4cff3458 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Kawrakow
7dcd64c9bd iq2_k: very slightly better CUDA dot product
169.2 t/s vs 167.8 t/s before.
2024-08-01 09:38:06 +02:00
Kawrakow
0c1d7383a5 iq2_k: better CUDA dot product
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01 09:38:06 +02:00
Kawrakow
f30bcc1e17 iq2_k: CUDA dot product finally works
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
2024-08-01 09:38:06 +02:00
Kawrakow
53fdb30ca6 iq5_k: CUDA dot product finally works 2024-08-01 09:38:06 +02:00
Kawrakow
8654a425ae Factor out iqk CUDA dot products
I cannot possibly wait for a 5-minute nvcc compilation
each time I touch vecdotq.cuh.

Also, cmake was adding --options-file X.rsp to the nvcc
compile commands, which confuses clangd, so I have turned
that off.
2024-08-01 09:38:06 +02:00
Kawrakow
99456e2e94 iq5_k: CUDA dot product still not working 2024-08-01 09:38:06 +02:00
Kawrakow
b591023479 iq5_k: Metal
Performance is roughly on par with q5_0.
2024-08-01 09:38:06 +02:00
Kawrakow
0ab3f0ff86 iq5_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
daf608e227 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
e9c3ebcbe9 iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Kawrakow
e5cd93b4b7 iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Kawrakow
ace8f921bb iq2_k: Metal. Dot product is wrong 2024-08-01 09:38:06 +02:00
Kawrakow
f7ab9a13df iq2_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
cc8e351b68 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
764d4675b8 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
21319d6fca iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Kawrakow
3f7dad3000 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00