Commit Graph

3399 Commits

Iwan Kawrakow
ff471dfd61 quantize_stats: print rmse and max error as fraction of <x>
This allows for a better comparison between different models
or different tensors of the same model where the magnitude of
the model weights may differ.
2024-08-19 13:47:19 +03:00
Kawrakow
c7b47fc67f iq2_k: slightly better bpw - accuracy compromise (#20)
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of
  half of attn_v with iq4_k
* Quantizing attn_output with iq3_k results in a larger PPL decrease
  compared to what one expects from the added bpw.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:36:51 +03:00
Kawrakow
6c5384f20e Skip barriers of noops (#19)
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops. I.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-14 10:40:09 +02:00
Kawrakow
bb5ff6fade Update README.md 2024-08-12 15:16:00 +02:00
Kawrakow
8f43e55103 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Iwan Kawrakow
f5d1af61d7 Fix Makefile
I always use cmake, so I had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c77dba5273 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
a829cb7794 iq6_k: Metal
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
48c4389e3d iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
595d2ae32d iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
849476acc7 iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot pack
the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
050bdfa101 iq6_k: CUDA dot product
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c3f5e4d9a7 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
a9b3f4a54b iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
cfb0410067 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Iwan Kawrakow
b409c15363 q2_K: allow it to detect ternary nets and quantize accordingly 2024-08-05 11:39:10 +02:00
Kawrakow
c11c7c8cae Update README.md
There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.
2024-08-05 07:35:30 +02:00
Iwan Kawrakow
6901b3bf14 iq3_k, iq5_k: faster quantization
Just use the same trick as iq4_k
2024-08-05 07:18:18 +02:00
Iwan Kawrakow
e830f4a5f7 iq4_k: speedup quantization by a factor of ~2 2024-08-03 18:38:39 +02:00
Iwan Kawrakow
3d1446b937 Add copyright notice 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
b572dd5347 iq2/3_k: tiny bit faster Metal dot products 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
394ed3913c iq3_k: slightly faster Metal dequantize kernel
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
062313dab4 iq3_k: Metal dot product
Quite slow: 43 t/s for a 7B model
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
57df5ccdd7 iq2_k: Metal dot product finally works
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
30d2d1b1eb iq3_k: Metal dequantize 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
4c2c644dcc iq3_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
93d09d1935 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
9d0cf7a399 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
fd1ae85a32 iq3_k: faster CUDA dot product
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
0d19d19af8 iq3_k: CUDA dot product
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
4f237d44f6 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
36204c4ec7 iq2_k: very slightly better CUDA dot product
169.2 t/s vs 167.8 t/s before.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
e950b17125 iq2_k: better CUDA dot product
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
ab4f9e1fdb iq2_k: CUDA dot product finally works
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
69842c6ad8 iq5_k: CUDA dot product finally works 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f6813cac0e Factor out iqk CUDA dot products
I cannot possibly wait for a 5-minute nvcc compilation
each time I touch vecdotq.cuh.

Also, cmake was adding --options-file X.rsp to the nvcc
compile commands, which confuses clangd, so I have turned
that off.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
22d1568c1c iq5_k: CUDA dot product still not working 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
d8d022a01b iq5_k: Metal
Performance is roughly on par with q5_0.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
bd36ade98d iq5_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0d0607f19 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c56ddee38c iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
5d341757bc iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
06e255ac9d iq2_k: Metal. Dot product is wrong 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f476ea3b50 iq2_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0fe03b5c8 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
7d08719975 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
13091d39e8 iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00