Commit Graph

22 Commits

Author SHA1 Message Date
Iwan Kawrakow
31ed9b331e WIP: plugging into ggml_compute_forward_flash_attn_ext_f16
Now everything is done in iqk_flash_helper_2.
It is slower than running without FA:
at 2048 tokens we get 167 vs 176 t/s.
This is better than Georgi's FA (138 t/s), but
at 8192 tokens we degrade to 93 t/s vs 134 t/s without FA.
2024-08-23 16:48:35 +03:00
Iwan Kawrakow
ffeb8b40eb WIP: plugging into ggml_compute_forward_flash_attn_ext_f16
This is now working. It is not faster, but at least it is not
massively slower than the original.
2024-08-23 15:47:08 +03:00
Iwan Kawrakow
b127c6cced WIP: Fusing K*Q and softmax - not working yet 2024-08-23 09:56:28 +03:00
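For context, a minimal sketch of what fusing K*Q with the softmax buys: with a streaming ("online") softmax one never materializes the full K*Q row. This is the generic flash-attention rescaling trick, assuming a single query row and row-major K and V; it is not the actual iqk implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch: softmax(scale * q.K^T) . V for one query row,
// computed in a single streaming pass over the KV cache so the K*Q row
// is never written out. Names and shapes are illustrative only.
void fused_qk_softmax_v(const float * q, const float * K, const float * V,
                        float * out, int n_kv, int head_dim, float scale) {
    float M = -INFINITY;                      // running max of scaled logits
    float S = 0.0f;                           // running sum of exp(logit - M)
    std::vector<float> acc(head_dim, 0.0f);   // running weighted sum of V rows
    for (int j = 0; j < n_kv; ++j) {
        float logit = 0.0f;
        for (int d = 0; d < head_dim; ++d) logit += q[d]*K[j*head_dim + d];
        logit *= scale;
        const float M_new = std::max(M, logit);
        const float r = std::exp(M - M_new);     // rescale old accumulator
        const float p = std::exp(logit - M_new); // weight of the new key
        for (int d = 0; d < head_dim; ++d) acc[d] = acc[d]*r + p*V[j*head_dim + d];
        S = S*r + p;
        M = M_new;
    }
    for (int d = 0; d < head_dim; ++d) out[d] = acc[d]/S;
}
```

The tricky part, which the WIP status above hints at, is making the per-step rescaling cheap enough that the fused path actually beats computing K*Q and the softmax separately.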
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c77dba5273 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
48c4389e3d iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
595d2ae32d iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: iq3_k also needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
849476acc7 iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values (a 64-entry
codebook spans four 16-byte shuffle tables), which makes it slower
than the other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with the 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot pack
the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants (see the overflow arithmetic sketched after this entry).

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-09 16:00:31 +02:00
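To put numbers on the overflow described above, a small sketch of the worst-case arithmetic, assuming (for illustration) 6-bit quant values of up to 63 after the unsigned offset and Q8 activations of up to 127:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // _mm512_dpbusds_epi32 accumulates dot products of 4 unsigned bytes
    // with 4 signed bytes into 32-bit lanes. Assumed worst case for a
    // 6-bit quant (values up to 63) against Q8 activations (up to 127):
    const int one_vnni_step = 4 * 63 * 127;   //  32004 - already near INT16_MAX
    const int block_of_32   = 32 * 63 * 127;  // 256032 - far beyond INT16_MAX
    printf("4-wide dot: %d, 32-wide block: %d, INT16_MAX: %d\n",
           one_vnni_step, block_of_32, (int)INT16_MAX);
    // Packing such sums back into 16 bits before applying the block scales
    // silently truncates, which is why the 32-bit result has to be kept
    // for the 4-, 5-, and 6-bit iqX_k quants.
    return 0;
}
```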
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation (see the ternary packing sketch after this entry).

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP-512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s; it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
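For reference, where the 2.0625 bpw in the first bullet can come from: assuming blocks of 256 ternary weights, 2 bits per weight plus one fp16 block scale gives (256*2 + 16)/256 = 2.0625 bits per weight. A minimal pack/unpack sketch with a hypothetical block layout (not the actual IQ2_TN struct):

```cpp
#include <cstdint>

// Hypothetical 2.0625 bpw ternary block:
// 256 weights * 2 bits = 64 bytes of packed trits, plus one fp16 scale
// => (64 + 2) * 8 / 256 = 2.0625 bits per weight.
struct block_ternary {
    uint16_t d;       // fp16 block scale (raw bits)
    uint8_t  qs[64];  // 256 weights, 4 per byte, 2 bits each
};

// Pack ternary weights {-1, 0, +1} by storing w + 1 in {0, 1, 2}.
static void pack_ternary(const int8_t * w, uint8_t * qs) {
    for (int i = 0; i < 64; ++i) {
        qs[i] = (uint8_t)((w[4*i+0]+1)      | (w[4*i+1]+1) << 2 |
                          (w[4*i+2]+1) << 4 | (w[4*i+3]+1) << 6);
    }
}

// Unpack back to {-1, 0, +1}.
static void unpack_ternary(const uint8_t * qs, int8_t * w) {
    for (int i = 0; i < 64; ++i) {
        for (int k = 0; k < 4; ++k) {
            w[4*i+k] = (int8_t)(((qs[i] >> 2*k) & 3) - 1);
        }
    }
}
```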
Iwan Kawrakow
4c2c644dcc iq3_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
93d09d1935 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
9d0cf7a399 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
bd36ade98d iq5_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0d0607f19 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c56ddee38c iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f476ea3b50 iq2_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0fe03b5c8 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
7d08719975 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
13091d39e8 iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
291066e6df IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975WX.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00
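For reference, a minimal sketch of the non-uniform idea behind IQ4_K: each 4-bit index selects an entry from a non-linear 16-value codebook, so resolution is concentrated where weights cluster near zero. The codebook below is the iq4_nl-style table used purely as an illustration, and the block shape is hypothetical; the actual IQ4_K layout differs.

```cpp
#include <cstdint>

// Illustrative non-uniform 16-entry codebook: denser near zero,
// sparser in the tails (not the actual IQ4_K tables).
static const int8_t kvalues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

// Dequantize one block of 32 weights: 16 bytes of packed 4-bit indices
// plus a block scale; low nibble first, then high nibble.
static void dequant_block(const uint8_t * qs, float scale, float * out) {
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = scale * kvalues[qs[i] & 0xf];
        out[2*i + 1] = scale * kvalues[qs[i] >> 4];
    }
}
```

The SIMD implementations above presumably do this table lookup with byte shuffles rather than scalar indexing, which is what the "4 shuffles" remark in the iq6_k commit earlier in this log refers to for the larger 6-bit codebook.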
Kawrakow
154e0d75fc Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00