Commit Graph

40 Commits

Author SHA1 Message Date
Iwan Kawrakow
5f3f3bb09e iqk_mul_mat: better srategy when nrc_y not divisible by ny 2024-10-01 08:12:29 +03:00
Kawrakow
6dec4af4b6 Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC how to work around opinionated ggml to
have scales per row rather than per block.
Only implemened for Zen4 and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
Kawrakow
2874b98400 iqk_mul_mat(ARM_NEON): adding bf16 support (#41)
It looks like ArmV8 ISA has support for bf16, but my M2 Max
does not have it, so resorting to bf16 -> f32 conversion and
computations in f32. This is 2x slower than f16, but 8x better
compared to what I get if I try to run a bf16 model on the M2
(NEON and Metal).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-16 16:47:36 +03:00
Iwan Kawrakow
20f3e6fd2d Minor 2024-09-15 12:59:14 +03:00
Kawrakow
76be98fdec Improve Q5_0 performance (#55)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 19:47:26 +03:00
Kawrakow
064b99365c Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 13:53:50 +03:00
Iwan Kawrakow
ec1cbc8884 Minor 2024-09-13 15:46:36 +03:00
Kawrakow
f853f6c6a5 Fix bug and D < 128 case for Q8_0 k-cache (#52)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-13 07:19:47 +03:00
Kawrakow
5017f8b3f0 Quantized Flash Attention for all supported CPU platforms (#51)
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1

* NEON Flash Attention: quantized K*Q for q4_0

I could finally take advantage of the matrix multiplication
templates. We get quite a bit of speedup that way for q4_0:
For Gemma-2b using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
results in PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 k-cache and Q to fp16 and using fp16 multiplication.

* NEON Flash Attention: quantized K*Q for q4_1

* NEON Flash Attention: quantized K*Q for q8_0

This makes quite a bit of difference:
For Gemma2-2b PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting things to fp16 and using fp16
matrix multiplication.
We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the
performance of PP-512. In contrast, llama.cpp with Q8_0
cache is 38% of PP-512.

* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* Tidy up FlashMS

* Delete no longer used stuff

With the usage of quantized matrix multiplications for
quantized k- and/or v-cache, we no longer need the
helper methods loading entire rows.

* Disallow mixing bf16 with other types for kv caches

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-12 19:03:20 +03:00
Kawrakow
c920195edd AVX2 Flash Attention 2 (#50)
* AVX2 Flash Attention: add ability to use Q8_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_1 for kv-cache

* Fix Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-11 19:55:42 +03:00
Kawrakow
d98a6753a6 ARM_NEON Flash Attention (#49)
* NEON Flash Attention - first working version

Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations for expf (and tanh when
softcap is enabled), so we need to convert back-and-fort
to f32.

FA is slightly faster than no-FA for the 4B TriLM model,
but lightly slower for Gemma-2b.

* NEON Flash Attention - convert Q to f16 before computing Q*K

* NEON Flash Attention - use fp32 for K*Q operations

Else I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).

* Delete commented out stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-11 10:26:49 +03:00
Kawrakow
72f5dfe12a AVX2 Flash Attention (#48)
* First version of AVX2 Flash attention

I simply took the Zen4 implementation and converted
platform specific stuff to methods of a struct providing
data loading/storing, conversions, multiply, add, etc.

Most likely not optimal as the Zen4 strategy has been
designed based on having 32 512-bit registers, so basically
we can have 4X more data stored in vector registers compared
to AVX2 with 16 x 256-bit.

It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zenn4 parts broken via the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-10 19:17:04 +03:00
Kawrakow
d17d0c4426 iq2_tn: slightly better performance on AVX2 (#47)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-10 16:21:57 +03:00
Kawrakow
8c86231f93 Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models

* iq1_tn: NEON

* iq1_tn: faster NEON

* iq2_bn: improve performance on NEON

We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!

* iq1_tn: improve AVX2

PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq1_tn: improve Zen4

PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq2_bn: improve on Zen4

We now get PP-512 = 614 t/s up from 542 t/s

* iq2_bn: improve AVX2 implementation

We now get PP-512 = 753 t/s up from 680 t/s.

* Remove unnecessary barrier in ggml_compute_forward_mul_mat

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 14:56:34 +03:00
Kawrakow
bf4b19b474 iq2_tn: slightly faster PP (#43)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-08 12:41:44 +03:00
Kawrakow
0087008d29 Add support for bf16 to iqk_mul_mat (#39)
* WIP: adding BF16 support to iqk_mul_mat

* Minor

* Improve TG speed (when not memory bound)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:48:27 +03:00
Kawrakow
7b1b2b2c06 Zen4 Flash Attention - bf16 support (#38)
* Zen4 Flash Attnetion: WIP bf16

* Zen4 Flash Attnetion: bf16 seems to be working

* Zen4 Flash Attnetion: improving bf16

* Zen4 Flash Attnetion: improving bf16

It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, so at
most 4 kb for the head sizes we support, so we can
just allocate on the stack instead of reserving and
passing a work buffer in ggml.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:46:47 +03:00
Kawrakow
f17d0d72f5 Performance improvements for legacy quants on ARM_NEON (#37)
* WIP: trying to improve legacy quants

* WIP: trying to improve legacy quants

With this commit PP-512 for LlaMA-3.1-8B goes from
72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s
for q4_1, so 20+% improvement for both.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:24:04 +03:00
Kawrakow
8c94dcd433 Zen4 Flash Attnetion 2 (#36)
* Zen4 Flash Attnetion: WIP generalize to other types

Now loading of data from K and V is done via a template parameter,
so this should make it easy to generalize to typ[es other than
F16 for the K and V cache.

* Zen4 Flash Attnetion: it works for q4_0 and q8_0

* Zen4 Flash Attnetion: small q8_0 performance improvement

* Zen4 Flash Attnetion: add q4_1

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:20:55 +03:00
Kawrakow
9b53c2533f Fix Zen4 Flash Attention (#35)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-02 15:54:24 +03:00
Kawrakow
dc023bc3be Zen4 Flash Attention (#32)
* Zen4 flash attention: moving useful parts from the kq_fused_softmax branch

* Add flash attention with soft-cap and fix D = 256 case

* Flash attention refinements

* Update FlashAttn comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-01 16:08:21 +03:00
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c77dba5273 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
48c4389e3d iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
595d2ae32d iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
849476acc7 iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiples
with 16-bit block scales. But this can overfow for iq4_k, iq5_k, and
iq6_k. I guess, I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bit for 4-,5-,6-bit iqX_k
quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-09 16:00:31 +02:00
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B with any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn ant 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn.
Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Iwan Kawrakow
4c2c644dcc iq3_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
93d09d1935 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
9d0cf7a399 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
bd36ade98d iq5_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0d0607f19 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c56ddee38c iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f476ea3b50 iq2_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0fe03b5c8 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
7d08719975 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
13091d39e8 iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
291066e6df IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00
Kawrakow
154e0d75fc Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00