* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1
* NEON Flash Attention: quantized K*Q for q4_0
I could finally take advantage of the matrix multiplication
templates. We get quite a bit of speedup that way for q4_0:
For Gemma-2b using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
results in PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 k-cache and Q to fp16 and using fp16 multiplication.
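
A minimal scalar sketch of what a quantized K*Q product can look like, assuming Q is first quantized to 8 bits per 32-value block and then multiplied with a q4_0-style K block using integer arithmetic. The struct and function names are hypothetical, and the scales are kept as float (ggml stores them as fp16) to keep the sketch portable. It does not reproduce the mul_mat_qX_0_q8_0 template itself.

```cpp
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // values per block, as in ggml's q4_0 / q8_0

// Simplified stand-ins for the ggml block types (float scale instead of fp16).
struct BlockQ4 { float d; std::uint8_t qs[kBlock / 2]; };  // two 4-bit values per byte
struct BlockQ8 { float d; std::int8_t  qs[kBlock];     };

// Quantize 32 floats to 8 bits with one scale per block (e.g. a piece of a Q row).
inline BlockQ8 quantize_q8(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float id = b.d > 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < kBlock; ++i) {
        b.qs[i] = static_cast<std::int8_t>(std::lround(x[i] * id));
    }
    return b;
}

// Integer dot product of one q4_0-style block with one q8_0-style block.
// The two block scales are applied once, after the integer sum.
inline float dot_q4_q8(const BlockQ4& k, const BlockQ8& q) {
    std::int32_t sum = 0;
    for (int i = 0; i < kBlock / 2; ++i) {
        const int lo = (k.qs[i] & 0x0F) - 8;  // element i
        const int hi = (k.qs[i] >> 4)   - 8;  // element i + 16
        sum += lo * q.qs[i] + hi * q.qs[i + kBlock / 2];
    }
    return k.d * q.d * static_cast<float>(sum);
}
```

The point of the sketch is that the inner loop never touches floating point, which is what avoids converting the k-cache and Q to fp16.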
* NEON Flash Attention: quantized K*Q for q4_1
* NEON Flash Attention: quantized K*Q for q8_0
This makes quite a bit of difference:
For Gemma2-2b PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting things to fp16 and using fp16
matrix multiplication.
We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the
performance of PP-512. In contrast, llama.cpp with Q8_0
cache is 38% of PP-512.
* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* Tidy up FlashMS
* Delete no longer used stuff
With the usage of quantized matrix multiplications for
quantized k- and/or v-cache, we no longer need the
helper methods loading entire rows.
* Disallow mixing bf16 with other types for kv caches
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* AVX2 Flash Attention: add ability to use Q8_0 for kv-cache
* AVX2 Flash Attention: add ability to use Q4_0 for kv-cache
* AVX2 Flash Attention: add ability to use Q4_1 for kv-cache
* Fix Zen4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* NEON Flash Attention - first working version
Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations for expf (and tanh when
softcap is enabled), so we need to convert back-and-forth
to f32.
FA is slightly faster than no-FA for the 4B TriLM model,
but slightly slower for Gemma-2b.
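
For illustration, a scalar f32 sketch of a streaming softmax that keeps a running maximum M and running sum S, which is the part that needs expf (and tanh with softcap). The struct name and interface are hypothetical, and the softcap form `softcap * tanh(s / softcap)` is an assumption about how the cap is applied. The actual FlashMS code may be organized differently.

```cpp
#include <cmath>
#include <vector>

// Running max (M) and sum (S) of a softmax computed chunk by chunk for one query row.
struct RunningSoftmax {
    float M = -INFINITY;
    float S = 0.0f;

    // Turns a chunk of raw K*Q scores into exp(score - M) in place and
    // returns the factor by which previously accumulated V*softmax(K*Q)
    // sums have to be rescaled because the maximum may have grown.
    float update(std::vector<float>& scores, float scale, float softcap) {
        float chunk_max = -INFINITY;
        for (float& s : scores) {
            s *= scale;
            if (softcap > 0.0f) s = softcap * std::tanh(s / softcap);  // f32 tanh
            chunk_max = std::fmax(chunk_max, s);
        }
        const float new_M = std::fmax(M, chunk_max);
        // Nothing has been accumulated yet while M is still -inf.
        const float rescale = std::isinf(M) ? 0.0f : std::exp(M - new_M);
        S *= rescale;
        for (float& s : scores) {
            s = std::exp(s - new_M);  // f32 expf
            S += s;
        }
        M = new_M;
        return rescale;
    }
};
```

With f16 accumulators, the scores would have to be widened to f32 before this step and narrowed again afterwards.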
* NEON Flash Attention - convert Q to f16 before computing Q*K
* NEON Flash Attention - use fp32 for K*Q operations
Else I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).
* Delete commented out stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* First version of AVX2 Flash attention
I simply took the Zen4 implementation and converted
platform specific stuff to methods of a struct providing
data loading/storing, conversions, multiply, add, etc.
Most likely not optimal as the Zen4 strategy has been
designed based on having 32 512-bit registers, so basically
we can have 4X more data stored in vector registers compared
to AVX2 with 16 x 256-bit.
It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.
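
A rough sketch of the pattern described above, with a portable scalar struct standing in for the platform-specific one. `ScalarOps` and `dot` are hypothetical names. A real AVX2 or Zen4 variant would wrap the corresponding intrinsics behind the same interface.

```cpp
#include <array>
#include <cstddef>

// Hypothetical "ops" struct: one per target, all exposing the same interface.
template <std::size_t Lanes>
struct ScalarOps {
    using Vec = std::array<float, Lanes>;
    static constexpr std::size_t lanes = Lanes;

    static Vec zero() { return Vec{}; }
    static Vec load(const float* p) {
        Vec v{};
        for (std::size_t i = 0; i < Lanes; ++i) v[i] = p[i];
        return v;
    }
    // Returns a * b + c, lane by lane.
    static Vec fmadd(const Vec& a, const Vec& b, const Vec& c) {
        Vec r{};
        for (std::size_t i = 0; i < Lanes; ++i) r[i] = a[i] * b[i] + c[i];
        return r;
    }
    static float reduce_add(const Vec& v) {
        float s = 0.0f;
        for (std::size_t i = 0; i < Lanes; ++i) s += v[i];
        return s;
    }
};

// Kernel written once against the interface and instantiated per target.
template <typename Ops>
float dot(const float* x, const float* y, std::size_t n) {
    auto acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::lanes <= n; i += Ops::lanes) {
        acc = Ops::fmadd(Ops::load(x + i), Ops::load(y + i), acc);
    }
    float sum = Ops::reduce_add(acc);
    for (; i < n; ++i) sum += x[i] * y[i];  // scalar tail
    return sum;
}

// Register-file capacity behind the "4X" remark: 32 x 512 bits vs 16 x 256 bits.
static_assert(32 * 512 == 4 * (16 * 256), "Zen4 holds 4x more vector data than AVX2");
```

The kernel logic stays the same across targets, but the tiling chosen for 32 wide registers is not necessarily the best fit for 16 narrower ones.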
* Fix Zen4 parts broken by the AVX2 change
* Try smaller q_step - no improvement
* Fix ARM_NEON
I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models
* iq1_tn: NEON
* iq1_tn: faster NEON
* iq2_bn: improve performance on NEON
We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!
* iq1_tn: improve AVX2
PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq1_tn: improve Zen4
PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq2_bn: improve on Zen4
We now get PP-512 = 614 t/s up from 542 t/s
* iq2_bn: improve AVX2 implementation
We now get PP-512 = 753 t/s up from 680 t/s.
* Remove unnecessary barrier in ggml_compute_forward_mul_mat
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Zen4 Flash Attention: WIP bf16
* Zen4 Flash Attention: bf16 seems to be working
* Zen4 Flash Attention: improving bf16
* Zen4 Flash Attention: improving bf16
It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, so at
most 4 KiB for the head sizes we support, so we can
just allocate on the stack instead of reserving and
passing a work buffer in ggml.
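
The size argument can be checked with a small sketch. The values D = 256 and q_step = 8 are assumptions chosen to match the 4096-byte figure, and `bf16_bits`, `to_bf16` and `with_bf16_q` are hypothetical helpers, not the names used in the code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using bf16_bits = std::uint16_t;  // raw bf16 bit pattern (stand-in type)

constexpr std::size_t q_buffer_bytes(std::size_t D, std::size_t q_step) {
    return D * q_step * sizeof(bf16_bits);
}
static_assert(q_buffer_bytes(256, 8) == 4096, "assumed largest case needs 4096 bytes");

// float -> bf16 with round-to-nearest-even (NaN handling omitted in this sketch).
inline bf16_bits to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<bf16_bits>(bits >> 16);
}

// Converts one block of q_step contiguous Q rows into a stack buffer and hands
// it to the kernel, so no work buffer has to be reserved and passed in.
template <std::size_t D, std::size_t QStep, typename Kernel>
void with_bf16_q(const float* q, Kernel&& kernel) {
    bf16_bits buf[D * QStep];  // at most 4096 bytes under the assumptions above
    for (std::size_t i = 0; i < D * QStep; ++i) buf[i] = to_bf16(q[i]);
    kernel(buf);
}
```

A buffer of this size is small enough for the stack on typical platforms, which is why no separate allocation is needed.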
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: trying to improve legacy quants
* WIP: trying to improve legacy quants
With this commit PP-512 for LLaMA-3.1-8B goes from
72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s
for q4_1, so 20+% improvement for both.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Zen4 Flash Attention: WIP generalize to other types
Now loading of data from K and V is done via a template parameter,
so this should make it easy to generalize to types other than
F16 for the K and V cache.
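
A sketch of the idea, assuming hypothetical helper types `HelperF32` and `HelperQ8` that each know how to load one element of a cache row. The 8-bit helper uses a single scale per row, which is a simplification of the block-wise quantized formats.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers: each hides how a cache row is stored.
struct HelperF32 {
    const float* data;
    float load(std::size_t i) const { return data[i]; }
};

struct HelperQ8 {
    const std::int8_t* q;
    float d;  // one scale for the whole row (simplification)
    float load(std::size_t i) const { return d * static_cast<float>(q[i]); }
};

// The kernel only sees the helper's interface, so supporting another cache
// type means adding another helper rather than another kernel.
template <typename KHelper>
float k_dot_q(const KHelper& k, const float* q, std::size_t D) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < D; ++i) sum += k.load(i) * q[i];
    return sum;
}
```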
* Zen4 Flash Attention: it works for q4_0 and q8_0
* Zen4 Flash Attention: small q8_0 performance improvement
* Zen4 Flash Attention: add q4_1
* Delete unused stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.
My last commit message is wrong: also iq3_k needs a fix
for overflow.
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.
And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bit for 4-,5-,6-bit iqX_k
quants.
For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
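
To illustrate why the 16-bit packing is a problem, here is a scalar sketch of the worst-case magnitude of one 32-bit lane. _mm512_dpbusds_epi32 accumulates four 8-bit products per lane. The magnitude limits passed in below are illustrative assumptions rather than the actual value ranges of the iqX_k quants, and the saturating pack is likewise an assumption about how the narrowing is done.

```cpp
#include <cstdint>

// Worst-case magnitude of a sum of n products with |w| <= w_max and |a| <= a_max.
constexpr std::int32_t worst_case_sum(int n, int w_max, int a_max) {
    return n * w_max * a_max;
}

constexpr bool fits_int16(std::int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}

// Scalar model of a saturating 32 -> 16 bit pack: out-of-range sums are
// clamped, so the result silently loses accuracy.
constexpr std::int16_t pack_saturate(std::int32_t v) {
    return static_cast<std::int16_t>(v > INT16_MAX ? INT16_MAX
                                   : v < INT16_MIN ? INT16_MIN
                                   : v);
}

// Small quant values leave plenty of headroom in 16 bits ...
static_assert(fits_int16(worst_case_sum(4, 3, 127)), "small values fit");
// ... but values that use most of the 8-bit range do not.
static_assert(!fits_int16(worst_case_sum(4, 127, 127)), "large values overflow");
```

Keeping the sums in 32 bits avoids the problem at the cost of extra work, which is consistent with the slowdown reported for iq6_k.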
* iq2_tn: TriLM specific 2.0625 bpw quantization
Quantize/dequantize/scale dot product.
I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.
* iq2_tn: AVX512
Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.
* iq2_tn: AVX512
With this tweak we get to PP-512 = 431 t/s.
* iq2_tn: AVX512
With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.
* iq2_tn: AVX2
PP-512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.
* iq2_tn: initial NEON version
* iq2_tn: NEON
For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and 3.3B Bitnet.
* iq2_tn: Metal
For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.
* iq2_tn: CUDA
For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.
* iq2_tn: AVX2 PP improvement
We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.
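
The estimate above assumes that PP throughput scales inversely with parameter count. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
// Expected tokens/s of a target model, scaled from a reference model under
// the assumption that throughput is inversely proportional to parameter count.
constexpr double expected_tps(double ref_tps, double ref_params_b, double target_params_b) {
    return ref_tps * ref_params_b / target_params_b;
}

// Fraction of the expected throughput that was actually measured.
constexpr double achieved_fraction(double measured_tps, double expected) {
    return measured_tps / expected;
}
```

With the figures quoted above (636.61 t/s, 3.43B, 3.99B and a measured 490.73 t/s), the measured value comes out at roughly 90% of the scaled expectation.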
* iq2_tn: small NEON improvement
For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
We get PP-512 = 180 t/s, TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975WX.
* iq4_k: NEON implementation
For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>