Commit Graph

3528 Commits

Author SHA1 Message Date
Kawrakow
c19404bcda MoE fix for R4 quants (#170)
* Fix bug in iqk_mul_mat

I recently added the possibility to have a matrix multiplication
kernel that processes 16 columns in the right matrix per iteration.
This introduced a bug that shows up when batch size is greater
than 16, is not a multiple of 16, and the remainder is not a multiple
of the maximum columns being processed by the regular kernels
(and so, never showed up in my testing using TG-128 and PP-512).

This commit fixes the issue.

* Make sure rows per thread is a multiple of 4 also for MoE when using _r4 quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12 13:19:14 +02:00
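The batch-size condition described in #170 above is easier to see with a small model of how the columns of the right matrix might be split between kernels. The sketch below is not the repository's dispatcher: `tile_columns`, `wide` and `max_regular` are hypothetical names, and the value 8 for the regular kernels is an assumption made for illustration.

```python
def tile_columns(n_cols, wide=16, max_regular=8):
    """Split n_cols right-matrix columns into per-call tile widths.

    `wide` and `max_regular` are illustrative assumptions, not values
    read from the repository.
    """
    tiles = []
    done = 0
    # Wide kernel: as many full groups of `wide` columns as fit.
    while n_cols - done >= wide:
        tiles.append(wide)
        done += wide
    # Regular kernels: full groups of `max_regular` columns.
    while n_cols - done >= max_regular:
        tiles.append(max_regular)
        done += max_regular
    # Tail: fewer than `max_regular` columns remain; they go to the
    # kernel specialised for exactly that many columns.
    if done < n_cols:
        tiles.append(n_cols - done)
    return tiles


print(tile_columns(512))  # PP-512: only full tiles of 16
print(tile_columns(1))    # TG-128: a single column per step
print(tile_columns(21))   # 16 plus a tail of 5: the case the fix is about
```

A batch of 512 splits into full tiles of 16 and token generation uses a single column, so neither benchmark reaches the tail-after-wide-tile path, which matches the commit's remark that the bug never showed up in testing.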
Kawrakow
7553989dd8 Be able to re-quantize MS BitNet I2_S models (#169)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10 18:18:04 +02:00
Kawrakow
b1363b6177 Falcon3 changes (#168)
* Add Falcon3 pre-tokenizer (same as llama3)

* q8_k16: use integer arithmetic to sum row values

The existing implementation that just sums up the f32 quantizations
works fine for the original BitNet models and also for the TriLM
ternary models. But for Falcon3 I see a significant difference between
the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum
up the values in a row, then the CPU-GPU PPL difference becomes much
smaller, and we get a lower PPL than Microsoft BitNet, which claims
to be "lossless".

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10 15:06:00 +02:00
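One possible reading of the q8_K16 change in #168 above: when ternary weights are stored with an offset, the dot product needs a row-sum correction, and that correction only cancels exactly if it is computed from the same int8 quants that enter the dot product. The sketch below is a toy model under that assumption. The storage scheme `u = w + 1`, the helper names and the sample values are all hypothetical and are not taken from the repository.

```python
def quantize_row(x):
    """Symmetric int8 quantization of one row: x[i] ~ d * q[i]."""
    amax = max(abs(v) for v in x)
    d = amax / 127 if amax > 0 else 1.0
    q = [round(v / d) for v in x]
    return d, q


def dot_ternary(w, d, q, row_sum):
    """dot(w, x) for ternary w in {-1, 0, 1}, stored as unsigned u = w + 1.

    Since sum(u * q) = sum(w * q) + sum(q), the row sum has to be
    subtracted again after scaling.
    """
    u = [wi + 1 for wi in w]
    return d * sum(ui * qi for ui, qi in zip(u, q)) - row_sum


x = [0.31, -1.0, 0.05, 0.88, -0.42, 0.66, -0.09, 0.73]  # made-up activations
w = [1, -1, 0, 1, -1, 1, 0, -1]                          # made-up ternary weights
d, q = quantize_row(x)

reference = d * sum(wi * qi for wi, qi in zip(w, q))  # what the quants imply
from_floats = dot_ternary(w, d, q, sum(x))            # row sum from the f32 input
from_ints = dot_ternary(w, d, q, d * sum(q))          # row sum from the int8 quants

print(reference, from_floats, from_ints)
```

In this toy model the integer row sum reproduces the reference up to floating-point rounding, while the float row sum carries the quantization error of the whole row into the result.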
Kawrakow
3e6851621c iq4_0_r4: Use AVX2 version for matrix x vector (#163)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 17:34:08 +01:00
Kawrakow
167479e027 IQ3_S_R4 (#162)
* iq3_s_r4: WIP

* iq3_s_r4: Zen4

* iq3_s_r4: slightly better Zen4

* iq3_s_r4: AVX2

* iq3_s_r4: NEON

* iq3_s_r4: rearrange quants

* iq3_s_r4: rearranged quants - AVX2

* iq3_s_r4: rearranged quants - NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 14:34:23 +01:00
Kawrakow
1a0a35dcd1 MSVC fixes (#161)
Closes #160 

* MSVC fixes

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 07:57:48 +01:00
Kawrakow
3d732cb010 Faster R4 legacy quants (#158)
* q4_0_r4(avx2): convert q8_1 scales with SIMD intrinsics

PP-512 goes to 283 t/s from 265 t/s

* qx_0_r4(AVX2): convert scales with SIMD intrinsics

Also fix q8_0_r4 to not overflow.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22 12:00:22 +01:00
Kawrakow
907cde6be2 R4 i-quants improvements (#157)
* Add nrc_y = 16 implementation.

Here just iq2_s on Zen4. PP-512 goes up to 169.5 t/s from
148.5 t/s. As we are sure that we will be multiplying with 16
columns, we can spend the time to add the mins and make the
iq2_s quants unsigned.

* nrc_y = 16: AVX2 iq2_s

We go from 176.8 to 203.3 t/s.

* nrc_y = 16: NEON iq2_s

We go from 50.4 to 62.3 t/s.
We didn't need to do anything other than to set func16 to
mul_mat_iq2_s_r4_q8_k<16>. Even though we absolutely don't have
so many vector registers for all accumulators, unpacking and preparing
the iq2_s quants is so expensive that we still gain ~23% in performance
by reusing the unpacked quants 16 times instead of just 8, despite
having to load/unload the accumulated results to/from the
available vector registers.

* nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 76.34 -> 85.33 t/s
iq2_xs:  54.13 -> 67.99 t/s
iq3_xxs: 67.45 -> 73.56 t/s

* nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 195.7 -> 221.8 t/s
iq2_xs : 192.6 -> 220.6 t/s
iq3_xxs: 184.4 -> 206.9 t/s

* r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22 10:52:56 +01:00
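The argument in #157 above is that unpacking the quants dominates the cost, so reusing each unpacked block for 16 columns instead of 8 pays off even when the accumulators no longer fit in registers. A minimal counting sketch makes the halving explicit. `count_unpacks` and its parameters are hypothetical, and the loop only counts work rather than performing it.

```python
def count_unpacks(n_weight_blocks, n_cols, nrc_y):
    """Count how often a block of quants is unpacked when each unpacked
    block is reused for up to nrc_y columns of the right matrix."""
    unpacks = 0
    for _ in range(n_weight_blocks):
        for first_col in range(0, n_cols, nrc_y):
            # Unpack once, then reuse for columns
            # first_col .. min(first_col + nrc_y, n_cols) - 1.
            unpacks += 1
    return unpacks


print(count_unpacks(1, 512, 8))   # unpacks per weight block with nrc_y = 8
print(count_unpacks(1, 512, 16))  # unpacks per weight block with nrc_y = 16
```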
Kawrakow
93419de68f IQ2_S_R4 (#156)
* iq2_s_r4: Zen4

* Minor

* iq2_s_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 11:26:35 +01:00
Kawrakow
a867b919ca IQ2_XS_R4 (#155)
* iq2_xs_r4: Zen4

* iq2_xs_r4: AVX2

* iq2_xs_r4: slightly better matrix x vector on AVX2

* iq2_xs_r4: NEON - not much better than iq2_xs

* iq2_xs_r4: slightly better NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 08:32:39 +01:00
Kawrakow
4b53bc876e IQ2_XXS_R4 (#154)
* iq2_xxs_r4: Zen4

Disappointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread

* Minor

* iq2_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 12:02:42 +01:00
Nexes the Elder
7254f6d340 fix typo (#151)
2024-12-20 12:02:15 +01:00
Kawrakow
f4170f78bd IQ3_XXS_R4 (#153)
* iq3_xxs_r4: 1st shot on Zen4

PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s

* iq3_xxs_r4: WIP

* iq3_xxs_r4: 1st shot at AVX2

Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.

* iq3_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 09:12:48 +01:00
Kawrakow
dfa12b7f91 IQ4_KS_R4 (#150)
* iq4_ks_r4: Zen4

* iq4_ks_r4: AVX2

* iq4_ks_r4: WIP

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: NEON

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 19:58:21 +01:00
Kawrakow
59d742b00f IQ5_K_R4 (#149)
* iq5_k_r4: Zen4

Much slower than the others.

* iq5_k_r4: WIP

* Minor

* iq5_k_r4: fix AVX2 nrc_y = 1 case

* iq5_k_r4: better Zen4

But TG is still slower than iq5_k

* iq5_k_r4: slightly better AVX2

* iq5_k_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 13:29:25 +01:00
Kawrakow
9b6d14a299 Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4 (#148)
* Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4

More importantly: simplify.

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 18:55:38 +01:00
Kawrakow
514ae08620 Be able to repack tensors at run time (#147)
* Be able to repack tensors at run time

* Repack: also add bf16 as repackable type

* Repack: make sure number of rows is a multiple of the packing

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 14:16:34 +01:00
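The row-count check mentioned in #147 above follows directly from what interleaving means. The sketch below shows one possible interleaved layout, assuming rows are combined block by block in groups of `n_interleave`. The function name, the block size and the exact ordering are illustrative assumptions, and the layouts used by the actual `_r4`, `_r8` and `_r16` types may differ.

```python
def repack_rows(rows, n_interleave=4, block=4):
    """Interleave groups of n_interleave rows, block by block.

    Hypothetical layout: block 0 of every row in the group, then
    block 1 of every row in the group, and so on.
    """
    n_rows = len(rows)
    if n_rows % n_interleave != 0:
        raise ValueError("number of rows must be a multiple of the packing")
    row_len = len(rows[0])
    if row_len % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    packed = []
    for r0 in range(0, n_rows, n_interleave):
        group = rows[r0:r0 + n_interleave]
        out = []
        for b0 in range(0, row_len, block):
            for row in group:
                out.extend(row[b0:b0 + block])
        packed.append(out)
    return packed


rows = [[0, 1, 2, 3], [10, 11, 12, 13]]
print(repack_rows(rows, n_interleave=2, block=2))
```

With a row count that is not a multiple of the packing there is no complete group to interleave, which is why a run-time repack has to check for it first.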
Kawrakow
4ade4c568c IQ2_K_R4 (#146)
* iq2_k_r4: Zen4

* iq2_k_r4: NEON

* iq2_k_r4: better matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 10:18:33 +01:00
Kawrakow
d69344f8ea IQ3_K_R4 (#145)
* iq3_k_r4 WIP

* iq3_k_r4: Zen4

* iq3_k_r4: AVX2

* iq3_k_r4: NEON

* iq3_k_r4: faster matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 07:51:11 +01:00
Kawrakow
1714e46f13 Slightly faster IQ4_K_R4 on AVX2/Zen4 (#144)
* iq4_k_r4: slightly better AVX2

227 t/s -> 249 t/s

* iq4_k_r4: slightly better Zen4

232 t/s -> 251 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16 16:09:15 +01:00
Kawrakow
84ab873385 Slightly faster IQ4_XS_R4 on AVX2 (#143)
* iq4_xs_r4: slightly faster and correct AVX2 implementation

* Minor

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16 14:40:10 +01:00
Iwan Kawrakow
987a43c531 q8_k_r8: this change for NEON got lost?
2024-12-16 13:31:21 +01:00
Kawrakow
85c5a1a995 BF16_R16 - 16 interleaved bf16 rows (#142)
* Not working bf16_r4

* Adding bf16_r8

Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess, this is still sub-optimal.

* bf16_rx: Very slightly faster by interleaving 16 rows

258 t/s -> 263 t/s

* Rename bf16_r4 to bf16_r16

We are interleaving 16 rows now.

* Cleanup unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-15 09:54:21 +01:00
Kawrakow
20758edcae Q8_K_R8: Fastest quantized matrix multiplications (#141)
* q8_k_r8: fastest matrix multiplication known to human kind

We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!

* q8_k_r8: AVX2

I was worried that we don't have enough vector registers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.

* q8_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.

* q8_k_r4: go to signed ints

Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8.
  Not yet tested on the AVX2 comp, but expect major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorq_u8()
  needed to convert from unsigned to signed seems to be extremely
  slow on the M2-Max
* We only lose ~0.5% in performance on Zen4 (there the exclusive
  or that we now use to convert from signed to unsigned seems to be
  much faster than on M2-Max)

* Shut up useless compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-14 09:24:30 +01:00
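The overflow concern raised in #141 above can be reproduced with a scalar model of the two intrinsics. `_mm256_maddubs_epi16` multiplies unsigned bytes by signed bytes and adds adjacent products with signed 16-bit saturation, so two products of 255 * 127 do not fit. The sketch below models a single 16-bit lane in plain Python. It is not the repository's kernel, and the helper names are hypothetical.

```python
def sat16(v):
    """Saturate to the signed 16-bit range."""
    return max(-32768, min(32767, v))


def maddubs_pair(a0, a1, b0, b1):
    """Scalar model of one 16-bit lane of _mm256_maddubs_epi16:
    a0, a1 are unsigned bytes, b0, b1 are signed bytes."""
    return sat16(a0 * b0 + a1 * b1)


def sign8(a, b):
    """Scalar model of one byte of _mm256_sign_epi8(a, b):
    a negated, zeroed or kept according to the sign of b."""
    if b < 0:
        return -a
    if b == 0:
        return 0
    return a


def dot_pair_signed(x0, x1, y0, y1):
    """x*y pair sum for signed inputs in [-127, 127]: |x| is used as the
    unsigned operand and y carries the sign of x."""
    return maddubs_pair(sign8(x0, x0), sign8(x1, x1),
                        sign8(y0, x0), sign8(y1, x1))


print(maddubs_pair(255, 255, 127, 127))      # saturates; true sum is 64770
print(dot_pair_signed(-127, -127, 127, 127))  # exact
```

With both operands limited to [-127, 127] the pair sum never exceeds 2 * 127 * 127 = 32258 in magnitude, which stays inside the signed 16-bit range.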
Kawrakow
12f962dd24 Faster R4 quants on Zen4 (#139)
* q3_k_r4: faster Zen4

* q3_k_r4: faster Zen4

256.2 -> 272.7 t/s for PP-512

* q6_k_r4: faster Zen4

243.2 -> 261.3 t/s for PP-512

* q4_k_r4: slightly faster Zen4

262.4 t/s -> 268.1 t/s

* q5_k_r4: slightly faster Zen4

248.3 t/s -> 256.7 t/s

* iq4_xs_r4: slightly faster Zen4

256.8 t/s -> 272.0 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-13 15:47:59 +01:00
Iwan Kawrakow
36efbfb132 Another fix
2024-12-13 10:14:53 +02:00
Iwan Kawrakow
ff425a3572 Adding lost q4_k_r4 case
Not sure how it got lost.
2024-12-13 10:09:04 +02:00
Kawrakow
2700d3af36 IQ4_K_R4 (#138)
* iq4_k_r4: WIP

* iq4_k_r4: Zen4 and hopefully AVX2

On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6 performance penalty.

* iq4_k_r4: AVX2

PP-512 = 227.60 t/s. The shifts are really costly.

* iq4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-12 16:04:20 +01:00
Kawrakow
aecc95c0ca Fix AVX2 implementation of iq4_nl_r4 (#137)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 18:55:21 +01:00
Kawrakow
8c6b84220d Q2_K_R4 (#136)
* q2_k_r4: Zen4

PP-512(LLaMA-3.1-8B) = 256 t/s

* q3_k_r4: AVX2

* q2_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 287 t/s.

Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow
forgot to push upstream.

* q2_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.2 t/s.
TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.

* Make sure rows per thread are a multiple of 4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 18:16:49 +01:00
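The last item of #136 above ("Make sure rows per thread are a multiple of 4") amounts to splitting the work in units of whole interleaved row groups. The sketch below shows one way to do that. `split_rows` is a hypothetical helper, and the actual scheduling in the repository may differ.

```python
def split_rows(n_rows, n_threads, n_interleave=4):
    """Return one (first_row, row_count) pair per thread, with every
    row_count a multiple of n_interleave."""
    if n_rows % n_interleave != 0:
        raise ValueError("n_rows must be a multiple of n_interleave")
    n_groups = n_rows // n_interleave
    base, extra = divmod(n_groups, n_threads)
    ranges = []
    first = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional group each.
        groups = base + (1 if t < extra else 0)
        ranges.append((first, groups * n_interleave))
        first += groups * n_interleave
    return ranges


print(split_rows(40, 3))     # uneven split, still in multiples of 4
print(split_rows(4096, 32))  # even split
```

Distributing groups rather than individual rows means no thread ever starts in the middle of an interleaved group, whatever the thread count.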
Kawrakow
9469af87f7 Better ARM_NEON implementation for R4 quants (#135)
* q6_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 104.2 t/s up from 83.2 t/s.
I.e., q6_k_r4 now beats q6_0_r4.

* q5_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 107.8 t/s up from 96.9 t/s.
I.e., q5_k_r4 now beats q5_0_r4.

* q4_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 122.1 t/s up from 110 t/s.
I.e., q4_k_r4 is now (nearly) on par with q4_0_r4.

* iq4_xs_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 131.3 t/s up from 115.8 t/s.
iq4_xs_r4 is now the prompt processing champion on ARM.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 14:20:27 +01:00
Kawrakow
e0adb8b122 Q3_K_R4 (#134)
* q3_k_r4: Zen4 works, but not as good as it should be

238 t/s, so slightly slower than q6_k_r4.

* q3_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.9 t/s.
This is 1.93X faster than q3_K_S!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 11:19:00 +01:00
Kawrakow
a63a96b5ae Q5_K_R4 (#132)
* q5_k_r4: WIP

* q5_k_r4: Zen4 and AVX2

We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4.
Q5_K_S has PP-512 = 190 t/s.

* q5_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 96.1 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 18:13:47 +01:00
Kawrakow
c819fa651b Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4 (#131)
* iq4_k_r4: slightly faster on Zen4

* iq4_xs_r4: very slightly faster Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 14:14:40 +01:00
Kawrakow
361174ee6a Q6_K_R4 (#130)
* Adding q6_k_r4

* q6_k_r4: 1st functional AVX2 version

* q6_k_r4: AVX2 and simple Zen4

"Simple" as in processing 4 instead of 8 rows at once.
On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs
195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s
vs 5.38 t/s for Q6_K.

* q6_k_r4: 1st NEON version

PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K.
TG-128 is slightly lower than q6_K for low number of threads,
becomes very slightly better at 8 threads.

* q6_k_r4: slightly faster NEON

PP-512(LLaMA-3.1-8B) = 83.25 t/s

* q6_k_r4: slightly faster Zen4

238.3 t/s -> 243.2 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 12:26:40 +01:00
Kawrakow
3ec193b485 Q4_K_R4 (#129)
* Something is still wrong

* Simply don't see what is wrong

* q4_k_r4: finally works on Zen4

I had forgotten to prevent token_embd.weight being quantized
with q4_k_r4!

* q4_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX.
This is ~30% better than Q4_K_S.

* q4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 110 t/s.
Not quite as good as q4_0_r4, but still a massive
improvement compared to the 69 t/s for q4_K.

* q4_k_r4: slightly better AVX2

PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX

* Minor

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-09 16:59:18 +01:00
Kawrakow
43e65a672a Faster IQ4_XS_R4 on Zen4 (#128)
* Faster iq4_xs_r4 on Zen4

The trick is to simply prepare the Q8 block sums for
blocks of 32 as floats. This brings PP-512 up to 254.6 t/s
from 224 t/s.

* Fix broken matrix x vector product on Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 15:27:13 +01:00
Kawrakow
fc701cedd1 Rename iq4_nl_x4 to iq4_nl_r4 (#126)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 09:34:42 +01:00
Kawrakow
ef95b81733 R4 improvements on ARM_NEON (#125)
* q4_0_r4: 6% faster PP on NEON

* qx_0_r4_q8_0 template

Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP
~7% faster.

* Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4

* Simplify

* Minor iq4_xs_r4 improvement on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 09:13:10 +01:00
Kawrakow
3682e4700d iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)
* Adding iq2_bn_r4

This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.

* Make sure rows per thread are a multiple of the number of interleaved rows

With this I can run iq2_bn_r4 with 32 threads and this increases
PP-512 to 872 t/s.

* iq2_bn_r4: 1st shot at NEON

PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.

* iq2_bn_r4: NEON

PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.

* iq2_bn_r4: Experimenting on NEON

The matrix x vector multiplication is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads
and not 93 as I wrote in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.

* Some cleanup

* iq2_bn_r4: AVX2

As expected, PP is slightly slower as we just don't have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).

* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector

It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.

* iq2_bn_r4: simdify q8_K16 quantization (AVX2)

PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.

* iq2_bn_r4: simdify q8_K16 quantization (NEON)

PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)

* iq2_bn_r4: fix AVX2 after breaking it two commits ago

* iq2_bn_r4: better AVX2

As we don't have enough vector registers on AVX2, it is better
to do two passes per row needing only half of the accumulator
registers that way.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06 12:15:39 +01:00
Kawrakow
f64de08203 IQ4_XS_R4 (#123)
* Adding iq4_xs_r4

This is a 1st working version on Zen4.
We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower
than iq4_nl_x4.

* iq4_xs_r4: WIP

* iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4

* iq4_xs_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max,
up from 68.2 t/s for iq4_xs!

* DRY

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04 15:20:07 +01:00
Kawrakow
f1f4eb988f Q6_0_R4 (#122)
* Adding q6_0_r4

We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X.

* q6_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max.
In terms of ops, q6_0_r4 is identical to q5_0_r4
except for loading the high bits being
vld1q_u8_x2 instead of vld1q_u8. It is strange that
this can make a 5% difference in performance, especially
considering that this is amortized (re-used) over 8 columns
in the right matrix. Or am I running out of vector registers?

* Fix AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 14:48:26 +01:00
Kawrakow
c5bf589367 Q5_0_R4 (#121)
* Adding q5_0_r4

We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X.
We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.

* q5_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are now able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 12:59:22 +01:00
Kawrakow
ccec00939a Q8_0_R4 (#120)
* Adding q8_0_r4

We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared
to 175.6 t/s for Q8_0.

* q8_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max.

* q8_0_r4: Zen4 matrix-vector specialization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 06:15:29 +01:00
Kawrakow
239a344f99 Q4_0_R4 (#119)
* Adding iq4_0_r4 - q4_0 repacked

We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU,
so ~5-6% faster than iq4_nl_x4.

* q4_0_r4: NEON

Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 17:01:48 +01:00
Kawrakow
6d0462d4a3 IQ4_NL_X4 (#118)
* Adding iq4_nl_x4

Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s
on the Ryzen-7950X! This is faster than any other quant and
~40% faster than iq4_nl.

* iq4_nl_x4: getting amazing

This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s!

* iq4_nl_x4: AVX2

Here we gain only 25% compared to iq4_nl

* iq4_nl_x4: NEON

On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from
82.4 t/s for iq4_nl.

* iq4_nl_x4: minor NEON improvement and cleanup

This gets us to 110.3 t/s. In comparison,
IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s.

* iq4_nl_x4: NEON specialization for matrix x vector

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 07:25:39 +01:00
Nexes the Elder
8ad84b9fab Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K (#116)
2024-11-21 12:01:23 +02:00
Kawrakow
4d2fbde0cb MMQ for Q6_0 (#115)
* MMQ for Q6_0

* Add Q6_0 MMQ to template generator

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-11-21 07:12:11 +01:00
Kawrakow
52874c5d21 Faster MoE inference (#112)
* multi_add: WIP

* multi_add: CPU works

* multi_add: CUDA

* multi_add: simplify

* multi_add: Metal

* Metal: speed up mul_mat_id

For the Granite-1B MoE model PP-512 goes from
156 t/s to 890 t/s, so nearly a 6X speedup!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-31 12:05:27 +01:00
Kawrakow
5ad6439486 Use fused mul - unary op also for MoE models (#111)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 18:23:54 +02:00