Commit Graph

3555 Commits

Author SHA1 Message Date
Iwan Kawrakow
2d34c55b6f Merge remote-tracking branch 'origin/main' into ik/q4_0_r8 2025-01-27 16:57:34 +02:00
Kawrakow
d9c4ea48d1 Interleave 8 rows (Q8_0, IQ4_XS) (#178)
* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2.

This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point in having the special purpose implementation.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27 16:50:07 +02:00
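A minimal sketch of what "interleaving 8 rows" means in the commit above, assuming a matrix stored row-major as fixed-size blocks. `Block`, `make_matrix`, and `interleave_rows` are hypothetical names for illustration and are not the repository's repacking code. The point is only the storage order: block k of 8 consecutive rows becomes contiguous, so a kernel can read one block column of all 8 rows sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one quantized block (e.g. 32 quants plus a scale).
struct Block {
    int row;    // source row, kept only so the resulting layout can be inspected
    int index;  // block position within that row
};

// Build a row-major matrix of tagged blocks for the demonstration.
std::vector<Block> make_matrix(std::size_t nrows, std::size_t blocks_per_row) {
    std::vector<Block> m;
    m.reserve(nrows * blocks_per_row);
    for (std::size_t r = 0; r < nrows; ++r)
        for (std::size_t b = 0; b < blocks_per_row; ++b)
            m.push_back({static_cast<int>(r), static_cast<int>(b)});
    return m;
}

// Reorder so that block b of R consecutive rows is stored contiguously:
// (r0,b0) (r1,b0) ... (rR-1,b0) (r0,b1) (r1,b1) ...
template <std::size_t R>
std::vector<Block> interleave_rows(const std::vector<Block>& src,
                                   std::size_t nrows, std::size_t blocks_per_row) {
    assert(nrows % R == 0);                       // row count must be a multiple of R
    assert(src.size() == nrows * blocks_per_row);
    std::vector<Block> dst;
    dst.reserve(src.size());
    for (std::size_t r0 = 0; r0 < nrows; r0 += R)            // each group of R rows
        for (std::size_t b = 0; b < blocks_per_row; ++b)      // each block column
            for (std::size_t r = 0; r < R; ++r)               // the R rows of the group
                dst.push_back(src[(r0 + r) * blocks_per_row + b]);
    return dst;
}
```

With `interleave_rows<8>` the first 8 output blocks are block 0 of rows 0 to 7, followed by block 1 of rows 0 to 7. The row-count precondition is why a repacked tensor needs a number of rows that is a multiple of the packing.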
Iwan Kawrakow
fac48faa21 Process up to 16 columns per kernel call for q8_k_r8
This brings PP-512 up to 389 t/s.
2025-01-27 12:39:56 +02:00
Iwan Kawrakow
f1c114d477 Apply platform specific modifications when repacking
On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned thus avoiding these operations in matrix
multiplications. With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using q8_0 KV-cache.
2025-01-27 11:59:30 +02:00
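The idea behind pre-converting signed quants to unsigned can be sketched in scalar code. x86 integer dot-product instructions such as `_mm256_maddubs_epi16` and `_mm512_dpbusd_epi32` multiply unsigned 8-bit values by signed 8-bit values, so one operand has to be unsigned. The sketch below assumes an offset of 128 and uses hypothetical function names; the actual offset and kernel structure in the repository may differ. It relies on the identity sum(q*y) = sum((q+128)*y) - 128*sum(y).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: plain signed x signed dot product.
std::int32_t dot_signed(const std::vector<std::int8_t>& q,
                        const std::vector<std::int8_t>& y) {
    assert(q.size() == y.size());
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < q.size(); ++i)
        sum += std::int32_t(q[i]) * std::int32_t(y[i]);
    return sum;
}

// Done once while repacking: shift quants from [-128, 127] to [0, 255].
std::vector<std::uint8_t> to_unsigned(const std::vector<std::int8_t>& q) {
    std::vector<std::uint8_t> u(q.size());
    for (std::size_t i = 0; i < q.size(); ++i)
        u[i] = static_cast<std::uint8_t>(int(q[i]) + 128);
    return u;
}

// Done per multiplication: unsigned x signed products, then the offset is
// removed using the sum of the activations.
std::int32_t dot_unsigned(const std::vector<std::uint8_t>& u,
                          const std::vector<std::int8_t>& y) {
    assert(u.size() == y.size());
    std::int32_t sum = 0, ysum = 0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        sum  += std::int32_t(u[i]) * std::int32_t(y[i]);
        ysum += y[i];
    }
    return sum - 128 * ysum;
}
```

Because `to_unsigned` runs once at repack time, the per-multiplication work is limited to the products and a single correction term.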
Iwan Kawrakow
8b3c66063f Apply platform specific modifications when repacking
E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0.
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
  bool argument indicating if the repacking is done online and,
  if so, apply modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply modifications to models that
  have already been repacked while loading the model. Caveat:
  just like rtr, this needs to have mmap disabled (otherwise one
  would need to move the data to a non-mmap-ed buffer, which is
  much more complicated).
2025-01-27 11:12:18 +02:00
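Why `q ^ 0x88` helps can be checked in scalar code. In q4_0 each byte packs two 4-bit quants in [0, 15] that represent q - 8. XOR with 0x88 flips the top bit of each nibble, which turns the nibble into the 4-bit two's-complement encoding of q - 8. A kernel can then move the nibble into the top of a signed byte and get 16 * (q - 8) without a separate subtraction. The helper names below are hypothetical, and how the NEON kernels consume the modified bytes is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Usual decoding: nibble value in [0, 15] minus 8.
int decode_lo(std::uint8_t b) { return int(b & 0x0f) - 8; }
int decode_hi(std::uint8_t b) { return int(b >> 4) - 8; }

// Decoding of a byte that was pre-modified as m = b ^ 0x88.
// Placing the nibble in the top 4 bits of a signed byte gives 16 * (q - 8);
// the division by 16 is exact and only there to compare against decode_*.
int decode_lo_modified(std::uint8_t m) {
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(m << 4)) / 16;
}
int decode_hi_modified(std::uint8_t m) {
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(m & 0xf0)) / 16;
}

// Exhaustively verify the equivalence for all 256 byte values.
void check_all_bytes() {
    for (int b = 0; b < 256; ++b) {
        const std::uint8_t orig = static_cast<std::uint8_t>(b);
        const std::uint8_t mod  = static_cast<std::uint8_t>(orig ^ 0x88);
        assert(decode_lo_modified(mod) == decode_lo(orig));
        assert(decode_hi_modified(mod) == decode_hi(orig));
    }
}
```

In a real kernel the factor of 16 would typically be folded into the block scale rather than divided out.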
Iwan Kawrakow
ee8f966202 q4_0_r8 (Zen4) - slightly better
282 t/s for a pure q4_0 L3-8B quantization.
2025-01-27 09:19:13 +02:00
Iwan Kawrakow
17d6c431a3 q4_0_r8 (Zen4)
Somehow only marginally faster?
268 t/s vs 261 t/s
2025-01-27 08:08:18 +02:00
Iwan Kawrakow
afa2323c4c q4_0_r8 (NEON)
Tiny bit faster PP (~128 vs ~126 t/s), same TG.
2025-01-26 17:46:57 +02:00
Iwan Kawrakow
e9c74af22b q4_0_r8 (AVX2) 2025-01-26 16:04:28 +02:00
Iwan Kawrakow
56ca4c3ba9 FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point in having the special purpose implementation.
2025-01-26 12:24:38 +02:00
Iwan Kawrakow
3484ee6ddb Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) 2025-01-26 11:34:57 +02:00
Iwan Kawrakow
cc438189d5 FA: repack Q8_0 to Q8_0_R8 2025-01-26 10:50:48 +02:00
Iwan Kawrakow
4de6088eef 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.
2025-01-26 09:43:22 +02:00
Iwan Kawrakow
45075579ef 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.
2025-01-26 07:45:08 +02:00
Iwan Kawrakow
1774ef6b07 8-rows interleaved q8_0 (Zen4) 2025-01-26 07:11:42 +02:00
Iwan Kawrakow
1053ac50fe 8-rows interleaved q8_0 (AVX2) 2025-01-26 06:24:35 +02:00
Iwan Kawrakow
3bfe569348 Cleanup 2025-01-25 17:22:35 +02:00
Iwan Kawrakow
9354ea22f6 Try interleaving 8 iq4_xs rows
It is also faster on AVX2.

This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.
2025-01-25 15:17:23 +02:00
Iwan Kawrakow
1ac69af2fe Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).
2025-01-25 11:01:44 +02:00
Kawrakow
814d3e054c Update chat templates (#177)
* Adopting chat template stuff from llama.cpp

* Removing missed conflict marker

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24 06:30:10 +02:00
saood06
2195632581 Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Iwan Kawrakow
c2624b2fd3 Add Deepseek-R1-Distill pre-tokenizer 2025-01-23 13:10:03 +02:00
Kawrakow
dbf5d31d01 Better BF16 support on AVX2 (#175)
* Adding BF16 support for AVX2

PP performance is the same as fp16 (~153 t/s on Ryzen-5975WX),
but TG is quite a bit lower (3.65 t/s vs 4.72 t/s at 8 threads).
Why?

* Slightly faster fp16/bf16 gemv on AVX2

It still saturates at the same lower performance for bf16.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-22 12:13:55 +02:00
Kawrakow
6d23495b9b On Zen4 repack fp16 models to bf16_r16 when run-time-repacking is requested (#174)
This massively improves performance. As this is opt-in, we do not worry
about possible precision loss in the f16 -> bf16 conversion.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-21 19:19:38 +02:00
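The precision loss mentioned for the f16 -> bf16 conversion comes from the mantissa width: fp16 stores 10 explicit mantissa bits, bf16 only 7. A small sketch, assuming round-to-nearest-even on the upper 16 bits of an IEEE-754 float; `round_to_bf16` is a hypothetical helper, not the conversion routine used in the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round a float to bf16 precision and widen it back to float.
// NaN inputs are not handled in this sketch.
float round_to_bf16(float x) {
    assert(x == x);                                // reject NaN
    static_assert(sizeof(float) == 4, "expects 32-bit IEEE-754 float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7fffu + ((bits >> 16) & 1u);         // round to nearest, ties to even
    bits &= 0xffff0000u;                           // keep sign, exponent, 7 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

For example, 1 + 2^-10 is exactly representable in fp16 but `round_to_bf16` maps it to 1.0, while 1 + 2^-7 survives unchanged.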
Kawrakow
3c5f87225f More Flash Attention improvements (#173)
* FA: slightly faster V*softmax(K*Q) on Zen4

* FA: it is also faster on AVX2 and ARM_NEON

* Deleted forgotten commented out code

* FA: slightly faster V*softmax(K*Q) also for fp16 K-cache

* FA: slightly faster V*softmax(K*Q) on Zen4

We now get 130.9 t/s for a context of 32k tokens.

* FA: don't store sum scaling factor in SIMD registers

* FA: timing

* FA: faster q8_0 cache via run-time-repacking

On Zen4 q8_0 KV-cache now slightly outperforms BF16.
We get 134 t/s for 32k tokens, which is ~30% better than
the main branch, and ~18% better than the last commit.
We simply repack the K-cache to q8_0_r4 before the K*Q
multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication
template.

* FA: Fix AVX2

* FA: fix ARM_NEON

* FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON

* FA: dedicated mat mul for D = 128 also for ARM_NEON

* FA: turn off performance timer

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-20 08:57:38 +02:00
Kawrakow
0b74397d59 CPU Flash Attention improvements (#172)
* Slightly faster FA for bf16 KV cache

~2-3% sort of thing. Sadly, when we go beyond 8k tokens, the
advantage kind of goes away.

* Slightly faster FA for Q8_0 KV cache

* FA: allow bf16 for V-cache with any supported K-cache

E.g., -ctk q8_0 -ctv bf16 is slightly faster than
-ctk q8_0 -ctv q8_0 on Zen4 for not too long context lengths
(say, <= 4096).

* FA: much better bf16 kv-cache speed for large contexts

We now hit 122 t/s for LLaMA-3.1-8B (quantized as iq4_xs and
run-time-repacked) with a context of 32768. IIRC, the previous
best for such large context was ~90 t/s.
Non-negligible improvement at 16384 and 8192 as well:
173.4 and 214 t/s.

* FA: slightly better quantized kv-cache speed for large contexts

E.g., for q8_0 and context of 32768, we are now at 113 t/s
for LLaMA-3.1-8B.

Also simplified the quantized K*Q multiplication.

* Fix q8_0 KV cache when not using FA - WIP (AVX2)

1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use
   those to quantize activations for quants that use Q8_0 or Q8_1
   as their vec_dot type.
2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1
3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type
4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than
   GGML_TYPE_Q8_0_X4 as the K and V types
5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4
   in iqk_mul_mat

Also added an optimization in ggml_compute_forward_mul_mat when
ne12*ne13 > 1 (K*Q and V*softmax(K*Q)) to process
ne12*ne13/GCD(ne12*ne13, nthread) threads simultaneously using
nthread/GCD(ne12*ne13, nthread) threads per head. This results in
a non-negligible performance gain for large contexts.

Question: why is it not allowed to use quantized V-cache when
not using FA?

* Fix q8_0 KV cache when not using FA - NEON

* Fix AVX2

Again the issue with _mm256_maddubs_epi16 overflowing that I
keep forgetting.

* FA: don't use large Q steps on AVX2 for fp16 K-cache

* On Zen4 it is also better to not use large Q steps for fp16 K-cache

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-15 18:19:22 +02:00
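The GCD-based split described for ggml_compute_forward_mul_mat in #172 can be read in more than one way. The sketch below is one self-consistent interpretation, not the repository's code: threads are divided into GCD(nheads, nthread) groups that run at the same time, each group uses nthread/GCD threads per head, and each group works through nheads/GCD heads in turn. `HeadPlan`, `plan_heads`, and `heads_for_thread` are hypothetical names, and `nheads` stands for ne12*ne13.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct HeadPlan {
    int groups;            // thread groups working at the same time
    int threads_per_head;  // threads cooperating on one head
    int heads_per_group;   // heads each group processes one after another
};

HeadPlan plan_heads(int nheads, int nthread) {
    assert(nheads > 0 && nthread > 0);
    const int g = std::gcd(nheads, nthread);
    return {g, nthread / g, nheads / g};
}

// Heads handled by thread `ith`: thread ith belongs to group ith % groups,
// and that group takes every groups-th head starting at its own index.
std::vector<int> heads_for_thread(int nheads, int nthread, int ith) {
    assert(ith >= 0 && ith < nthread);
    const HeadPlan p = plan_heads(nheads, nthread);
    std::vector<int> heads;
    for (int h = ith % p.groups; h < nheads; h += p.groups) heads.push_back(h);
    return heads;
}
```

With 32 heads and 16 threads this gives 16 groups of one thread, each handling 2 heads. With 8 heads and 16 threads it gives 8 groups of 2 threads, each handling a single head, so no thread sits idle in either case.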
Kawrakow
49b27069fd Fix the strange FA behavior with odd/even batch sizes (#171)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12 16:51:06 +02:00
Kawrakow
c19404bcda MoE fix for R4 quants (#170)
* Fix bug in iqk_mul_mat

I recently added the possibility to have a matrix multiplication
kernel that processes 16 columns in the right matrix per iteration.
This introduced a bug that shows up when batch size is greater
than 16, is not a multiple of 16, and the remainder is not a multiple
of the maximum columns being processed by the regular kernels
(and so, never showed up in my testing using TG-128 and PP-512).

This commit fixes the issue.

* Make sure rows per thread is a multiple of 4 also for MoE when using _r4 quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12 13:19:14 +02:00
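The batch sizes that triggered the iqk_mul_mat bug in #170 can be illustrated by a column tiling sketch. It assumes a 16-column kernel, regular kernels that handle at most 8 columns per call, and a narrower tail call for whatever is left; the width of 8 and the function name `split_columns` are assumptions for illustration, not values taken from the repository.

```cpp
#include <cassert>
#include <vector>

// Split `ncols` right-hand-side columns into per-call tile widths:
// first full `wide` tiles, then full `regular_max` tiles, then one tail tile.
std::vector<int> split_columns(int ncols, int wide = 16, int regular_max = 8) {
    assert(ncols >= 0 && regular_max > 0 && wide >= regular_max);
    std::vector<int> tiles;
    int left = ncols;
    while (left >= wide)        { tiles.push_back(wide);        left -= wide; }
    while (left >= regular_max) { tiles.push_back(regular_max); left -= regular_max; }
    if (left > 0) tiles.push_back(left);  // tail narrower than regular_max
    return tiles;
}
```

Under these assumptions a batch of 27 columns splits into tiles of 16, 8, and 3, which is the kind of case the commit describes: larger than 16, not a multiple of 16, and with a remainder (11) that is not a multiple of the regular kernel width. PP-512 and TG-128 never reach the tail path, since 512 is a multiple of 16 and token generation uses a single column.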
Kawrakow
7553989dd8 Be able to re-quantize MS BitNet I2_S models (#169)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10 18:18:04 +02:00
Kawrakow
b1363b6177 Falcon3 changes (#168)
* Add Falcon3 pre-tokenizer (same as llama3)

* q8_k16: use integer arithmetic to sum row values

The existing implementation that just sums up the f32 quantizations
works fine for the original BitNet models and also for the TriLM
ternary models. But for Falcon3 I see a significant difference between
the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum
up the values in a row, then the CPU-GPU PPL difference becomes much
smaller, and we get a lower PPL than Microsoft BitNet, which claims
to be "lossless".

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10 15:06:00 +02:00
Kawrakow
3e6851621c iq4_0_r4: Use AVX2 version for matrix x vector (#163)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 17:34:08 +01:00
Kawrakow
167479e027 IQ3_S_R4 (#162)
* iq3_s_r4: WIP

* iq3_s_r4: Zen4

* iq3_s_r4: slightly better Zen4

* iq3_s_r4: AVX2

* iq3_s_r4: NEON

* iq3_s_r4: rearrange quants

* iq3_s_r4: rearranged quants - AVX2

* iq3_s_r4: rearranged quants - NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 14:34:23 +01:00
Kawrakow
1a0a35dcd1 MSVC fixes (#161)
Closes #160 

* MSVC fixes

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 07:57:48 +01:00
Kawrakow
3d732cb010 Faster R4 legacy quants (#158)
* q4_0_r4(avx2): convert q8_1 scales with SIMD intrinsics

PP-512 goes to 283 t/s from 265 t/s

* qx_0_r4(AVX2): convert scales with SIMD intrinsics

Also fix q8_0_r4 to not overflow.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22 12:00:22 +01:00
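Overflow in AVX2 integer kernels, as in the `_mm256_maddubs_epi16` note in #172, follows from how that instruction works: it multiplies unsigned 8-bit values by signed 8-bit values and adds adjacent pairs of products into a signed 16-bit lane with saturation. Whether the q8_0_r4 fix in this commit concerns exactly that instruction is an assumption. The scalar emulation below uses the hypothetical names `maddubs_pair` and `maddubs_demo`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar emulation of one 16-bit output lane of _mm256_maddubs_epi16:
// (unsigned a0 * signed b0) + (unsigned a1 * signed b1), saturated to int16.
std::int16_t maddubs_pair(std::uint8_t a0, std::int8_t b0,
                          std::uint8_t a1, std::int8_t b1) {
    const int sum = int(a0) * int(b0) + int(a1) * int(b1);
    return static_cast<std::int16_t>(std::clamp(sum, -32768, 32767));
}

// Usage: the first case fits in 16 bits, the second silently saturates.
void maddubs_demo() {
    assert(maddubs_pair(128, 127, 128, 127) == 32512);  // 2 * 16256, exact
    assert(maddubs_pair(255, 127, 255, 127) == 32767);  // true sum is 64770
}
```

The saturated result is silently wrong rather than an error, so a kernel has to keep the operand ranges small enough that the pairwise sum stays within [-32768, 32767].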
Kawrakow
907cde6be2 R4 i-quants improvements (#157)
* Add nrc_y = 16 implementation.

Here just iq2_s on Zen4. PP-512 goes up to 169.5 t/s from
148.5 t/s. As we are sure that we will be multiplying with 16
columns, we can spend the time to add the mins and make the
iq2_s quants unsigned.

* nrc_y = 16: AVX2 iq2_s

We go from 176.8 to 203.3 t/s.

* nrc_y = 16: NEON iq2_s

We go from 50.4 to 62.3 t/s.
We didn't need to do anything other than set func16 to
mul_mat_iq2_s_r4_q8_k<16>. Even though we don't have nearly
enough vector registers for all accumulators, unpacking and preparing
the iq2_s quants is so expensive that we still gain ~23% in performance
by reusing the unpacked quants 16 times instead of just 8, despite
having to load/unload the accumulated results to/from the
available vector registers.

* nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 76.34 -> 85.33 t/s
iq2_xs:  54.13 -> 67.99 t/s
iq3_xxs: 67.45 -> 73.56 t/s

* nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 195.7 -> 221.8 t/s
iq2_xs : 192.6 -> 220.6 t/s
iq3_xxs: 184.4 -> 206.9 t/s

* r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22 10:52:56 +01:00
Kawrakow
93419de68f IQ2_S_R4 (#156)
* iq2_s_r4: Zen4

* Minor

* iq2_s_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 11:26:35 +01:00
Kawrakow
a867b919ca IQ2_XS_R4 (#155)
* iq2_xs_r4: Zen4

* iq2_xs_r4: AVX2

* iq2_xs_r4: slightly better matrix x vector on AVX2

* iq2_xs_r4: NEON - not much better than iq2_xs

* iq2_xs_r4: slightly better NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 08:32:39 +01:00
Kawrakow
4b53bc876e IQ2_XXS_R4 (#154)
* iq2_xxs_r4: Zen4

Disappointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread

* Minor

* iq2_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 12:02:42 +01:00
Nexes the Elder
7254f6d340 fix typo (#151) 2024-12-20 12:02:15 +01:00
Kawrakow
f4170f78bd IQ3_XXS_R4 (#153)
* iq3_xxs_r4: 1st shot on Zen4

PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s

* iq3_xxs_r4: WIP

* iq3_xxs_r4: 1st shot at AVX2

Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.

* iq3_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 09:12:48 +01:00
Kawrakow
dfa12b7f91 IQ4_KS_R4 (#150)
* iq4_ks_r4: Zen4

* iq4_ks_r4: AVX2

* iq4_ks_r4: WIP

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: NEON

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 19:58:21 +01:00
Kawrakow
59d742b00f IQ5_K_R4 (#149)
* iq5_k_r4: Zen4

Much slower than the others.

* iq5_k_r4: WIP

* Minor

* iq5_k_r4: fix AVX2 nrc_y = 1 case

* iq5_k_r4: better Zen4

But TG is still slower than iq5_k

* iq5_k_r4: slightly better AVX2

* iq5_k_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 13:29:25 +01:00
Kawrakow
9b6d14a299 Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4 (#148)
* Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4

More importantly: simplify.

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 18:55:38 +01:00
Kawrakow
514ae08620 Be able to repack tensors at run time (#147)
* Be able to repack tensors at run time

* Repack: also add bf16 as repackable type

* Repack: make sure number of rows is a multiple of the packing

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 14:16:34 +01:00
Kawrakow
4ade4c568c IQ2_K_R4 (#146)
* iq2_k_r4: Zen4

* iq2_k_r4: NEON

* iq2_k_r4: better matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 10:18:33 +01:00
Kawrakow
d69344f8ea IQ3_K_R4 (#145)
* iq3_k_r4 WIP

* iq3_k_r4: Zen4

* iq3_k_r4: AVX2

* iq3_k_r4: NEON

* iq3_k_r4: faster matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 07:51:11 +01:00
Kawrakow
1714e46f13 Slightly faster IQ4_K_R4 on AVX2/Zen4 (#144)
* iq4_k_r4: slightly better AVX2

227 t/s -> 249 t/s

* iq4_k_r4: slightly better Zen4

232 t/s -> 251 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16 16:09:15 +01:00
Kawrakow
84ab873385 Slightly faster IQ4_XS_R4 on AVX2 (#143)
* iq4_xs_r4: slightly faster and correct AVX2 implementation

* Minor

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16 14:40:10 +01:00
Iwan Kawrakow
987a43c531 q8_k_r8: this change for NEON got lost? 2024-12-16 13:31:21 +01:00
Kawrakow
85c5a1a995 BF16_R16 - 16 interleaved bf16 rows (#142)
* Not working bf16_r4

* Adding bf16_r8

Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess this is still sub-optimal.

* bf16_rx: Very slightly faster by interleaving 16 rows

258 t/s -> 263 t/s

* Rename bf16_r4 to bf16_r16

We are interleaving 16 rows now.

* Cleanup unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-15 09:54:21 +01:00