Commit Graph

3582 Commits

Iwan Kawrakow
d29f8d3d40 Add the --custom-q option to the help 2025-03-06 18:38:32 +02:00
Iwan Kawrakow
480c2265cd Custom quantization rules with regular expressions 2025-03-06 17:35:14 +02:00
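The commit above adds custom quantization rules via regular expressions. As an illustrative sketch only (the actual `--custom-q` rule syntax is not shown in this log), each rule can be thought of as pairing a regex on the tensor name with a quant type name, with the first matching rule overriding the default:

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not the actual ik_llama.cpp code: scan the rules in
// order and return the quant type of the first regex that matches the
// tensor name; nullopt means "keep the default quant type".
std::optional<std::string> custom_quant(
        const std::vector<std::pair<std::string, std::string>>& rules,
        const std::string& tensor_name) {
    for (const auto& rule : rules) {
        if (std::regex_search(tensor_name, std::regex(rule.first))) {
            return rule.second;
        }
    }
    return std::nullopt;
}
```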
Kawrakow
7bdbf99bbd DeepSeek CUDA Flash Attention (#241)
* WIP CUDA FA with Dk != Dv

* WIP

* CUDA FA WIP - It actually works!

No TG yet, but for PP I can run FA with fp16 cache and it gets
the same answer.

* CUDA FA WIP - it now works for Q8_0 + Q8_0 for KV cache

* CUDA FA WIP - TG, not working yet.

* CUDA FA with Dk != Dv: it works now for DeepSeek

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-05 07:27:49 +02:00
Kawrakow
a87e54db6e Flash MLA (CPU only) (#240)
* FlashMLA - it finally works (on the CPU)

* FlashMLA: allow for f16 and bf16 cache in addition to q8_0

* It works with ggml FA, not with iqk FA

* WIP

* FlashMLA: it now works with iqk

I had forgotten to divide the Q stride by sizeof(float) and
that's why, very confusingly, it was working for TG but not for PP.

* WIP

* FlashMLA: that should be it for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-03 15:17:51 +02:00
Kawrakow
a89adaa78f SER - Smart Expert Reduction (#239)
* A better way to measure the cost of ggml_barrier

* Smart expert selection

* Add ser option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-02 13:47:38 +02:00
Kawrakow
ef9a3d17b5 A better way to measure the cost of ggml_barrier (#238)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-01 17:12:58 +02:00
Kawrakow
a79ab8f342 Reduce size of compute buffers (#237)
* This reduces compute buffer size for MLA

* This should accomplish it for standard attention

* Much better

* Better concat for contiguous tensors

If all the op does is concatenate the second tensor
to the first, why would we want to have a loop?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-01 08:25:27 +02:00
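The point about the concat loop can be sketched as follows (a minimal illustration, not the ggml op itself): when both tensors are contiguous and the op simply appends one buffer to the other, two bulk copies replace the per-element loop.

```cpp
#include <cstring>
#include <vector>

// Illustrative sketch: concatenating two contiguous float buffers with two
// memcpy calls instead of iterating over elements.
std::vector<float> concat_contiguous(const std::vector<float>& a,
                                     const std::vector<float>& b) {
    std::vector<float> out(a.size() + b.size());
    std::memcpy(out.data(), a.data(), a.size() * sizeof(float));
    std::memcpy(out.data() + a.size(), b.data(), b.size() * sizeof(float));
    return out;
}
```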
Kawrakow
b762db7c92 Option to use MLA without a transposed cache (#235)
The `-mla` command line option changes from a bool to an int:
mla = 0: use standard attention
mla = 1: use MLA with transposed cache
mla > 1: use MLA without transposed cache

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-27 16:40:49 +02:00
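The three-way `-mla` setting described above can be sketched as a simple mapping (names are illustrative, not the actual ik_llama.cpp code):

```cpp
// Hypothetical sketch of how the integer -mla value selects the attention
// path, per the commit message: 0 = standard, 1 = MLA with transposed cache,
// >1 = MLA without transposed cache.
enum class AttnMode { Standard, MLATransposed, MLANoTransposed };

AttnMode attn_mode_from_mla(int mla) {
    if (mla == 0) return AttnMode::Standard;
    if (mla == 1) return AttnMode::MLATransposed;
    return AttnMode::MLANoTransposed;  // mla > 1
}
```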
Kawrakow
51029edfdf Faster MLA on CUDA (#234)
* Slight MLA TG performance improvement on CUDA

The low MLA performance on CUDA is due to
the wk_b * q_nope operation.

It turns into n_head matrix multiplications with
n_head separate quantization and GEMV steps.
The associated overhead is just too much for TG
where each GEMV is very fast (512 x 128 = 131 KFLOP
for DeepSeek-Lite, 4X that for DeepSeekV3/R1).
The way it was done, there was also a copy of each q_nope
row before quantization, which I have now eliminated.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then have
a kernel that does the GEMV for all heads instead of
n_head sequential GEMVs.

* Slightly better

* CUDA: Quantize non-contiguous tensors

* Much better MLA

It is a total hack, but it works.

* Cleanup

Remove duplicated gemv's.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-27 08:42:18 +02:00
Kawrakow
94b659a2f1 Give the user the option to override where model weights are stored (#232)
* Give the user the option to override where model weights are stored

* Fix ggml_nbytes() problem and cleanup

For a tensor with zero elements ggml_nbytes() was returning
uint64_t::max, and this was causing graph allocation failure.

* Add timing info to CUDA graph evaluation

* Add more timing info

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-25 17:55:58 +02:00
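The ggml_nbytes() failure mode described above (uint64_t::max for a zero-element tensor) is typical of unsigned arithmetic of the form (ne - 1)/block_size + 1. A minimal sketch of the bug and the guard, with illustrative function names rather than the actual ggml code:

```cpp
#include <cstdint>

// Sketch of the failure mode: with ne == 0, (ne - 1) wraps to a huge unsigned
// value, so the computed byte count is garbage and graph allocation fails.
uint64_t nbytes_buggy(uint64_t ne, uint64_t block_size, uint64_t type_size) {
    return ((ne - 1)/block_size + 1) * type_size;
}

// Guarding the zero-element case first returns a sane size of 0.
uint64_t nbytes_fixed(uint64_t ne, uint64_t block_size, uint64_t type_size) {
    if (ne == 0) return 0;
    return ((ne - 1)/block_size + 1) * type_size;
}
```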
Kawrakow
547eee81d9 Fix #230 (#231)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-24 09:29:58 +02:00
Kawrakow
ac1d259b93 Fused MoE ffn_up and ffn_gate (#229)
* Fusing MoE up * unary(gate)

* Fusing MoE up * unary(gate): CUDA

We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite

* On CUDA also fuse MoE down * (up * unary(gate))

in case the MUL_MAT_ID op for the down experts is the next
op in the graph.

* Command line option to enable fused MoE up*unary(gate)

* Add fmoe option to llama-bench

* Adding forgotten gelu, relu, silu on ARM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 14:31:11 +02:00
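Per element, the fused MoE op above computes up * unary(gate) in a single pass instead of separate graph ops for the activation and the multiply. A minimal sketch with SiLU as the unary (function names are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x).
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Illustrative fused elementwise pass: y[i] = up[i] * silu(gate[i]).
std::vector<float> fused_up_gate(const std::vector<float>& up,
                                 const std::vector<float>& gate) {
    std::vector<float> y(up.size());
    for (std::size_t i = 0; i < up.size(); ++i) {
        y[i] = up[i] * silu(gate[i]);
    }
    return y;
}
```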
saood06
46bf73a37f Add new sweep-bench benchmark (#225)
* examples : add new sweep-bench benchmark

* Change documentation to reference ik_llama.cpp

* Made it compile with ik_llama

* Fix JSONL output

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-23 00:16:27 -06:00
Kawrakow
71b7b510c2 Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 08:02:16 +02:00
Kawrakow
4926105844 Fix #217 (#220)
* Fix #217

* Remove stuff committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22 14:25:38 +02:00
Kawrakow
33646fc409 Fuse MoE up and gate matrix multiplications (#219)
* This seems to be a better way

to do the attention matrix multiplications in the TG case.

* Cleanup

* Fuse up and gate gemms in MoE models

Small (~1-2%) but measurable performance gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22 09:41:40 +02:00
Kawrakow
c4a5103299 Better strategy for attention matrix multiplications when generating tokens (#218)
* This seems to be a better way

to do the attention matrix multiplications in the TG case.

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22 09:38:51 +02:00
Kawrakow
b9a6639ac3 Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD (#216)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-21 15:33:25 +02:00
Iwan Kawrakow
4b45b82e67 Honor attn_output specified in the command line also for low-bit quants 2025-02-20 17:42:07 +02:00
Kawrakow
a45da7bfbf Fix NEON gemm/gemv for legacy quants when row size is not divisible by 128 (#213)
* Fix gemm/gemv for legacy quants when row size is not divisible by 128

* Fix typo

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-20 13:55:13 +02:00
Kawrakow
498a582919 Optimized GEMM/GEMV for IQ1_S (#212)
* Adding iq1_s to iqk_mul_mat (Zen4)

* iq1_s: slightly better on Zen4

* iq1_s: AVX2

* iq1s: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-20 12:41:45 +02:00
Kawrakow
a0ebfdd661 Q8_KV: 8-bit quantization type targeting the KV cache (#208)
* Adding q8_KV - Basics + AVX2 gemm/gemv

* q8_KV: Better AVX2 gemm

* q8_KV: Better Zen4 gemm

We get 225.7 t/s for L3-8B. In comparison, q8_0 without
run-time repacking is at 169 t/s.

* q8_KV: AVX2 gemm/gemv

We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.

* q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per-row scale,
  but it was still lurking in ggml_copy. Not sure if these were the last
  remnants of ggml-style row sizes, or if there are still places left
* llama.cpp: get rid of the 1d K cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per-row metadata as needed
  by q8_KV.

Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.

* q8_KV: be able to use it for K cache in FA

* q8_KV: repack it for K*Q in FA

* q8_KV: slightly faster gemv on Zen4

* q8_KV: slightly faster gemv on Zen4

* q8_KV: ARM_NEON

We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.

* q8_KV: use it in FA on NEON

* q8_KV_r8 - repacked q8_KV

On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).

* q8_KV_r8: don't use nrc_y = 16 on Zen4

This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.

* q8_KV: nrc_y = 16 also doesn't pay off in FA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19 11:47:07 +02:00
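The row-size point in the commit above can be made concrete: for block quants the ggml-style row size is n/block_size*type_size, while a type with per-row metadata (q8_KV keeps a scale per row) carries a fixed per-row overhead instead. The byte counts below are illustrative, not the actual q8_KV layout:

```cpp
#include <cstddef>

// Classic ggml-style row size for a block quant.
std::size_t row_size_blockwise(std::size_t n, std::size_t block_size,
                               std::size_t type_size) {
    return n/block_size*type_size;
}

// Row size for a hypothetical type with per-row metadata:
// n elements plus a fixed number of metadata bytes per row.
std::size_t row_size_with_row_meta(std::size_t n, std::size_t bytes_per_elem,
                                   std::size_t meta_bytes) {
    return n*bytes_per_elem + meta_bytes;
}
```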
Kawrakow
047ba895bb Repack also experts (#210)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19 10:01:49 +02:00
Iwan Kawrakow
d44aba79ea Bug fix in activation quantization
I added a change in the last PR to how activations are quantized.
It looked like it was working and slightly improving performance.
But I have now hit an edge case where I get gibberish that goes away if
I remove the change. I absolutely don't see what goes wrong, so
for now the change stays in the code, commented out.
2025-02-15 19:50:53 +02:00
Kawrakow
0551e7630b Moving 4D gemm logic from ggml.c to iqk_mul_mat.cpp (#207)
This allows us to optimize TG performance for GQA models.
E.g., for IQ4_XS L3-8B at 8k context, TG-64 goes from 8.6 to 10.26 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-15 08:45:45 +02:00
Kawrakow
8e94b29e35 MLA: allow Q8_0 K-cache for MLA (#206)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13 14:44:33 +02:00
Kawrakow
05242ff17d Faster MLA prompt processing (#205)
* Do not allocate / report caches that are not used

It is either the standard KV cache or MLA cache, not both.

* Rename X_pe to X_rope

Much easier to follow, at least for my brain, when we have
  X_rope : rotational position encoding
  X_nope :         no position encoding
instead of X_pe and X_nope, where I was left wondering wtf
'pe' and 'nope' stand for.

* WIP

* WIP

* WIP

* WIP

* Warn user when disabling MLA

* MLA: compile time option to not use transposed KV cache

Cuts KV cache size in nearly half at the expense of slower
TG performance for long contexts (it becomes similar to
no-MLA).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13 11:50:20 +02:00
Kawrakow
1bbb543478 Fix iqk_mul_mat on AVX512 systems that are missing BF16 support (#204)
* Fix iqk_mul_mat on AVX512 systems that are missing BF16 support

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12 14:22:26 +02:00
Kawrakow
e974fc9e66 Fix imatrix overprotectiveness (#202)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12 07:20:38 +02:00
Kawrakow
3c98bfb33d DeepSeek FA support (CPU only) (#200)
* Adding support for K head size != V head size

This is relevant for DeepSeek models.
At this point ggml CPU FA works.
Now I need to go and change iqk FA to make it work
with Dk != Dv.

* iqk support for K head size != V head size

To not have compilation time explode, just
Dk = 192, Dv = 128 for now (DeepSeek)

* FA: very slightly faster for nq = 1 (TG)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-11 14:46:30 +02:00
saood06
a366a3d17d Load all MoE experts during warmup and make warmup 1 token (#198)
* Load all MoE experts during warmup

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Unify warmup to one token

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-10 17:40:38 +02:00
Kawrakow
c12f73ba61 Add optional MLA (#188)
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead, heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we find the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.

---------

Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 19:48:44 +02:00
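The gcd-based thread grouping mentioned above can be sketched as follows. Reading the commit message as "groups of nth/gcd threads" (an assumption on my part), with g = gcd(nhead, nth) the nth threads form g groups of nth/g threads, and each group processes nhead/g heads, so every head is covered exactly once:

```cpp
#include <numeric>  // std::gcd (C++17)

// Hypothetical description of the grouping; names are illustrative.
struct HeadSplit {
    int n_groups;           // g = gcd(nhead, nth)
    int threads_per_group;  // nth/g
    int heads_per_group;    // nhead/g
};

HeadSplit split_heads(int nhead, int nth) {
    int g = std::gcd(nhead, nth);
    return { g, nth/g, nhead/g };
}
```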
Kawrakow
cae2b81155 FA: Add option to build all FA kernels (#197)
Similar to the CUDA situation.
It is OFF by default.
If OFF, only F16, Q8_0, Q6_0, and, if the CPU provides native
BF16 support, BF16 FA kernels will be included.
To enable all, cmake -DGGML_IQK_FA_ALL_QUANTS=1 ...
This cuts compilation time for iqk_mul_mat.cpp by almost half
(45 seconds vs 81 seconds on my Ryzen-7950X).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 18:59:33 +02:00
Kawrakow
33390c4b74 Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194)
* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4)

* iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4)

* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon)

* iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon)

* Simdify q8_K128 quantization also on Neon

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 09:14:52 +02:00
Kawrakow
6d7b58eade Revert #79 (#192)
* Revert "Do not quantize activations if not necessary (#79)"

This reverts commit 0bf4d99774.

* Fixed compilation after revert

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-08 09:48:59 +02:00
Kawrakow
4601a8c373 cuda: non-contiguous rms norm (#190)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07 08:33:42 +02:00
Kawrakow
b08a2e9dfc Add additional checks for iq1_s_r4 quantization (#191)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07 08:33:28 +02:00
Kawrakow
a08501ee52 Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189)
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 18:45:28 +02:00
Kawrakow
7f61b3068e IQ1_M_R4: better 1.75 bpw quants (#187)
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 14:08:52 +02:00
Kawrakow
a6f9f2ec9a iq1_s_r4: slightly faster NEON gemm/gemv (#186)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 14:45:51 +02:00
Kawrakow
8b7536bda8 IQ1_S_R4: better 1.5 bpw quants (#185)
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 13:49:39 +02:00
Kawrakow
ecf111a11c Deepseek-Lite (#184)
* Quantization mixes tweaks

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make q6_0_w4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q6_0_w4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q5_0_r4 work with row sizes that are not a multiple of 128

... on Zen4 and AVX2

* Make q5,6_0_r4, iq4_nl_e4 work with row sizes that are not a multiple of 128

also on NEON.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30 18:36:24 +02:00
Kawrakow
2e6b523853 Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4 (#182)
* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f8722.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with #181

* Faster AVX2 implementation for q5_k_r4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30 09:28:53 +02:00
Kawrakow
4a73c25002 Various (#181)
* Adding gp option to llama-bench

Similar to pg, but it only looks at TG speed with a given
prompt length.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

They still need to be divisible by 32.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on Zen4.

Also fix q8_0 K-cache for head sizes that are not multiple of 128.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-29 14:05:41 +02:00
Kawrakow
f725576345 Minor performance improvements (#179)
* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2.

This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.

* q4_0_r8 (AVX2)

* q4_0_r8 (NEON)

Tiny bit faster PP (~128 vs ~126 t/s), same TG.

* q4_0_r8 (Zen4)

Somehow only marginally faster?
268 t/s vs 261 t/s

* q4_0_r8 (Zen4) - slightly better

282 t/s for a pure q4_0 L3-8B quantization.

* Apply platform specific modifications when repacking

E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0.
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
  bool argument indicating if the repacking is done online and,
  if so, apply modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply modifications to models that
  have already been repacked while loading the model. Caveat:
  just like rtr, this needs to have mmap disabled (else one would
  need to move the data to a not mmap-ed buffer, so much more
  complicated).

* Apply platform specific modifications when repacking

On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned thus avoiding these operations in matrix
multiplications. With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using q8_0 KV-cache.

* Process up to 16 columns per kernel call for q8_k_r8

This brings PP-512 up to 389 t/s.

* Be able to load Deepseek-v2-Lite

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27 18:53:47 +02:00
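The "pre-apply q ^ 0x88 to q4_0" trick mentioned in the commit above works because q4_0 packs two 4-bit quants per byte, each an unsigned nibble v representing v - 8. XOR with 0x88 flips the top bit of both nibbles, turning each unsigned nibble into the 4-bit two's-complement encoding of v - 8, so the offset is applied once at repack time instead of in every GEMM call. A one-line sketch (the function name is illustrative):

```cpp
#include <cstdint>

// Flip the high bit of both packed nibbles: unsigned nibble v (meaning v - 8)
// becomes the 4-bit two's-complement encoding of v - 8.
inline uint8_t pre_apply_q4_0_offset(uint8_t packed) { return packed ^ 0x88; }
```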
Kawrakow
d9c4ea48d1 Interleave 8 rows (Q8_0, IQ4_XS) (#178)
* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2.

This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27 16:50:07 +02:00
Kawrakow
814d3e054c Update chat templates (#177)
* Adopting chat template stuff from llama.cpp

* Removing missed conflict marker

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24 06:30:10 +02:00
saood06
2195632581 Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Iwan Kawrakow
c2624b2fd3 Add Deepseek-R1-Distill pre-tokenizer 2025-01-23 13:10:03 +02:00
Kawrakow
dbf5d31d01 Better BF16 support on AVX2 (#175)
* Adding BF16 support for AVX2

PP performance is the same as fp16 (~153 t/s on Ryzen-5975WX),
but TG is quite a bit lower (3.65 t/s vs 4.72 t/s at 8 threads).
Why?

* Slightly faster fp16/bf16 gemv on AVX2

It still saturates at the same lower performance for bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-22 12:13:55 +02:00