Commit Graph

69 Commits

Author SHA1 Message Date
firecoperana
d1f92e24d3 add dry sampler (#513)
* add dry sampler

* use vocab instead of model in dry_init function

* fix compile error for build test

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-19 10:24:53 +03:00
Kawrakow
1d28b2a9a1 Adding top-n-sigma sampler (#489)
* Adding top-n-sigma sampler

* Fix typos in XTC PR

* Update README.md for main and server

* More README

* More README

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 17:35:09 +03:00
Kawrakow
accf69b126 Adding the XTC sampler (#486)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 11:32:03 +03:00
Andrew Chan
25d34e3d2f Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By SIMDifying the search for the
best code with AVX2, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (see the sketch at the end
of this commit message).

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S, when forced to use DMMV, only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we would be at around ~180 t/s on their GPU, almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files
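
The cluster trick mentioned in the "try larger blocks" entry above is a generic nearest-neighbor shortcut; below is a minimal scalar sketch of it (illustrative only - names like `clustered_codebook` and the `Decoder` callback are hypothetical, and the real code SIMDifies the inner search with AVX2):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical layout: the 2^16 candidate points are pre-assigned to clusters.
struct clustered_codebook {
    std::vector<std::vector<float>> centroids;     // one centroid per cluster
    std::vector<std::vector<uint16_t>> members;    // candidate codes per cluster
};

static float dist2(const float * a, const float * b, int n) {
    float d = 0.f;
    for (int i = 0; i < n; ++i) { const float t = a[i] - b[i]; d += t * t; }
    return d;
}

// point(code) is assumed to decode a 16-bit code into its n-dim point.
template <typename Decoder>
static uint16_t nearest_code(const clustered_codebook & cb, const float * x, int n, Decoder point) {
    // 1) find the nearest cluster
    size_t best_c = 0; float best_d = std::numeric_limits<float>::max();
    for (size_t c = 0; c < cb.centroids.size(); ++c) {
        const float d = dist2(x, cb.centroids[c].data(), n);
        if (d < best_d) { best_d = d; best_c = c; }
    }
    // 2) exhaustive search, but only within that cluster
    uint16_t best_code = 0; best_d = std::numeric_limits<float>::max();
    for (uint16_t code : cb.members[best_c]) {
        const float d = dist2(x, point(code).data(), n);
        if (d < best_d) { best_d = d; best_code = code; }
    }
    return best_code;
}
```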

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
Kawrakow
db111c91ee IQ5_KS_R4: row-interleaved IQ5_KS (#426)
* iq5_ks_r4: basics

* iq5_ks_r4: Zen4 works

* iq5_ks_r4: AVX2 works

* iq5_ks_r4: NEON

* Fix iq5_ks on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
90e53a0b8b Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
2e585d4508 Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP

* Enable MLA-3 in crippled GGUFs: seems to work

* Add newly created tensors to model.tensors_by_name

Else they don't get run-time repacked.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:51 +03:00
Kawrakow
aa8ec5dfa6 GPU offload policy (#405)
* Adding GPU offload policy

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:47:46 +03:00
saood06
87bfad8437 Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Kawrakow
5c127b279f LlaMA-4 support (text only) (#321)
* llama4: WIP

* llama4: this seems to be working

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
8210ed4883 Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
79a105d8ab Test transparent huge pages on Linux (#278)
* Adding ability to use THP on Linux

* Use the actual page size used for mmap also in munmap (see the sketch at the end of this commit message)

* Add -thp to llama-bench
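
A minimal sketch of how THP is typically requested and released on Linux (an assumption about the general approach, not necessarily this repo's code; `alloc_thp`/`free_thp` are hypothetical names):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helpers showing the usual mmap + madvise(MADV_HUGEPAGE) pattern.
static void * alloc_thp(size_t size) {
    void * p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, size, MADV_HUGEPAGE);   // ask the kernel to back this range with huge pages
    return p;
}

static void free_thp(void * p, size_t size) {
    if (p) munmap(p, size);            // unmap with the same (page-granular) size as the mapping
}
```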

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:24:43 +01:00
Kawrakow
4158743014 Specify tensor name regex for tensors to be repacked (#274)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 10:51:37 +01:00
Kawrakow
c5e554f941 Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
f8fb8ec9aa Custom quantization rules with regular expressions (#244)
* Custom quantization rules with regular expressions

* Add the --custom-q option to the help

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-07 08:54:09 +02:00
Kawrakow
9424c80ab1 SER - Smart Expert Reduction (#239)
* A better way to measure the cost of ggml_barrier

* Smart expert selection

* Add ser option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-02 13:47:38 +02:00
Kawrakow
e787c00141 Reduce size of compute buffers (#237)
* This reduces compute buffer size for MLA

* This should accomplish it for standard attention

* Much better

* Better concat for contiguous tensors

If all the op does is to concatenate the second tensor
to the first, why would we want to have a loop?
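
A minimal sketch of the point being made (illustrative, not the actual ggml kernel): when the destination and both sources are contiguous and the op only appends the second tensor to the first, the whole concat is two memcpy calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: concatenate two contiguous byte ranges into dst.
static void concat_contiguous(uint8_t * dst,
                              const uint8_t * src0, size_t n0,
                              const uint8_t * src1, size_t n1) {
    std::memcpy(dst,      src0, n0);   // first tensor's bytes
    std::memcpy(dst + n0, src1, n1);   // second tensor appended right after
}
```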

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-01 08:25:27 +02:00
Kawrakow
472b4c37c1 Option to use MLA without a transposed cache (#235)
The `-mla` command line option changes from a bool to an int:
mla = 0: use standard attention
mla = 1: use MLA with transposed cache
mla > 1: use MLA without transposed cache
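
A minimal sketch of how the integer could be interpreted, based only on the description above (the enum and function names are hypothetical, not the repo's code):

```cpp
// Hypothetical interpretation of the -mla integer described above.
enum class mla_mode { standard_attention, mla_transposed_cache, mla_no_transposed_cache };

static mla_mode parse_mla(int mla) {
    if (mla == 0) return mla_mode::standard_attention;        // -mla 0
    if (mla == 1) return mla_mode::mla_transposed_cache;      // -mla 1
    return mla_mode::mla_no_transposed_cache;                 // -mla 2, 3, ...
}
```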

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-27 16:40:49 +02:00
Kawrakow
85c6152e85 Give the user the option to override where model weights are stored (#232)
* Give the user the option to override where model weights are stored

* Fix ggml_nbytes() problem and cleanup

For a tensor with zero elements ggml_nbytes() was returning
uint64_t::max, and this was causing graph allocation failure
(see the sketch at the end of this commit message).

* Add timing info to CUDA graph evaluation

* Add more timing info
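
A sketch of the kind of guard implied by the ggml_nbytes() item above (an assumption about the shape of the fix, not the actual diff; `tensor_nbytes_safe` is a hypothetical wrapper):

```cpp
#include "ggml.h"

// Hypothetical wrapper: report 0 bytes for an empty tensor instead of letting
// the size computation wrap around to a huge unsigned value.
static size_t tensor_nbytes_safe(const struct ggml_tensor * t) {
    if (ggml_nelements(t) == 0) {
        return 0;              // a tensor with zero elements occupies no bytes
    }
    return ggml_nbytes(t);     // regular path for non-empty tensors
}
```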

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-25 17:55:58 +02:00
Kawrakow
b50efcc9d2 Fused MoE ffn_up and ffn_gate (#229)
* Fusing MoE up * unary(gate)

* Fusing MoE up * unary(gate): CUDA

We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite

* On CUDA also fuse MoE down * (up * unary(gate))

in case the MUL_MAT_ID op for the down experts is the next
op in the graph (see the sketch at the end of this commit message).

* Command line option to enable fused MoE up*unary(gate)

* Add fmoe option to llama-bench

* Adding forgotten gelu, relu, silu on ARM
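
For reference, a scalar sketch of the quantity the fused up*unary(gate) op computes per expert row (purely illustrative - the actual kernels are SIMD/CUDA, the unary can be gelu/relu/silu, and the helper name is hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference: h = silu(gate) * up, the tensor that then
// feeds the (optionally fused) down matrix multiplication.
static std::vector<float> fused_up_gate(const std::vector<float> & gate,
                                        const std::vector<float> & up) {
    std::vector<float> h(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g = gate[i];
        h[i] = (g / (1.0f + std::exp(-g))) * up[i];   // silu(g) * up
    }
    return h;
}
```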

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 14:31:11 +02:00
Kawrakow
1140b4568d Q8_KV: 8-bit quantization type targeting the KV cache (#208)
* Adding q8_KV - Basics + AVX2 gemm/gemv

* q8_KV: Better AVX2 gemm

* q8_KV: Better Zen4 gemm

We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-time repacking is at 169 t/s.

* q8_KV: AVX2 gemm/gemv

We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.

* q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per-row scale,
  but it was still lurking in ggml_copy. Not sure if these were the last
  remnants of ggml-style row sizes, or if there are still places left
  (see the sketch at the end of this commit message).
* llama.cpp: get rid of the 1D K-cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per-row meta data as needed
  by q8_KV.

Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.

* q8_KV: be able to use it for K cache in FA

* q8_KV: repack it for K*Q in FA

* q8_KV: slightly faster gemv on Zen4

* q8_KV: slightly faster gemv on Zen4

* q8_KV: ARM_NEON

We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.

* q8_KV: use it in FA on NEON

* q8_KV_r8 - repacked q8_KV

On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).

* q8_KV_r8: don't use nrc_y = 16 on Zen4

This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.

* q8_KV: nrc_y = 16 also doesn't pay off in FA

* Minor
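
A sketch of the row-size point from the ggml item above (illustrative; the exact q8_KV layout is an assumption - here one f32 scale per row of int8 quants):

```cpp
#include <cstddef>
#include <cstdint>

// Classic block quant: every block is self-contained, so the old formula holds.
static size_t block_quant_row_size(int64_t n, int64_t block_size, size_t type_size) {
    return (size_t)(n / block_size) * type_size;
}

// Per-row-scale quant (assumed q8_KV-like layout): n int8 values plus one f32
// scale per row, which n/block_size*type_size cannot express - so generic code
// such as ggml_copy must ask the type for its row size instead.
static size_t q8_KV_row_size(int64_t n) {
    return (size_t)n * sizeof(int8_t) + sizeof(float);
}
```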

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19 11:47:07 +02:00
Kawrakow
3e536b95b0 Add optional MLA (#188)
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead, heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we find the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.
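
A sketch of one way to read the grouping just described (my interpretation only - the actual assignment in the repo may differ): threads are split into groups based on gcd(nhead, nth), and each group takes its own slice of heads.

```cpp
#include <algorithm>
#include <numeric>   // std::gcd (C++17)

// Hypothetical head assignment for thread ith of nth.
struct head_range { int first, last; };

static head_range heads_for_thread(int ith, int nth, int nhead) {
    const int g       = std::gcd(nhead, nth);
    const int ngroups = nth / g;                          // thread groups, g threads each
    const int group   = ith / g;                          // this thread's group
    const int per_grp = (nhead + ngroups - 1) / ngroups;  // heads per group (ceil)
    const int first   = std::min(nhead, group * per_grp);
    const int last    = std::min(nhead, first + per_grp);
    return {first, last};
}
```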

---------

Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 19:48:44 +02:00
Kawrakow
8049ffcbc8 Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189)
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 18:45:28 +02:00
Kawrakow
7c94c3da56 IQ1_M_R4: better 1.75 bpw quants (#187)
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 14:08:52 +02:00
Kawrakow
eb547bad1a IQ1_S_R4: better 1.5 bpw quants (#185)
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 13:49:39 +02:00
Kawrakow
4e84067f88 Update chat templates (#177)
* Adopting chat template stuff from llama.cpp

* Removing missed conflict marker

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24 06:30:10 +02:00
saood06
5c0a01bdaf Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Kawrakow
da3bfd1009 IQ3_S_R4 (#162)
* iq3_s_r4: WIP

* iq3_s_r4: Zen4

* iq3_s_r4: slightly better Zen4

* iq3_s_r4: AVX2

* iq3_s_r4: NEON

* iq3_s_r4: rearrange quants

* iq3_s_r4: rearranged quants - AVX2

* iq3_s_r4: rearranged quants - NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 14:34:23 +01:00
Kawrakow
7598ec79a2 IQ2_S_R4 (#156)
* iq2_s_r4: Zen4

* Minor

* iq2_s_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 11:26:35 +01:00
Kawrakow
6892554e43 IQ2_XS_R4 (#155)
* iq2_xs_r4: Zen4

* iq2_xs_r4: AVX2

* iq2_xs_r4: slightly better matrix x vector on AVX2

* iq2_xs_r4: NEON - not much better than iq2_xs

* iq2_xs_r4: slightly better NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 08:32:39 +01:00
Kawrakow
9dfd69bd93 IQ2_XXS_R4 (#154)
* iq2_xxs_r4: Zen4

Disappointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread

* Minor

* iq2_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 12:02:42 +01:00
Kawrakow
310f8b1d22 IQ3_XXS_R4 (#153)
* iq3_xxs_r4: 1st shot on Zen4

PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s

* iq3_xxs_r4: WIP

* iq3_xxs_r4: 1st shot at AVX2

Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.

* iq3_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 09:12:48 +01:00
Kawrakow
8352f12275 IQ4_KS_R4 (#150)
* iq4_ks_r4: Zen4

* iq4_ks_r4: AVX2

* iq4_ks_r4: WIP

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: NEON

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 19:58:21 +01:00
Kawrakow
c208fec2f2 IQ5_K_R4 (#149)
* iq5_k_r4: Zen4

Much slower than the others.

* iq5_k_r4: WIP

* Minor

* iq5_k_r4: fix AVX2 nrc_y = 1 case

* iq5_k_r4: better Zen4

But TG is still slower than iq5_k

* iq5_k_r4: slightly better AVX2

* iq5_k_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 13:29:25 +01:00
Kawrakow
a648191c2c Be able to repack tensors at run time (#147)
* Be able to repack tensors at run time

* Repack: also add bf16 as repackable type

* Repack: make sure number of rows is a multiple of the packing
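
A sketch of the constraint behind the last item (illustrative only; assumes nrows is already a multiple of the packing): per-thread row ranges must start and end on multiples of the interleaving factor so that no thread splits an interleaved group.

```cpp
#include <algorithm>

// Hypothetical row split: hand out whole interleaved groups to each thread.
struct row_range { int first, last; };

static row_range rows_for_thread(int ith, int nth, int nrows, int packing) {
    const int ngroups    = nrows / packing;               // interleaved row groups
    const int per_thread = (ngroups + nth - 1) / nth;     // groups per thread (ceil)
    const int first      = std::min(ngroups, ith       * per_thread) * packing;
    const int last       = std::min(ngroups, (ith + 1) * per_thread) * packing;
    return {first, last};
}
```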

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 14:16:34 +01:00
Kawrakow
c16d352915 IQ2_K_R4 (#146)
* iq2_k_r4: Zen4

* iq2_k_r4: NEON

* iq2_k_r4: better matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 10:18:33 +01:00
Kawrakow
b52e2e2934 IQ3_K_R4 (#145)
* iq3_k_r4 WIP

* iq3_k_r4: Zen4

* iq3_k_r4: AVX2

* iq3_k_r4: NEON

* iq3_k_r4: faster matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 07:51:11 +01:00
Kawrakow
e811de75e9 BF16_R16 - 16 interleaved bf16 rows (#142)
* Not working bf16_r4

* Adding bf16_r8

Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess this is still sub-optimal.

* bf16_rx: Very slightly faster by interleaving 16 rows

258 t/s -> 263 t/s

* Rename bf16_r4 to bf16_r16

We are interleaving 16 rows now.

* Cleanup unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-15 09:54:21 +01:00
Kawrakow
e885c1e59b Q8_K_R8: Fastest quantized matrix multiplications (#141)
* q8_k_r8: fastest matrix multiplication known to humankind

We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!

* q8_k_r8: AVX2

I was worried that we don't have enough vector registers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.

* q8_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.

* q8_k_r4: go to signed ints

Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8
  (see the sketch at the end of this commit message).
  Not yet tested on the AVX2 machine, but expect a major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorrq_u8()
  needed to convert from unsigned to signed seems to be extremely
  slow on the M2-Max
* We only lose ~0.5% in performance on Zen4 (there the exclusive
  or that we now use to convert from signed to unsigned seems to be
  much faster than on M2-Max)

* Shut up useless compiler warnings
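
The usual AVX2 pattern behind the signed-int item above looks like this (an assumption about the exact code in the repo - the intrinsics are real, the helper name is hypothetical): move the sign of x onto y with _mm256_sign_epi8 and feed |x| as the unsigned operand of _mm256_maddubs_epi16, so the 16-bit pair sums cannot overflow.

```cpp
#include <immintrin.h>

// Hypothetical helper: pairwise i8 dot products without u8*i8 overflow.
static inline __m256i dot_i8_pairs(__m256i x, __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x);   // |x|, safe as the "unsigned" operand
    const __m256i sy = _mm256_sign_epi8(y, x);   // y with x's sign transferred onto it
    return _mm256_maddubs_epi16(ax, sy);         // pairs of |x|*sign(x)*y = x*y, summed to i16
}
```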

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-14 09:24:30 +01:00
Kawrakow
ce97b0325e IQ4_K_R4 (#138)
* iq4_k_r4: WIP

* iq4_k_r4: Zen4 and hopefully AVX2

On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6% performance penalty.

* iq4_k_r4: AVX2

PP-512 = 227.60 t/s. The shifts are really costly.

* iq4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-12 16:04:20 +01:00
Kawrakow
0f6621d410 Q2_K_R4 (#136)
* q2_k_r4: Zen4

PP-512(LLaMA-3.1-8B) = 256 t/s

* q3_k_r4: AVX2

* q2_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 287 t/s.

Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow
forgot to push upstream.

* q2_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.2 t/s.
TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.

* Make sure rows per thread are a multiple of 4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 18:16:49 +01:00
Kawrakow
4872f2f57e Q3_K_R4 (#134)
* q3_k_r4: Zen4 works, but not as good as it should be

238 t/s, so slightly slower than q6_k_r4.

* q3_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.9 t/s.
This is 1.93X faster than q3_K_S!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 11:19:00 +01:00
Kawrakow
e78e47b857 Q5_K_R4 (#132)
* q5_k_r4: WIP

* q5_k_r4: Zen4 and AVX2

We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4.
Q5_K_S has PP-512 = 190 t/s.

* q5_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 96.1 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 18:13:47 +01:00
Kawrakow
b7e2f656f5 Q6_K_R4 (#130)
* Adding q6_k_r4

* q6_k_r4: 1st functional AVX2 version

* q6_k_r4: AVX2 and simple Zen4

"Simple" as in processing 4 instead of 8 rows at once.
On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs
195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s
vs 5.38 t/s for Q6_K.

* q6_k_r4: 1st NEON version

PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K.
TG-128 is slightly lower than q6_K for low numbers of threads,
and becomes very slightly better at 8 threads.

* q6_k_r4: slightly faster NEON

PP-512(LLaMA-3.1-8B) = 83.25 t/s

* q6_k_r4: slightly faster Zen4

238.3 t/s -> 243.2 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 12:26:40 +01:00
Kawrakow
13126ce100 Q4_K_R4 (#129)
* Something is still wrong

* Simply don't see what is wrong

* q4_k_r4: finally works on Zen4

I had forgotten to prevent token_embd.weight being quantized
with q4_k_r4!

* q4_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX.
This is ~30% better than Q4_K_S.

* q4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 110 t/s.
Not quite as good as q4_0_r4, but still a massive
improvement compared to the 69 t/s for q4_K.

* q4_k_r4: slightly better AVX2

PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX

* Minor

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-09 16:59:18 +01:00
Kawrakow
daf5f52022 Rename iq4_nl_x4 to iq4_nl_r4 (#126)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 09:34:42 +01:00
Kawrakow
612a207676 iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)
* Adding iq2_bn_r4

This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.

* Make sure rows per thread are a multiple of the number of interleaved rows

With this I can run iq2_bn_r4 with 32 threads and this increases
PP-512 to 872 t/s.

* iq2_bn_r4: 1st shot at NEON

PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.

* iq2_bn_r4: NEON

PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.

* iq2_bn_r4: Experimenting on NEON

The matrix x vector multiplication is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads
and not 93 as I wrote in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.

* Some cleanup

* iq2_bn_r4: AVX2

As expected, PP is slightly slower as we just don't have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).

* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector

It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.

* iq2_bn_r4: simdify q8_K16 quantization (AVX2)

PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.

* iq2_bn_r4: simdify q8_K16 quantization (NEON)

PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)

* iq2_bn_r4: fix AVX2 after breaking it two commits ago

* iq2_bn_r4: better AVX2

As we don't have enough vector registers on AVX2, it is better
to do two passes per row needing only half of the accumulator
registers that way.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06 12:15:39 +01:00
Kawrakow
9119023a4b IQ4_XS_R4 (#123)
* Adding iq4_xs_r4

This is a 1st working version on Zen4.
We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower
than iq4_nl_x4.

* iq4_xs_r4: WIP

* iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4

* iq4_xs_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max,
up from 68.2 t/s for iq4_xs!

* DRY

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04 15:20:07 +01:00
Kawrakow
bb699e1e6b Q6_0_R4 (#122)
* Adding q6_0_r4

We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X.

* q6_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max.
In terms of ops, q6_0_r4 is identical to q5_0_r4
except for loading the high bits being
vld1q_u8_x2 instead of vld1q_u8. It is strange that
this can make a 5% difference in performance, especially
considering that this is amortized (re-used) over 8 columns
in the right matrix. Or am I running out of vector registers?

* Fix AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 14:48:26 +01:00
Kawrakow
d9593f3689 Q5_0_R4 (#121)
* Adding q5_0_r4

We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X.
We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.

* q5_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are not able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 12:59:22 +01:00