Commit Graph

164 Commits

Author SHA1 Message Date
Iwan Kawrakow
8b102792d8 iq4_nl
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
2025-06-21 15:13:45 +02:00
Iwan Kawrakow
a78bed0f8e q8_0
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
2025-06-21 12:43:30 +02:00
Iwan Kawrakow
a834e4be15 q6_0
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
2025-06-21 12:33:52 +02:00
Iwan Kawrakow
f8efac6295 q5_0
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
2025-06-21 12:21:02 +02:00
Iwan Kawrakow
1f31789b96 q4_0
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
2025-06-21 11:58:38 +02:00
Kawrakow
1843ed22c5 New integer trellis on ARM_NEON (#544)
* Adapt iq3_kt to new trellis on NEON

* iq3_kt is now working on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-20 09:26:36 +03:00
Kawrakow
d85c64428e New IQ2_KT, IQ3_KT and IQ4_KT, V2 (#529)
* New iq4_kt trellis

The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f33f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.

* Something is not working with the AVX2 dot product

* New iq4_kt: CUDA MMVQ

* New iq4_kt: CUDA MMQ

* For now have only iq4_kt use the new trellis

* Fix iq2_kt that got broken along the way

* New iq4_kt: AVX2 dot product finally works

We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.

* New iq4_kt: fix vanilla AVX2

* New iq4_kt: NEON implementation

We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.

* New iq4_kt: slightly faster NEON

* New iq4_kt: slightly faster NEON

* New iq4_kt: faster NEON

We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.

* Minor

* New iq4_kt trellis: not working Metal implementation

* Remove the extra 4 bytes of row meta data that is no longer used

* Cleanup

* Adding forgottent file

* Switching iq2_kt to new trellis - CUDA MMQ

* New iq2_kt: CUDA GEMV

* New iq2_kt: AVX2 dequantize

* New iq2_kt: AVX2 GEMM/GEMV

* Adding forgotten file

* New iq2_kt: NEON GEMM/GEMV

* New iq2_kt: slightly faster NEON GEMM

* New iq2_kt: Metal - very slow.

It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.

* Add missing break

* Trying @louiehelm's multiplier

* CPU

* iq3_kt: use integer trellis + CUDA dequantize and MMVQ

* iq3_kt: MMQ

* iq3_kt: AVX2 GEMM

* iq3_kt: AVX2 GEMV

* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 16:20:54 +03:00
Kawrakow
c410cc72bb Much faster CPU prompt processing (part 3) (#534)
* Repack q4_0 and q8_0 to q8_0_R8

q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.

* Change q8_2_x4 to store in16_t sums

With that q4_0 now works.
I need to check all quants that use q8_2_x4!

* q5_0 and use a dequntizing template

* q6_0

129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.

* iq4_nl

137 t/s -> 293 t/s. iq4_nl is at 251 t/s.

* q4_1: 135 t/s -> 262 t/s

* q5_1: 125 t/s -> 253 t/s

* iq3_xs

178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.

* q2_K

202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 15:30:56 +03:00
Kawrakow
dc96820ddb Much faster CPU prompt processing (part 2) (#533)
* iq4_ks

203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.

* iq4_k

175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.

PPL is actually lower!

* iq5_ks

180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.

PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct

* iq5_k - accuracy loss is too big

* iq5_k - there was a bug with the shifts

...and that's why PPL was so high. It is also high on main.
This fixes it.

* iq6_k

148 t/s -> 350 t/s. There is no iq6_k_r4

PPL is actually lower because we have a bug in the existing
implementation!

* iq3_k

169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.

* iq2_k

190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.

* iq2_ks

200 t/s -> 367 t/s. There is no iq2_ks_r4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 07:29:33 +03:00
Kawrakow
0f8f8b32e2 Much faster CPU prompt processing (part 1) (#531)
* q6_K dequantizing GEMM

* Much easier: just use different vec_dot types!

* WIP

* Finally q6_K x q8_2_x4 dot product works

* Very slightly better

* We don't need the changes in ggml.c

* Fix AVX2

* iq2_xs

* Fix AVX2

* iq2_s

* q3_K

* Fix q8_k_r8 on Zen4

* q3_K: repack to q8_k_r8 instead of q8_0_r8

With that we hit 360 t/s for LlaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.

* q3_K: don't scale when all quants in a block are <= 127 when repacking

* iq2_s: repack to q8_k_r8 instead of q8_0_r8

* iq2_xs: rapck to q8_k_r8

* WIP

* iq2_xs: repack to q8_k_r8

* iq3_xxs: repack to q8_k_r8

* iq3_s: use q8_k_r8

* iq1_s: repack to q8_k_r8

* iq1_m: repack to q8_k_r8

* iq1_m: slightly faster

* Slightly faster

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-17 07:12:48 +03:00
Kawrakow
6fc5bbb657 Call iqk_convert_repack in MoE GEMM (#528)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-14 05:52:46 +03:00
Kawrakow
066ed4fd11 Faster CPU prompt processing for Q4_K and Q5_K (#525)
* q4_K: dequantize to q8_1_r8 for batch >= 32

We get 268 t/s, up from 186 t/s.

* q4_K: GEMM with q8_2_X4

* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8

* Remove the scales, they are not needed

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-13 07:58:15 +03:00
Kawrakow
4fc3cb4a47 iq3_s: much faster GEMM via repacking to q8_0_r8 (#518)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-12 08:16:12 +03:00
Kawrakow
3f54b49786 Faster iq1_s GEMM via repacking to Q8_0_R8 (#517)
TG is slightly faster too - 24.4 vs 23.1 t/s on the
Ryzen-5975WX

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 15:01:34 +03:00
Kawrakow
69af3f5990 Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2) (#516)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 13:05:26 +03:00
Kawrakow
e56061fa12 IQ2_XXS: much faster CPU prompt processing (#515)
* Much faster iq2_xxs GEMM

PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.

* iq2_xxs: q8_2_x4 GEMM

* iq2_xxs: use template for q8_2_x4 GEMM

* Fix AVX2

* Cleanup

* NEON is not working yet, so still use Q8_K GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 11:12:30 +03:00
Kawrakow
0b10f7418f Faster CPU prompt processing for Trellis quants and MoE models (#488)
* Also do the dequantize approach for mul_mat_id

* Also do the dequantize approach for iqk_moe_fused_up_gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:30:35 +03:00
Kawrakow
3df1a3a44d Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM

For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s

Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after mltiplying with the row scale
before doing the multiply-adds.

* Experimenting with dequant + f16 GEMM on NEON

iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:05 +03:00
Kawrakow
1eac9e8487 NEON implementation for trellis quants (#471)
* iq2_kt: NEON implementation

* iq3_kt: NEON implementation

* iq4_kt: not working NEON implementation

* iq4_kt: NEON implementation

Have to use f32 arithmetic else I get gibberish?
Correspondigly ridiculously slow.

* Cleanup

* iq4_kt: slightly faster TG on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-29 18:57:41 +03:00
Andrew Chan
a1c931c30c Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xss in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per groups of 8 the brute force
seach becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group od 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.

* iq2_kt - even better

Re-quantize after determining block scales
(at the epxense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S with forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement of quantization speed by having the
points nelonging to a cluster copied to contiguous memory
during initialization, and then accessed sequantially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8 bit scales ifor blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attemt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It goves a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also notices that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slighty faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
Kawrakow
b94cd3b632 Refactor iqk_mul_mat.cpp (#435)
* Refactor iqk: WIP

* Refactor iqk: Factor out float GEMM (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

* Refactor iqk: Factor out GEMM for repacked legacy quants

* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

* Refactor iqk: Factor out GEMM for repacked i-quants

* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

* Refactor iqk: factor out 1-bit quants (NEON)

* Refactor iqk: factor out k-quants (NEON)

* Refactor iqk: factor out floats (NEON)

* Also iq4_xs belongs to k-quants

* Refactor iqk: factor out iqk quants (NEON)

* Refactor iqk: factor out legacy quants (NEON)

* Refactor iqk: factor out repacked legacy quants (NEON)

* Refactor iqk: factor out repacked k-quants (NEON)

* Refactor iqk: factor out repacked iqk quants (NEON)

* Refactor iqk: GEMM kernels are refactored on NEON

* Refactor iqk: FA compiles

If it works is a different story.
Current compile time: 107.3 sesonds on the Ryzen-7950X

* Refactor iqk: FA refactored (Zen4)

Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.

* Adding forgotten file

* Most helpers don't need to be templates

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 second on the Ryzen-5975WX

* Fix bf16

* Refactor iqk: FA refactored (NEON)

* Forgotten MMQ ref and typo (#431)

* Adding forgotten iq5_k_r4

* Fix iq4_k_r4 on NEON

* Fix iq4_ks on NEON

It was broken before the refactoring (the shifts were not correctly
applied).

* Fix q8_0 on NEON

* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
2025-05-22 10:05:51 +03:00
Kawrakow
b3036a872f Option to enable disable the IQK CPU FA kernels (#429)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 11:21:58 +03:00
Kawrakow
c35a383bcd Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428)
* Zen4: faster PP for iq4_ks and iq5_ks

* Zen4: faster PP for iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 10:42:33 +03:00
Kawrakow
7abdf2b099 IQ5_KS_R4: row-interleaved IQ5_KS (#426)
* iq5_ks_r4: basics

* iq5_ks_r4: Zen4 works

* iq5_ks_r4: AVX2 works

* iq5_ks_r4: NEON

* Fix iq5_ks on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
134d548173 Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427)
* Fix IQ4_K on AVX2

* Fix IQ4_KS on AVX2

* Fix IQ5_K on AVX2

* Fix IQ6_K on AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 17:25:15 +03:00
Kawrakow
3d92d7f802 Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But is is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
3f8c865b92 Fix standard attention on the CPU (#421)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
13740622e9 Fix SER (CPU) (#415)
* Fixing SER bugs

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:55:04 +03:00
Kawrakow
553c08b6b4 Better CPU FA performance for DeepSeek-Lite (#410)
* Better CPU FA performance for DeepSeek-Lite

* It must be like this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:20 +03:00
Kawrakow
8a5c0410e1 Fix DeepSeek q8_0 cache (#391)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
090eae4d69 Fix build for Xeon Gold 6226R (#390)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
1ea1df4b2d Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
Kawrakow
4c2bee0bed Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
cda24b58cb CPU FA improvements (#351)
* FA: provide work buffer for K repacking

* Add header to avoid comp0iler warnings

* WIP

* WIP

* WIP

* WIP

* Slightly better

* WIP (Zen4)

* WIP

* Try to improve for unusual number of heads/number of threads

* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA

* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA

* Use Sum4q4 for q4_0

* WIP

* WIP

* Much better FA TG with q8_0 KV cache

Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
Kawrakow
715fc552ad Add support for Cohere2 (#341)
* Add support for Cohere2

* Fixe IQ4_NL on AVX2

* Command-A needs fp32 precision for K*Q

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
25d1a0dca8 Fix FA on ARM (#346)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
saood06
93cd77b655 Fix termux/android build (#336)
* Attempt fix

* Attempt fix 2

* Attempt fix 3

* Attempt fix 4

* Attempt fix 5

* Attempt fix 6

* Attempt fix 7

* Attempt fix 8

* Attempt fix 9

* Attempt fix 10

* Attempt fix 11

* Attempt fix 12

* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
3bb64d9330 Better TG performance for GQA models (CPU) (#332)
* Slightly better CPU TG performance for GQA

* Better CPU FA implementation for TG when GQA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
f7c5a94e75 Better gemm/gemv on AVX2 fr q4_0_r8 (#331)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
2ee6263e24 Fix GCC compilation errors on ARM (#309)
* Fix GCC compilation errors on ARM

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
6e5156cab5 Fix #300 (#301)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:25 +02:00
Kawrakow
d0b52076da Use bf16 instead of fp16 block scales for q8_1 (#292)
* WIP - not working

* q8_0 without bells and wistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* q8_0_r8 on avx2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 05:49:16 +01:00
Kawrakow
f9307d7907 Improve DeepSeek batched processing speed (#282)
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 17:10:52 +01:00
Kawrakow
b8d1fac97b Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
3d85a1d663 Better FlashMLA (#243)
* This is a better FA for TG

It should benefit MLA and GQA. Tested to work with
DeepSeek-Lite MLA, not yet for GQA.
For tg64@pp8192 it is ~13% faster than MLA without FA,
and 57% faster that the main branch FA.

* WIP

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-07 09:46:58 +02:00
Kawrakow
a87e54db6e Flash MLA (CPU only) (#240)
* FlashMLA - it finally works (on the CPU)

* FlashMLA: allow for f16 and bf16 cache in addition to q8_0

* It works with ggml FA, not with iqk FA

* WIP

* FlashMLA: it now works with iqk

I had forgotten to divide the Q stride by sizeof(float) and
that's why, very cobfusingly, it was working for TG but not for PP.

* WIP

* FlashMLA: that should be it for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-03 15:17:51 +02:00
Kawrakow
547eee81d9 Fix #230 (#231)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-24 09:29:58 +02:00
Kawrakow
ac1d259b93 Fused MoE ffn_up and ffn_gate (#229)
* Fusing MoE up * unary(gate)

* Fusing MoE up * unary(gate): CUDA

We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite

* On CUDA also fuse MoE down * (up * unary(gate))

in case the MUL_MAT_ID op for the down experts is the next
op in the graph.

* Command line option to enable fused MoE up*unary(gate)

* Add fmoe option to llama-bench

* Adding forgotten gelu, relu, silu on ARM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 14:31:11 +02:00
Kawrakow
71b7b510c2 Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 08:02:16 +02:00
Kawrakow
4926105844 Fix #217 (#220)
* Fix #217

* Remove stuff commited by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22 14:25:38 +02:00