Commit Graph

3277 Commits

Author SHA1 Message Date
Kawrakow
f6bfdce911 Bitnet(2.25 bpw): faster Metal dot product
With this we get TG-128 = 97 t/s.
2024-06-22 12:02:52 +03:00
Kawrakow
f200d36a7f Bitnet(2.25 bpw): Metal
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(to not say it does not exist).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
2024-06-22 12:02:52 +03:00
Kawrakow
ff718c2dc1 Bitnet(2.25 bpw): CUDA
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
2024-06-22 12:02:52 +03:00
Kawrakow
766975ecfa Bitnet(2.25 bpw): NEON
We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22 12:02:52 +03:00
Kawrakow
39982764d7 Bitnet: 2.25 bpw version
Just scaler and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzn-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Kawrakow
68741281e5 bitnet 2 bpw: NEON implementation
We get PP-512 = 190 t/s and TG-128 = 75 t/s.
2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22 12:02:52 +03:00
Kawrakow
a8521b73d7 Removed extra column 2024-06-22 12:02:52 +03:00
Kawrakow
8ca1bdebe4 bitnet 2 bpw: AVX2 implementation
We get PP-512 = 322 t/s.
TG is already 51.6 t/s at 4 threads, then it saturates and
starts going down for more than 8 threads.
2024-06-22 12:02:52 +03:00
Kawrakow
318899c8b7 bitnet: add 2 bpw quantization
The scalar dot product already chieves 37 t/s for TG!
2024-06-22 12:02:51 +03:00
Kawrakow
f9ba085ef7 Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice 2024-06-22 12:02:51 +03:00
Kawrakow
0efd620d01 iqk_mul_mat(bitnet): fix typo
With the last change (which added the typo), I'm now getting
PP-512 = 300 t/s on the Ryzen-5975WX.
2024-06-22 12:02:51 +03:00
Kawrakow
7b3cb2b96c iqk_mul_mat(bitnet): slightly faster AVX2
We now get 214 t/s on the Ryzen-7950X
2024-06-22 12:02:51 +03:00
Kawrakow
e6d8441397 iq1_bn: better NEON implementation
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).

I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.
2024-06-22 12:02:51 +03:00
Kawrakow
3686304e03 iq1_bn(NEON): works now, but very slow
Basically 2X slower tan q4_0.
2024-06-22 12:02:51 +03:00
Kawrakow
798697a6ff iq1_bn(Metal): 66.2 -> 67.1 t/s 2024-06-22 12:02:51 +03:00
Kawrakow
bd266036b6 iq1_bn(Metal): 64 -> 66.2 t/s for TG
This should be good enough. One cannot ask
Apple Silicon to do too much work.
2024-06-22 12:02:51 +03:00
Kawrakow
7cb77d7a67 iq1_bn(Metal): 64 -> 66.2 t/s for TG 2024-06-22 12:02:51 +03:00
Kawrakow
04fed5cd9f iq1_bn(Metal): 60 -> 64 t/s for TG 2024-06-22 12:02:51 +03:00
Kawrakow
5d14a2243e iq1_bn: very slightly better Metal dot product 2024-06-22 12:02:51 +03:00
Kawrakow
15e1aec7a5 iq1_bn: Metal now works
PP performance is decent (668 t/s v 724 t/s for q4_0),
but TG is kind of low (60 t/s vs 81 t/s for q4_0).
2024-06-22 12:02:51 +03:00
Kawrakow
4b64224645 iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working 2024-06-22 12:02:51 +03:00
Kawrakow
77d8637925 iqk_mul_mat(iq1_bn): WIP NEON (not working) 2024-06-22 12:02:51 +03:00
Kawrakow
dfdc4dbee6 iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2
I now get PP-512 = 270 t/s on the Ryzen-5975WX
2024-06-22 12:02:51 +03:00
Kawrakow
dff96fb5f8 iqk_mul_mat: improve iq1_bn (bitnet) on AVX2
We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Kawrakow
b0967ffa79 bitnet: fix scalar dot product
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 12:02:51 +03:00
Kawrakow
88e98260bf bitnet: scale is per row, not per tensor 2024-06-22 12:02:51 +03:00
Kawrakow
077270395b iqk_mul_mat: add iq1_bn (bitnet)
We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Kawrakow
eecd48eab5 bitnet: CUDA, scalar, AVX2 2024-06-22 12:02:51 +03:00
Kawrakow
81576cdcac bitnet: python + llama 2024-06-22 12:02:51 +03:00
Kawrakow
f9490aea46 iqk_mul_mat: cleanup 2024-06-22 12:02:50 +03:00
Kawrakow
389e6220e9 iqk_mul_mat: be independent of llamafile_sgemm
Verified that it works on AVX2.
Also turned on any combination of f16 and f32
(i.e., added f16 x 16 and f32 x f32).
2024-06-22 12:02:50 +03:00
Kawrakow
915a1b2665 iqk_mul_mat: be independent of llamafile_sgemm (WIP)
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because only tested on __aarch64__
2024-06-22 12:02:50 +03:00
Kawrakow
cc628b2e39 Fix nb4 2024-06-22 12:02:50 +03:00
Kawrakow
d41aef5418 iqk_mul_mat: add ability to disable it 2024-06-22 12:02:50 +03:00
Kawrakow
154f56a8de iqk_mul_mat: be able to handle any f16/f32 combination on AVX2
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22 12:02:50 +03:00
Kawrakow
1211a4b5d0 iqk_mul_mat: turn on AVX512
It makes no difference on my Ryzen-7950X, but perhaps
it will be beneficial for CPU's with real AVX512.
2024-06-22 12:02:50 +03:00
Kawrakow
dfcb8bebc5 iqk_mul_mat: slightly better fp16 with 16 vector registers
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX
up from 138 t/s. We use Nx registers to preload the fp16 weights,
so total registers required is Nx * (Ny + 1), so 15 in the case
of of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess, the one spare
register helps. But maybe it is just a matter of how things get
loaded into the cache. On the 7950X I did try 3 x 8 and it did
not perform as well as 5 x 5.
2024-06-22 12:02:50 +03:00
Kawrakow
9dba81ddf2 iqk_mul_mat: better fp16 for AVX2
Basically use what I did for Arm.
Improves PP performance to 141.7 t/s up from 136 t/s
on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling).
This is now 10% faster than tinyBLAS.
There is a minor improvement also on the Ryzen-5975WX
(16 vector registers, so we use 4x3 tiling): we get
138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
2024-06-22 12:02:50 +03:00
Kawrakow
baf6aaa31b iqk_mul_mat: fp16 for Arm
~2% slower than tinyBLAS - not sure why.
2024-06-22 12:02:50 +03:00
Kawrakow
6ec0fcc5c7 iqk_mul_mat: slightly faster FANCY_SIMD dot product
About 2% faster for q4_K.
2024-06-22 12:02:50 +03:00
Kawrakow
5812618409 iqk_mul_mat: fix q8_0
I was happily using _mm256_packs_epi32() to pack the
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But theoretically this can overflow, so it is
better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine
the 4 dot products using int32_t additions. This is (almost)
as fast, unlike _mm256_hadd_epi32(), which seems excessively
slow on the Ryzen-7950X.
2024-06-22 12:02:50 +03:00
Kawrakow
7f91151c2e iqk_mul_mat: decouple from llamafile also in cmake 2024-06-22 12:02:50 +03:00
Kawrakow
8b03121c33 iqk_mul_mat: make it build with the Makefile 2024-06-22 12:02:50 +03:00
Kawrakow
c7870afaad iqk_mul_mat: use block_q8_1_x4 also for AVX2
Here the performance gain is more significant. E.g., for q4_1,
PP-512 becomes 168 t/s up from 137 t/s.
Now the performance gap to q4_0 is so significant that I
wonder if I should change to using Q8_1 also for the
qX_0 legacy quants.
2024-06-22 12:02:50 +03:00
Kawrakow
5b19e5e4a9 iqk_mul_mat: use block_q8_0_x4 also for AVX2 2024-06-22 12:02:50 +03:00
Kawrakow
30a0bf30fa iqk_mul_mat: delete unused stuff 2024-06-22 12:02:50 +03:00
Kawrakow
64da6f7a97 iqk_mul_mat: add q8_0
It was actually ready but not turned on.
Having forgotten, I made a new implementation along the
lines of the fp16 implementation (i.e., using tiling).
That matched tiinyBLAS performance. But the existing
implementation that I now turned on is faster:
PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS
TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)
2024-06-22 12:02:50 +03:00
Kawrakow
f2ced256b4 iqk_mul_mat: fp16 tweaks
Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers).
This works best for the Ryzen-5975WX.
2024-06-22 12:02:50 +03:00
Kawrakow
b4ecd2dce6 iqk_mul_mat: fp16 implementation cleanup
It turns out on my Ryzen-7950X CPU using
AVX512 is slower.
2024-06-22 12:02:50 +03:00
Kawrakow
e0b52e14a6 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tiniBLAS by a
small margin (143 t/s vs 137 t/s for PP-512, TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00