Commit Graph

3229 Commits

Author SHA1 Message Date
Kawrakow
b4ecd2dce6 iqk_mul_mat: fp16 implementation cleanup
It turns out that, on my Ryzen-7950X CPU, using
AVX512 is slower.
2024-06-22 12:02:50 +03:00
Kawrakow
e0b52e14a6 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512; TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
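
For context, an AVX2 fp16 matmul kernel comes down to widening fp16 lanes to fp32 with F16C and accumulating with FMA. A minimal sketch of that pattern (illustrative only, not the actual iqk_mul_mat kernel; assumes F16C+FMA support and n divisible by 8):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Minimal sketch of an fp16 dot product on AVX2 (requires F16C + FMA).
// Illustrates the convert-then-FMA pattern only; assumes n is a multiple of 8.
static float fp16_dot_avx2(const uint16_t * x, const uint16_t * y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // Convert 8 fp16 values to 8 fp32 lanes.
        __m256 vx = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i)));
        __m256 vy = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i)));
        acc = _mm256_fmadd_ps(vx, vy, acc);   // acc += vx * vy
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```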
Kawrakow
2328da1aa7 iqk_mul_mat: multi-thread quantization also for MoE models 2024-06-22 12:02:50 +03:00
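
Quantizing the activations row by row is embarrassingly parallel, so it can be spread over the worker threads; MoE models simply have more such rows to cover. A minimal sketch of the idea, with a toy int8 stand-in for the real quantize_row_* kernels (all names here are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

// Toy per-row int8 quantizer; stands in for ggml's quantize_row_* kernels.
static void quantize_row(const float * src, int8_t * dst, float * scale, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(src[i]));
    *scale = amax / 127.0f;
    const float id = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < n; ++i) dst[i] = (int8_t)std::lround(src[i] * id);
}

// Minimal sketch: spread row quantization evenly over nth threads.
static void quantize_rows_mt(const float * src, int8_t * dst, float * scales,
                             int nrows, int n_per_row, int nth) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nth; ++t) {
        workers.emplace_back([=]() {
            const int first = (nrows *  t     ) / nth;
            const int last  = (nrows * (t + 1)) / nth;
            for (int r = first; r < last; ++r) {
                quantize_row(src + (size_t)r * n_per_row,
                             dst + (size_t)r * n_per_row, scales + r, n_per_row);
            }
        });
    }
    for (auto & w : workers) w.join();
}
```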
Kawrakow
ea239f8572 iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Kawrakow
5039ea8930 iqk_mul_mat: minor improvements
Current performance:
| model             |       size |  threads |    test |              t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S    |   2.75 GiB |       16 |   pp512 |    100.21 ± 0.32 |
| llama 7B IQ3_XXS  |   2.41 GiB |       16 |   pp512 |    105.25 ± 0.75 |
| llama 7B IQ2_M    |   2.20 GiB |       16 |   pp512 |    117.88 ± 0.15 |
| llama 7B IQ2_XS   |   1.89 GiB |       16 |   pp512 |    136.38 ± 0.24 |
| llama 7B IQ2_XXS  |   1.73 GiB |       16 |   pp512 |    128.47 ± 0.39 |
                                                     mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS  |   1.73 GiB |        8 |   tg128 |     23.94 ± 0.04 |
| llama 7B IQ2_XS   |   1.89 GiB |        8 |   tg128 |     23.27 ± 0.03 |
| llama 7B IQ2_M    |   2.20 GiB |        8 |   tg128 |     18.88 ± 0.03 |
| llama 7B IQ3_XXS  |   2.41 GiB |        8 |   tg128 |     19.07 ± 0.04 |
| llama 7B IQ3_S    |   2.75 GiB |        8 |   tg128 |     15.44 ± 0.05 |
                                                     mean:  20.12
2024-06-22 12:02:50 +03:00
Kawrakow
e85753e1ad iqk_mul_mat: no more templates in the IQ dequantizers
Also moved the quant-specific code from the EvenSignHelper
into the corresponding dequantizers.

These two changes had a tiny performance benefit (much too small
compared to what I was expecting/hoping for).
2024-06-22 12:02:50 +03:00
Kawrakow
b8556267cd iqk_mul_mat: remove template on one of the prepare() functions 2024-06-22 12:02:49 +03:00
Kawrakow
44b1b4fb97 iqk_mul_mat: experimenting with zen4
Nope, we cannot have good performance for iq2_xxs and
iq3_xxs at the same time. If I don't force-inline
the sign functions, I get better performance for iq2_xxs
and bad performance for iq3_xxs. If I force-inline them,
it is the other way around. Anyway, this is what we have
now on Zen4 for all quants with force-inlined EvenSignHelper
methods:

| model            |       size | threads |   test |           t/s |
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |      16 |  pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS |   2.41 GiB |      16 |  pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M   |   2.20 GiB |      16 |  pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS  |   1.89 GiB |      16 |  pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS |   1.73 GiB |      16 |  pp512 | 125.53 ± 0.06 |
                                    arithmetic mean: 116.29
                                    geometric  mean: 115.70
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |       8 |  tg128 |  15.69 ± 0.04 |
| llama 7B IQ3_XXS |   2.41 GiB |       8 |  tg128 |  18.02 ± 0.04 |
| llama 7B IQ2_M   |   2.20 GiB |       8 |  tg128 |  18.94 ± 0.03 |
| llama 7B IQ2_XS  |   1.89 GiB |       8 |  tg128 |  23.29 ± 0.02 |
| llama 7B IQ2_XXS |   1.73 GiB |       8 |  tg128 |  22.96 ± 0.09 |
                                    arithmetic mean:  19.78
                                    geometric  mean:  19.56

Without force-inlining, PP(iq3_xxs) drops to 98 t/s while
PP(iq2_xxs) increases to 137 t/s.
2024-06-22 12:02:49 +03:00
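
The force-inlining discussed above is normally done with a compiler attribute rather than the plain inline keyword, since the compiler is otherwise free to keep the helper as a call. A minimal sketch of the pattern (macro and function names are illustrative, not the actual source):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#  define ALWAYS_INLINE __forceinline
#else
#  define ALWAYS_INLINE inline __attribute__((always_inline))
#endif

// Force-inlined scalar stand-in for the SIMD sign helper: flip the sign of
// value i when bit i of `signs` is set. Whether the compiler inlines such
// helpers is exactly the knob being traded off above.
ALWAYS_INLINE void apply_signs(float * values, uint32_t signs, int n) {
    for (int i = 0; i < n; ++i) {
        if (signs & (1u << i)) values[i] = -values[i];
    }
}
```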
Kawrakow
eb9e2b628a iqk_mul_mat: experimenting with zen4 (iq2_xxs)
Observing again the weirdness of a performance drop
in one quant because of a change in another quant.
After I added FANCY_SIMD implementations for
iq3_s, iq2_s and iq2_xs, I'm observing that
iq2_xxs PP performance dropped to 130 t/s from 139 t/s.
Adding a FANCY_SIMD implementation for applying the signs
brings it back to 137 t/s and gives a small boost
for TG as well (23.4 vs 23.0 t/s).
2024-06-22 12:02:49 +03:00
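
"FANCY_SIMD" here roughly means the AVX512VL/AVX512BW instructions available on Zen4; applying packed signs can then be a single masked subtract instead of building a separate ±1 multiplier vector. A minimal sketch of that idea (illustrative, not the actual iq2_xxs kernel):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Negate the bytes of v whose corresponding bit in `signs` is set, using an
// AVX-512 (VL+BW) masked subtract: result = signs ? 0 - v : v.
// On plain AVX2 the same effect needs an extra +/-1 vector and a multiply.
static inline __m256i apply_signs_epi8(__m256i v, uint32_t signs) {
    return _mm256_mask_sub_epi8(v, (__mmask32)signs, _mm256_setzero_si256(), v);
}
```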
Kawrakow
2c8d3dad1f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Kawrakow
0d9027fe74 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Kawrakow
ed8f1fe490 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
01d55dcbf0 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
d4e9e595f9 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Kawrakow
41391ff4b0 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Kawrakow
be132341f5 iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
3c448906bf iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
f31200bde1 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Kawrakow
3f90520d1f iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Kawrakow
24ccf42a4f iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Kawrakow
32f20a1b9b Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Kawrakow
7235135c3e iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Kawrakow
482dd30382 Simplify 2024-06-22 12:02:49 +03:00
Kawrakow
6aa7ac9cd3 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
d041c81b1d iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
3fe4e1b27c iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Kawrakow
4c0920cb1b Add Q8_0 2024-06-22 12:02:49 +03:00
Kawrakow
62122c1950 Cosmetics 2024-06-22 12:02:49 +03:00
Kawrakow
fb8bc26dc5 iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get a ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22 12:02:49 +03:00
Kawrakow
a18a564e54 iqk_mul_mat: faster q3_K TG
We get 31 t/s, up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
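
The PP/TG split boils down to the number of right-hand-side columns: prompt processing multiplies against many activation columns at once, while token generation is effectively a matrix-vector product. A minimal sketch of dispatching on that at compile time (names and structure are illustrative, not the actual iqk_mul_mat templates):

```cpp
// Illustrative only: a kernel templated on the number of RHS columns, so the
// nrc_y == 1 (TG) case can be specialized without hurting the PP path.
template <int nrc_y>
static void mul_mat_qX_K(int n, const void * vx, const float * vy, float * dst) {
    if constexpr (nrc_y == 1) {
        // token generation: single activation column, keep everything in registers
    } else {
        // prompt processing: nrc_y columns, trade registers for data reuse
    }
    (void)n; (void)vx; (void)vy; (void)dst;
}

// Runtime entry point picks the specialization once per call.
static void mul_mat_dispatch(int n, const void * vx, const float * vy, float * dst, int nrc_y) {
    switch (nrc_y) {
        case 1:  mul_mat_qX_K<1>(n, vx, vy, dst); break;
        case 8:  mul_mat_qX_K<8>(n, vx, vy, dst); break;
        default: /* fall back to a generic path */ break;
    }
}
```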
Kawrakow
d434b4751a iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
Clint Herron
9fa7946997 JSON Schema to GBNF integration tests (#7790)
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
2024-06-21 23:18:36 -04:00
k.h.lai
d34e2e8860 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query
2024-06-21 10:28:20 +02:00
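
VkPhysicalDeviceProperties::deviceID can be identical for two GPUs of the same model, whereas the 16-byte deviceUUID from VkPhysicalDeviceIDProperties is unique per physical device. A minimal sketch of that query (device enumeration and error handling omitted; not the exact code from the PR):

```cpp
#include <vulkan/vulkan.h>
#include <array>

// Return the 16-byte deviceUUID of a physical device. Unlike deviceID, the
// UUID differs between two otherwise identical GPUs, so it can be used to
// de-duplicate devices. Requires Vulkan 1.1.
static std::array<uint8_t, VK_UUID_SIZE> get_device_uuid(VkPhysicalDevice dev) {
    VkPhysicalDeviceIDProperties id_props = {};
    id_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &id_props;

    vkGetPhysicalDeviceProperties2(dev, &props2);

    std::array<uint8_t, VK_UUID_SIZE> uuid{};
    for (size_t i = 0; i < VK_UUID_SIZE; ++i) uuid[i] = id_props.deviceUUID[i];
    return uuid;
}
```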
Eve
7ccc0cb46d ggml : AVX IQ quants (#7845)
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply
2024-06-21 08:57:36 +03:00
Georgi Gerganov
46e0320612 llama : optimize long word tokenization with WPM (#8034)
ggml-ci
2024-06-21 08:51:28 +03:00
Douglas Hanley
a895a1b78e llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow specifying attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
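
Pooling collapses the per-token embeddings into a single vector per sequence: mean pooling averages over tokens, last-token pooling keeps only the final row. A minimal CPU-side sketch of both, assuming a row-major [n_tokens x n_embd] buffer (the actual change builds ggml graph ops instead):

```cpp
#include <vector>
#include <cstddef>

// embd is row-major: n_tokens rows of n_embd floats, n_tokens > 0.
static std::vector<float> pool_mean(const float * embd, int n_tokens, int n_embd) {
    std::vector<float> out(n_embd, 0.0f);
    for (int t = 0; t < n_tokens; ++t) {
        for (int i = 0; i < n_embd; ++i) out[i] += embd[(size_t)t * n_embd + i];
    }
    for (int i = 0; i < n_embd; ++i) out[i] /= n_tokens;
    return out;
}

static std::vector<float> pool_last(const float * embd, int n_tokens, int n_embd) {
    const float * last = embd + (size_t)(n_tokens - 1) * n_embd;
    return std::vector<float>(last, last + n_embd);
}
```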
Shuichi Tsutsumi
7ab016f973 swiftui : enable stream updating (#7754) 2024-06-21 08:30:58 +03:00
Hamdoud Hakem
4fb22fa139 requirements : Bump torch and numpy for python3.12 (#8041) 2024-06-20 22:01:15 +02:00
Hamdoud Hakem
e767e20fc6 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) 2024-06-20 21:59:59 +02:00
Johannes Gäßler
5b4e0a2a38 common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20 16:40:13 +02:00
luoyu-intel
20a2d77aa2 [SYCL] Fix windows build and inference (#8003)
* add sycl preset

* fix debug link error. fix windows crash

* update README
2024-06-20 21:19:05 +08:00
Johannes Gäßler
24dfdbb1a3 CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices
2024-06-20 14:39:21 +02:00
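
Stream-k splits the flattened tile-by-k-iteration space evenly across the SMs instead of handing out whole output tiles, so the final wave no longer leaves most of the GPU idle; slices that end mid-tile produce partial sums that a fix-up pass reduces. A minimal host-side sketch of the partitioning arithmetic only (the MMQ kernels themselves are far more involved):

```cpp
#include <cstdio>

// Sketch of stream-k work partitioning: the unit of work is one k-iteration of
// one output tile. Each of the n_sm workers gets a contiguous slice of the
// flattened (tile, k_iter) space; slices that end mid-tile produce partial
// results that a fix-up pass must reduce.
int main() {
    const int n_tiles = 10;   // output tiles of the GEMM
    const int k_iters = 16;   // k-loop iterations per tile
    const int n_sm    = 4;    // workers (SMs / thread blocks)

    const long total = (long)n_tiles * k_iters;
    for (int s = 0; s < n_sm; ++s) {
        const long begin = total *  s      / n_sm;
        const long end   = total * (s + 1) / n_sm;
        printf("worker %d: tile %ld (from k-iter %ld) .. tile %ld (up to k-iter %ld)\n",
               s, begin / k_iters, begin % k_iters,
               (end - 1) / k_iters, (end - 1) % k_iters + 1);
    }
    return 0;
}
```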
Michael de Gans
4f46967577 metal : fix ggml_metal_supports_op for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks if the first few source types are BF16 and returns false if that's the case.
2024-06-20 08:32:01 +03:00
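
The guard amounts to rejecting an op whose inputs are BF16 before it reaches a Metal kernel with no BF16 variant. A minimal sketch of such a check, assuming ggml's public tensor layout (op->src[] and type); the real logic lives inside ggml_metal_supports_op:

```cpp
#include "ggml.h"

// Sketch of the guard: report an op as unsupported if any of its inputs is
// BF16, since the Metal kernels here have no BF16 variants.
static bool op_inputs_are_bf16_free(const ggml_tensor * op) {
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (op->src[i] && op->src[i]->type == GGML_TYPE_BF16) {
            return false;
        }
    }
    return true;
}
```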
sasha0552
c7d9dd7634 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Michael de Gans
9d63d2b978 un-ignore build-info.cmake and build-info.sh (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent, even if they're committed, for the purpose of publishing. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for prettiness
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
2024-06-19 22:10:42 +02:00
slaren
028d6b31c6 ggml : synchronize threads using barriers (#7993) 2024-06-19 15:04:15 +02:00
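
A barrier makes every compute thread wait at the same point before moving on to the next graph node, instead of each thread spinning on its own per-node state. A minimal sketch of a counter-plus-generation spin barrier of the kind used for CPU thread pools (illustrative, not the actual ggml code):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Counter + generation spin barrier: the last thread to arrive bumps the
// generation, releasing everyone else.
struct spin_barrier {
    explicit spin_barrier(int n) : n_threads(n) {}

    void wait() {
        const int gen = generation.load(std::memory_order_relaxed);
        if (arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == n_threads) {
            arrived.store(0, std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release); // release the others
        } else {
            while (generation.load(std::memory_order_acquire) == gen) {
                std::this_thread::yield();
            }
        }
    }

    const int n_threads;
    std::atomic<int> arrived{0};
    std::atomic<int> generation{0};
};

int main() {
    spin_barrier barrier(4);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&, t]() {
            // ... work for the current graph node ...
            barrier.wait();          // no thread starts the next node early
            std::printf("thread %d moved to the next node\n", t);
        });
    }
    for (auto & th : threads) th.join();
    return 0;
}
```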
Georgi Gerganov
efc3d09e43 codecov : remove (#8004) 2024-06-19 13:04:36 +03:00
Meng, Hengyu
ce37982f07 [SYCL] refactor (#6408)
* separate lower-precision GEMM from the main files

* fix workgroup size hardcode
2024-06-19 09:11:51 +08:00
jaime-m-p
b8114be2fd tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00
Sigbjørn Skjæret
083d5edc87 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00