Commit Graph

3227 Commits

Author SHA1 Message Date
Iwan Kawrakow
8e072bbba3 iqk_mul_mat: multi-thread quantization also for MoE models 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
667bd4759c iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
2ee56b4f0d iqk_mul_mat: minor improvements
Current performance:
| model             |       size |  threads |    test |              t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S    |   2.75 GiB |       16 |   pp512 |    100.21 ± 0.32 |
| llama 7B IQ3_XXS  |   2.41 GiB |       16 |   pp512 |    105.25 ± 0.75 |
| llama 7B IQ2_M    |   2.20 GiB |       16 |   pp512 |    117.88 ± 0.15 |
| llama 7B IQ2_XS   |   1.89 GiB |       16 |   pp512 |    136.38 ± 0.24 |
| llama 7B IQ2_XXS  |   1.73 GiB |       16 |   pp512 |    128.47 ± 0.39 |
                                                     mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS  |   1.73 GiB |        8 |   tg128 |     23.94 ± 0.04 |
| llama 7B IQ2_XS   |   1.89 GiB |        8 |   tg128 |     23.27 ± 0.03 |
| llama 7B IQ2_M    |   2.20 GiB |        8 |   tg128 |     18.88 ± 0.03 |
| llama 7B IQ3_XXS  |   2.41 GiB |        8 |   tg128 |     19.07 ± 0.04 |
| llama 7B IQ3_S    |   2.75 GiB |        8 |   tg128 |     15.44 ± 0.05 |
                                                     mean:  20.12
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
0ad646b9f0 iqk_mul_mat: no more templates in the IQ dequantizers
Also moved the quant specific code from the EvenSignHelper
into the corresponding dequantizers.

These two changes had a tiny performance benefit (much smaller
than what I was expecting/hoping for).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
e35a14ff5f iqk_mul_mat: remove template on one of the prepare() functions 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
e67626533c iqk_mul_mat: experimenting with zen4
Nope, we cannot have good performance for iq2_xxs and
iq3_xxs at the same time. If I don't force-inline
the sign functions, I get better performance for iq2_xxs
and bad performance for iq3_xxs. If I force-inline them,
it is the other way around. Anyway, this is what we have
now on Zen4 for all quants with force-inlined EvenSignHelper
methods:

| model            |       size | threads |   test |           t/s |
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |      16 |  pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS |   2.41 GiB |      16 |  pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M   |   2.20 GiB |      16 |  pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS  |   1.89 GiB |      16 |  pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS |   1.73 GiB |      16 |  pp512 | 125.53 ± 0.06 |
                                    arithmetic mean: 116.29
                                    geometric  mean: 115.70
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |       8 |  tg128 |  15.69 ± 0.04 |
| llama 7B IQ3_XXS |   2.41 GiB |       8 |  tg128 |  18.02 ± 0.04 |
| llama 7B IQ2_M   |   2.20 GiB |       8 |  tg128 |  18.94 ± 0.03 |
| llama 7B IQ2_XS  |   1.89 GiB |       8 |  tg128 |  23.29 ± 0.02 |
| llama 7B IQ2_XXS |   1.73 GiB |       8 |  tg128 |  22.96 ± 0.09 |
                                    arithmetic mean:  19.78
                                    geometric  mean:  19.56

Without force-inlining, PP(iq3_xxs) drops to 98 t/s while
PP(iq2_xxs) increases to 137 t/s.
2024-06-22 12:02:49 +03:00
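For reference, force-inlining such helpers is normally done with a compiler-specific attribute that overrides the inlining heuristics, which is exactly the knob this trade-off hinges on. A minimal generic sketch; the `ALWAYS_INLINE` macro and the trivial helper below are hypothetical stand-ins, not this repo's code:

```cpp
// build: g++ -O2 force_inline_sketch.cpp
#include <cstdio>

// Compiler-specific force-inline attribute: bypasses the compiler's own
// cost model for whether a small helper gets inlined.
#if defined(_MSC_VER)
#  define ALWAYS_INLINE __forceinline
#else
#  define ALWAYS_INLINE inline __attribute__((always_inline))
#endif

struct SignHelperSketch {                        // hypothetical stand-in helper
    // -1 for negative x, +1 otherwise (assumes an arithmetic right shift)
    static ALWAYS_INLINE int sign(int x) { return (x >> 31) | 1; }
};

int main() {
    printf("%d %d\n", SignHelperSketch::sign(-5), SignHelperSketch::sign(7));
    return 0;
}
```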
Iwan Kawrakow
47ae12bbec iqk_mul_mat: experimenting with zen4 (iq2_xxs)
Observing again the weirdness of a performance drop
in one quant caused by a change in another quant.
After I added FANCY_SIMD implementations for
iq3_s, iq2_s and iq2_xs, I'm observing that
iq2_xxs PP performance dropped to 130 t/s from 139 t/s.
Adding a FANCY_SIMD implementation for applying the signs
brings it back to 137 t/s and gives a small boost
for TG as well (23.4 vs 23.0 t/s).
2024-06-22 12:02:49 +03:00
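The commit above is about applying packed sign bits to the quant values. As a generic illustration of that pattern only, here is a self-contained plain-AVX2 sketch using `_mm256_sign_epi8`; it is not this repo's code, and the FANCY_SIMD path presumably uses AVX-512 mask registers instead:

```cpp
// build: g++ -O2 -mavx2 sign_apply_sketch.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Expand 32 packed sign bits (1 = negative) into a +/-1 byte mask and apply it
// to 32 int8 values in one shot; illustrative only.
static inline __m256i apply_signs(__m256i values, uint32_t signs) {
    const __m256i shuf = _mm256_set_epi64x(0x0303030303030303LL, 0x0202020202020202LL,
                                           0x0101010101010101LL, 0x0000000000000000LL);
    const __m256i bits = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
    __m256i s = _mm256_set1_epi32((int)signs);
    s = _mm256_shuffle_epi8(s, shuf);                 // byte i now holds sign byte i/8
    s = _mm256_and_si256(s, bits);                    // isolate bit i%8 within byte i
    const __m256i neg = _mm256_cmpeq_epi8(s, bits);   // 0xFF where the sign bit is set
    const __m256i pm1 = _mm256_or_si256(neg, _mm256_set1_epi8(1)); // -1 or +1 per byte
    return _mm256_sign_epi8(values, pm1);             // negate the flagged lanes
}

int main() {
    int8_t out[32];
    _mm256_storeu_si256((__m256i *)out, apply_signs(_mm256_set1_epi8(3), 0x0000000Fu));
    for (int i = 0; i < 8; ++i) printf("%d ", out[i]);  // -3 -3 -3 -3 3 3 3 3
    printf("\n");
    return 0;
}
```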
Iwan Kawrakow
dc96d5484f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
cb063a2a20 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
61b8cc1ff6 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2a72d9f978 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3a6e3943a8 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
60f050d610 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
309e32405f iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
8015edb3cc iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b0071de081 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2c8c0d0a68 iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
34befcaf67 iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4f53915dcb Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4b27ade2fb iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
221a2c3807 Simplify 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
7dcca6aea7 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
effa4448d6 iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d2ee9ab95e iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
9ac9e928d5 Add Q8_0 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3f996d0c70 Cosmetics 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d7ab97149f iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get ~5% speedup for TG-128, 3X for PP-512
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b51922530f iqk_mul_mat: faster q3_K TG
We get 31 t/s, up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
19c578b413 iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
Clint Herron
c5a8d4b749 JSON Schema to GBNF integration tests (#7790)
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
2024-06-21 23:18:36 -04:00
k.h.lai
557b653dc9 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query
2024-06-21 10:28:20 +02:00
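For context, the deviceUUID lives in `VkPhysicalDeviceIDProperties` (core in Vulkan 1.1) and stays distinct even when several identical GPUs share the same vendor/device ID pair. A minimal stand-alone query sketch, not the PR's actual code:

```cpp
// build: g++ -O2 uuid_sketch.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_1;            // properties2 queries are core in 1.1
    VkInstanceCreateInfo ici{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceIDProperties id_props{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES};
        VkPhysicalDeviceProperties2 props2{VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2};
        props2.pNext = &id_props;                    // chain the ID properties query
        vkGetPhysicalDeviceProperties2(dev, &props2);

        printf("%s  UUID: ", props2.properties.deviceName);
        for (uint8_t b : id_props.deviceUUID) printf("%02x", b);
        printf("\n");
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```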
Eve
7d5e8777ae ggml : AVX IQ quants (#7845)
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply
2024-06-21 08:57:36 +03:00
Georgi Gerganov
a927b0f3dd llama : optimize long word tokenization with WPM (#8034)
ggml-ci
2024-06-21 08:51:28 +03:00
Douglas Hanley
80ea089d77 llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow specifying attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
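A hedged usage sketch of the new toggle, written from memory of the llama.h API of this period; the "model.gguf" path is a placeholder and the tokenize/decode step is elided:

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;   // mean-pool token embeddings
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_set_embeddings(ctx, true);   // switch the context into embedding mode

    // ... tokenize a prompt, llama_decode() it as sequence 0, then:
    // const float * emb = llama_get_embeddings_seq(ctx, 0);  // n_embd floats
    printf("embedding size: %d\n", llama_n_embd(model));

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```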
Shuichi Tsutsumi
0e64591e82 swiftui : enable stream updating (#7754) 2024-06-21 08:30:58 +03:00
Hamdoud Hakem
b1ef562bc1 requirements : Bump torch and numpy for python3.12 (#8041) 2024-06-20 22:01:15 +02:00
Hamdoud Hakem
17b291a6a5 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) 2024-06-20 21:59:59 +02:00
Johannes Gäßler
abd894ad96 common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20 16:40:13 +02:00
luoyu-intel
de391e4c80 [SYCL] Fix windows build and inference (#8003)
* add sycl preset

* fix debug link error. fix windows crash

* update README
2024-06-20 21:19:05 +08:00
Johannes Gäßler
d50f8897a7 CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices
2024-06-20 14:39:21 +02:00
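As background on the idea (not the PR's kernel): stream-k flattens the K-loops of all output tiles into one global iteration space and splits it evenly across a fixed number of workers, so the tail tiles no longer leave SMs idle; tiles split between workers are reconciled in a fix-up pass. A small host-side sketch of just the partitioning arithmetic:

```cpp
#include <cstdio>
#include <vector>

// Half-open range over the flattened global K-iteration space.
struct Slice { int first_iter, last_iter; };

std::vector<Slice> partition(int ntiles, int iters_per_tile, int nworkers) {
    const long long total = 1LL * ntiles * iters_per_tile;
    std::vector<Slice> slices(nworkers);
    for (int w = 0; w < nworkers; ++w) {
        slices[w].first_iter = (int)(total *  w      / nworkers);
        slices[w].last_iter  = (int)(total * (w + 1) / nworkers);
    }
    return slices;
}

int main() {
    // 7 output tiles, 10 K-iterations each, 4 workers: each worker gets 17 or 18
    // iterations and may start/stop mid-tile; tiles shared by two workers need a
    // later fix-up pass that sums the partial results.
    for (const Slice & s : partition(7, 10, 4)) {
        printf("iters [%d, %d) -> tiles %d..%d\n",
               s.first_iter, s.last_iter, s.first_iter / 10, (s.last_iter - 1) / 10);
    }
    return 0;
}
```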
Michael de Gans
2075a66a96 metal : fix ggml_metal_supports_op for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether any of the first few source types is BF16 and returns false if that's the case.
2024-06-20 08:32:01 +03:00
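A toy stand-alone sketch of that kind of guard; the `tensor_t` and `supports_op` names below are hypothetical stand-ins, not ggml's actual types or function:

```cpp
#include <cstdio>

// Hypothetical stand-ins trimmed to what the check needs.
enum type_t { TYPE_F32, TYPE_F16, TYPE_BF16 };
struct tensor_t { type_t type; const tensor_t * src[2]; };

// Decline the op when any of its first sources is BF16, so a backend without
// BF16 kernels falls back instead of crashing.
static bool supports_op(const tensor_t * op) {
    for (const tensor_t * s : op->src) {
        if (s && s->type == TYPE_BF16) return false;
    }
    return true;  // other per-op checks would follow here
}

int main() {
    tensor_t weight{TYPE_BF16, {nullptr, nullptr}};
    tensor_t input {TYPE_F32,  {nullptr, nullptr}};
    tensor_t matmul{TYPE_F32,  {&weight, &input}};
    printf("supports_op(matmul with BF16 weight) = %d\n", supports_op(&matmul));
    return 0;
}
```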
sasha0552
ba58993152 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Michael de Gans
a7854743c5 un-ignore build-info.cmake and build-info.sh (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent for the purpose of publishing, even if they're committed. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
2024-06-19 22:10:42 +02:00
slaren
9c77ec1d74 ggml : synchronize threads using barriers (#7993) 2024-06-19 15:04:15 +02:00
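As general background on the technique (a C++20 sketch of the barrier idea, not ggml's own portable implementation): each worker computes its slice of one graph node, then all threads meet at a barrier before any of them moves on to the next node, replacing ad-hoc spin loops.

```cpp
// build: g++ -std=c++20 -O2 -pthread barrier_sketch.cpp
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n_threads = 4;
    const int n_nodes   = 3;                    // pretend graph with 3 sequential ops
    std::barrier<> sync(n_threads);             // all threads must arrive to proceed

    auto worker = [&](int tid) {
        for (int node = 0; node < n_nodes; ++node) {
            // ... compute this thread's chunk of `node` here ...
            printf("thread %d finished node %d\n", tid, node);
            sync.arrive_and_wait();             // nobody starts node+1 early
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t) pool.emplace_back(worker, t);
    for (auto & th : pool) th.join();
    return 0;
}
```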
Georgi Gerganov
a04a953cab codecov : remove (#8004) 2024-06-19 13:04:36 +03:00
Meng, Hengyu
623494a478 [SYCL] refactor (#6408)
* separate lower precision GEMM from the main files

* fix workgroup size hardcode
2024-06-19 09:11:51 +08:00
jaime-m-p
37bef89433 tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00
Sigbjørn Skjæret
91c188d6c2 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
jojorne
84f6de17f6 Fix no gcc pragma on Windows (#7751) 2024-06-18 22:18:32 +10:00
Ulrich Drepper
61665277af Allow compiling with CUDA without CUDA runtime installed (#7989)
On hosts which are not prepared/dedicated to execute code using CUDA
it is still possible to compile llama.cpp with CUDA support by just
installing the development packages.  What is missing are the runtime
libraries like /usr/lib64/libcuda.so*, so currently the link step
fails.

The development environment is prepared for such situations.  There
are stub libraries for all the CUDA libraries available in the
$(CUDA_PATH)/lib64/stubs directory.  Adding this directory to the end
of the search path will not change anything for environments which
currently work fine but will enable compiling llama.cpp also in case
the runtime code is not available.
2024-06-18 14:00:14 +02:00