Commit Graph

2173 Commits

Author SHA1 Message Date
Daniel Bevenius
787dbf4d0f llava : update surgery script to not remove tensors (#5536)
This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.

The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and have to be checked out again to reset this effect. If this can be
avoided I think it would be preferable.

I did not perform this change for BakLLaVA models as I am not sure
how that part works.
2024-02-18 18:19:23 +02:00
Kawrakow
fa40433c9d 1.5 bit quantization (#5453)
* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-18 18:16:55 +02:00
github-actions[bot]
b02abf3383 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
  → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
2024-02-18 06:39:58 -08:00
Georgi Gerganov
a5c73a0e9d ggml : add ALiBi support for ggml_soft_max_ext (#5488)
* ggml : avoid recomputing alibi slopes (CPU)

* llama : reuse hparams.f_max_alibi_bias in all cases

ggml-ci

* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)

ggml-ci

* ggml : handle all SRCs (do not break on first null)

ggml-ci

* tests : do not use slope for large soft_max

accumulates too much error

ggml-ci

* ggml : alternative ALiBi without extra tensor

We compute the slopes in the kernel

ggml-ci

* cuda : add ALiBi support in ggml_soft_max_ext

ggml-ci

* ggml : deprecate ggml_alibi

* ggml : support multi-sequence ALiBi (Metal)

ggml-ci

* cuda : add multi-seq ALiBi + remote F16 soft_max

ggml-ci

* ggml : update deprecation message

* ggml : fix pos ptr when no ALiBi

ggml-ci

* cuda : fix performance (pow -> powf)

* cuda : precompute ALiBi constants

* metal : pre-compute ALiBi slopes

ggml-ci

* llama : init kq_pos only if needed

ggml-ci

* test-backend-ops : add null pos test to soft_max

test-backend-ops : replace soft_max tests

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-17 23:04:16 +02:00
Ananta Bastola
27a984488c ci : add an option to fail on compile warning (#3952)
* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-17 23:03:14 +02:00
clibdev
5c46b0e0bc gitignore : update for CLion IDE (#5544) 2024-02-17 18:28:37 +02:00
Georgi Gerganov
8f20d4fb3e cmake : fix VULKAN and ROCm builds (#5525)
* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci
2024-02-16 19:05:56 +02:00
Georgi Gerganov
b6ebf5b6ff scripts : add helpers script for bench comparing commits (#5521)
* scripts : add helpers script for bench comparing commits

* scripts : detect CUDA

* set flags after checking the command line

* fix make flags

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-16 15:14:40 +02:00
Herman Semenov
26bfe98833 llava : removed excess free(NULL) operation (#5531) 2024-02-16 14:43:23 +02:00
Herman Semenov
f54ed222bf llama : minor fixed return int value (#5529) 2024-02-16 13:45:48 +02:00
Alexey Parfenov
14c96c2c4c server : add "samplers" param to control the samplers order (#5494) 2024-02-16 13:33:25 +02:00
Rőczey Barnabás
f183700c25 server : fix system prompt cli (#5516) 2024-02-16 12:00:56 +02:00
bmwl
4bc5d852a2 ggml : add numa options (#5377)
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remote enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

align paremeters

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-16 11:31:07 +02:00
Daniel Bevenius
384e6bbbe7 llava : fix clip-model-is-vision flag in README.md (#5509)
* llava: fix clip-model-is-vision flag in README.md

This commit fixes the flag `--clip_model_is_vision` in README.md which
is does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
  --clip-model-is-vision
                        The clip model is a pure vision model
                        (ShareGPT4V vision extract for example)
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llava: update link to vit config in README.md

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-16 11:24:39 +02:00
Georgi Gerganov
adfe875b17 ci : fix BERT model download and convert 2024-02-16 09:57:55 +02:00
Douglas Hanley
9bc28075c1 Use correct type of pooling for embedding models (#5500)
Use correct type of pooling for embedding models
2024-02-15 12:21:49 -05:00
Georgi Gerganov
fc7f903883 clip : fix wrong loop condition 2024-02-15 18:49:08 +02:00
slaren
a8d3e145cc cuda : print message when initialization fails (#5512)
* cuda : print message when initialization fails

* use CUDA_NAME both times
2024-02-15 16:49:01 +01:00
Georgi Gerganov
31e77c5029 scripts : add hf.sh helper script (#5501)
* scripts : add hf.sh helper scripts

* hf : add error logs

* hf : add support for --repo and --file
2024-02-15 15:41:15 +02:00
Michaël de Vries
e56fe7b5ca fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487)
* fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false

* fix(gguf-py): added missing cls and mask token ids to the gguf metadata
2024-02-15 14:14:37 +01:00
Elbios
e4880f18f3 llava : fix memory management bug (#5491)
* Fix memory management in llava and server code

Fixes this error:

llama_new_context_with_model: graph splits (measure): 3
Available slots:
 -> Slot 0 - max context: 6000
{"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 - loaded image
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 1]
munmap_chunk(): invalid pointer
Aborted

* Make it cleaner by checking size in batch free wrapper
2024-02-15 10:01:57 +02:00
John
66faa7f5d2 llaba : hotfix for llava-1.6 image number (#5495)
Co-authored-by: John <cmt-nct@users.noreply.github.com>
2024-02-15 09:59:18 +02:00
Neuman Vong
a5579e4f93 vulkan: Find optimal memory type but with fallback (#5381)
* @0cc4m feedback

* More feedback @0cc4m
2024-02-15 07:11:15 +01:00
Rune
de06c28a15 readme : fix typo (#5490)
executabhle -> executable
2024-02-14 17:15:49 +02:00
John
07687d7350 llava : update README.md (#5489)
* Update README.md

* Update README.md

* Update examples/llava/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-14 16:49:42 +02:00
Michael Podvitskiy
f5c9547e67 cmake : ARM intrinsics detection for MSVC (#5401) 2024-02-14 10:49:01 +02:00
John
4351148229 llava : support v1.6 (#5267)
* Create llava-survery-v2.py

* Update convert-image-encoder-to-gguf.py

* Update convert-image-encoder-to-gguf.py

* Rename llava-survery-v2.py to llava-surgery-v2.py

* Update convert-image-encoder-to-gguf.py

will now search for projector

* Update convert-image-encoder-to-gguf.py

whoops

* Update llava-surgery-v2.py

* Clip: Bugfix for normalization (it did not loat the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
convert-image-encoder: fixed image-grid flattening

* whitespace corrections

* ws

* Tensors are now properly permuted.
Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.

* ws

* added verbose_prompt support into cli
added stopwords for llava-1.6 into cli

* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed

* ws

* convert : skip unknown tensors (need for LLaVA)

* llava : update readme

* llava : fix compile warnings

* llava : style

* convert : add --skip-unknown CLI arg

* server : remove clip structs

* bugfix for non llava-1.6

It should now work with llava-1.5 as well

* clip : minor code rearrange

* llava : update readme a bit

---------

Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-14 09:38:35 +02:00
AT
0aa506cb97 Early return for zero size calls to get_tensor. (#5482)
* Early return for zero size calls to get_tensor.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add an early return to the get/set tensor when the size is null.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Early return after the assertions.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

* Since we do the early return in the generic backend now no reason to do so here as well.

Signed-off-by: Adam Treat <treat.adam@gmail.com>

---------

Signed-off-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-13 22:44:25 +01:00
John
3872f9b23f gguf : add python reader example (#5216)
* Update CMakeLists.txt

* Create reader.py

* Update reader.py

* Update reader.py

another whitespace :|

* Update reader.py

* lintlintlint
2024-02-13 19:56:38 +02:00
Jared Van Bortel
f79cb016f4 llama : add support for Nomic Embed (#5468) 2024-02-13 12:03:53 -05:00
Aarni Koskela
d03fb522e5 llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478)
* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs
2024-02-13 18:18:16 +02:00
Aarni Koskela
604cdd2d78 llama : make load error reporting more granular (#5477)
Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.
2024-02-13 15:24:50 +02:00
Daniel Bevenius
49cc8b2c60 finetune : rename feed-forward tensors (w1/w2/w3) (#4839)
* finetune: rename feed-forward tensors (w1/w2/w3)

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* train-text-from-scratch: rename ff tensors

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-13 15:15:42 +02:00
Georgi Gerganov
b8667a8447 tests : multi-thread the tokenizer tests (#5474)
* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci
2024-02-13 15:14:22 +02:00
Douglas Hanley
f54932ce66 llama : support batched embeddings (#5466)
* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-13 14:06:58 +02:00
Johannes Gäßler
990dfea10c make: add error message for bad CUDA version (#5444)
* make: add error message for bad CUDA version

* Update Makefile

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-13 12:38:37 +01:00
Georgi Gerganov
a6747dccc5 bert : add tests + fix quantization (#5475)
* llama : do not quantize pos embd and token type tensors

* ci : add BERT tests

ggml-ci

* ci : do not do BERT tests on low-perf nodes

ggml-ci
2024-02-13 13:01:29 +02:00
Georgi Gerganov
fea77814c7 tests : disable moe test (#5473) 2024-02-13 11:20:24 +02:00
Kawrakow
a0074519b4 ggml-quants : fix compiler warnings (shadow variable) (#5472)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-13 09:07:57 +02:00
Georgi Gerganov
fc01cc08e5 llama : fix quantization when tensors are missing (#5423) 2024-02-12 20:14:39 +02:00
Georgi Gerganov
6858a4fa83 swift : package no longer use ggml dependency (#5465)
* Revert "swift : update Package.swift to use ggml as dependency (#4691)"

This reverts commit ece9a45e8f.

* spm : add ggml headers
2024-02-12 19:54:29 +02:00
Lee
6a83825ce0 py : fix persimmon n_rot conversion (#5460)
* convert : fix persimmon offical weight conversion to write correct n_rot.

* Update convert-persimmon-to-gguf.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-12 19:29:57 +02:00
Abhilash Majumder
16dba45e9c ggml-sycl: Replace 3d ops with macro (#5458)
* use macro

* use macro

* fix format
2024-02-12 20:22:05 +05:30
Daniel Bevenius
5270994e6d llava : remove prog parameter from ArgumentParser (#5457)
* llava: remove prog parameter from ArgumentParser

This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value which is the name of the script.

The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* ci: add W503 to flake8 ignore list

This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-12 10:38:44 +02:00
Georgi Gerganov
0ca4e0c14c sync : ggml (#5452)
* ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

* sync : ggml

* update llama.cpp, clip.cpp, export-lora.cpp

* update finetune.cpp, train-text-from-scratch.cpp

ggml-ci

* ggml-backend : reduce alignment to 32 to match gguf and fix mmap

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-12 09:16:06 +02:00
Johannes Gäßler
8377d606b8 CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)
* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-11 19:08:39 +01:00
Douglas Hanley
2dc6eef6da Add support for BERT embedding models (#5423)
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11 11:21:38 -05:00
github-actions[bot]
713ce99422 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
2024-02-11 07:50:41 -08:00
Sergio López
68e7918281 vulkan: only use M-sized matmul on Apple GPUs (#5412)
* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <slp@redhat.com>

* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-02-11 15:12:00 +01:00
Alexey Parfenov
f2f1af418b common : use enums for sampler types (#5418)
* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11 15:43:31 +02:00