Commit Graph

869 Commits

Author SHA1 Message Date
liuwei-git
c1a6ad7577 llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Olivier Chafik
287fa980b8 grammars: fix resampling logic regression (#7424) 2024-05-21 20:40:00 +01:00
Amir
e205f11bbc examples: cache hf model when --model not provided (#7353)
* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
jaime-m-p
0dbe001317 Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
49a32c0167 Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Johannes Gäßler
a2a24aec6f perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
Georgi Gerganov
31b2d6e05b server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
Georgi Gerganov
c930c28bec server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
9cc3a7c871 server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Georgi Gerganov
8a5e27cbd7 tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
Fred Douglas
f43b1eb190 quantize : fix --keep-split check (#7374) 2024-05-19 19:37:04 +03:00
Johannes Gäßler
a742a54fd0 server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
Johannes Gäßler
9ae757d0b5 server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Georgi Gerganov
beb87b0aed cmake : update android comments (#7341) 2024-05-19 11:01:01 +03:00
Georgi Gerganov
ae3045fe3f android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
97cd158809 server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
strawberrymelonpanda
048941c1ee perplexity : ndot progress and show stats with < 100 tasks (#7348)
Fix floating point error with ndot printing, allow end stats on lower task numbers if multiple-choice tasks.
2024-05-18 10:57:08 +03:00
Radoslav Gerganov
da03acb735 rpc : set SO_REUSEADDR for the server socket (#7320)
ref: #7293
2024-05-17 17:25:44 +03:00
Radoslav Gerganov
d0b0e30ad0 server : add support for the RPC backend (#7305)
ref: #7292
2024-05-17 10:00:17 +03:00
Leon Knauer
7762ef55ba [Server] Added --verbose option to README [no ci] (#7335) 2024-05-17 10:11:03 +10:00
Pierrick Hymbert
99d7a99cef Revert "server bench: fix bench not waiting for model load (#7284)" (#7334)
This reverts commit 583fd6b000.
2024-05-16 20:43:45 +02:00
Radoslav Gerganov
fe34112740 rpc : get available mem for the CPU backend
This can be overridden with the -m command line option

ref: #7293
2024-05-16 12:04:08 +03:00
Radoslav Gerganov
0b19253ad5 rpc : add command line arg for specifying backend memory
ref: #7293
2024-05-16 09:58:29 +03:00
Vaibhav Srivastav
756bbb6560 doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)
* chore: add references to the quantisation space.

* fix grammer lol.

* Update README.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
slaren
f3f290e25e ggml : tag ggml_tensor::backend as deprecated (#7290) 2024-05-15 15:08:48 +02:00
dm4
0b96b615fc embedding : free the batch after execution (#7297) 2024-05-15 15:01:12 +03:00
Johannes Gäßler
cbd77f7974 server bench: fix bench not waiting for model load (#7284) 2024-05-15 08:44:16 +02:00
Steve Grubb
ef87ca916e server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
Brian
e115936501 Revert "move ndk code to a new library (#6951)" (#7282)
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov
af81b28dbf ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
Elton Kola
8536fbb140 move ndk code to a new library (#6951) 2024-05-14 17:30:30 +10:00
Ryuei
dc91e1430e docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
k.h.lai
e0baf1aca7 llava-cli: fix base64 prompt (#7248) 2024-05-14 00:02:36 +10:00
Johannes Gäßler
2d2147923e perplexity: add BF16 vs. FP16 results (#7150) 2024-05-13 13:03:27 +02:00
Benjamin Findley
9b9b4eb946 change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
Xuan Son Nguyen
a00bd6e049 fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00
Steve Grubb
2a4289e314 server : free llama_batch on exit (#7212)
* [server] Cleanup a memory leak on exit

There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.

* make tab into spaces
2024-05-11 11:13:02 +03:00
Johannes Gäßler
70a18260b2 server: fix reported top tokens for temperature 0 (#7203) 2024-05-11 10:11:28 +02:00
Joan Fontanals
bbd8009868 llama : add Jina Embeddings architecture (#6826)
* feat: first things to do

* feat: create tensors for Jina architecture

* fix: use other tensors

* feat: embedding gets results

* fix: fix usage of ALIBI

* fix: clean prints

* fix: do some cleanup unused vars

* fix: revert changes to Makefile and CMakeLists

* fix: revert some changes

* fix: fix small detail

* fix: fix convert formatting

* fix: fix linting and editor

* feat: set proper vocab settings

* fix: JinaBertForMaskedLM registration

* feat: support q_normalization and k_normalization in Jina arch

* feat: handle gpt2 tokenizer with Jina architecture

* feat: example comments in embedding

* feat: rename Jina Bert to Jina Bert V2

* fix: add some changes as per review

* feat: proper KQ_pos for Jina embeddings

* feat: add capacity to load models ES and DE for Spanish

* llama : fix pre-tokenizers

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* minor : clean-up

* embedding : add warning about missing SEP

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-11 10:46:09 +03:00
slaren
541cbc7081 llama-bench : add pp+tg test type (#7199) 2024-05-10 18:03:54 +02:00
Justine Tunney
e7cdac9331 Fix memory bug in grammar parser (#7194)
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug

    ./llamafile -m foo.gguf -p bar --grammar 'root::="'

Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
2024-05-10 21:01:08 +10:00
HanishKVC
46b54e6d5c Main+: optionally allow special tokens from user in interactive mode (#7097)
@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.
2024-05-10 20:21:58 +10:00
Andrei
aa76c59db1 llava : fix moondream support (#7163)
* Revert "Revert "llava : add support for moondream vision language model (#6899)""

This reverts commit 9da243b36a.

* Fix num_positions and embeddings initialization
2024-05-10 09:41:10 +03:00
slaren
abad588dfd eval-callback : fix conversion to float (#7184) 2024-05-10 01:04:12 +02:00
Ahmet Zeer
f1d4e6faea TypoFix (#7162) 2024-05-09 10:16:45 +02:00
compilade
282537cab8 convert-hf : save memory with lazy evaluation (#7075)
* convert-hf : begin refactoring write_tensor

* convert : upgrade to sentencepiece v0.2.0

* convert-hf : remove unused n_dims in extra_*_tensors

* convert-hf : simplify MoE weights stacking

* convert-hf : flake8 linter doesn't like semicolons

* convert-hf : allow unusual model part names

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.

* convert : use a string for the SentencePiece tokenizer path

* convert-hf : display tensor shape

* convert-hf : convert norms to f32 by default

* convert-hf : sort model part names

`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".

* convert-hf : use an ABC for Model again

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

* convert-hf : use a plain class for Model, and forbid direct instantiation

There are no abstract methods used anyway,
so using ABC isn't really necessary.

* convert-hf : more consistent formatting of cmdline args

* convert-hf : align the message logged for converted tensors

* convert-hf : fix Refact conversion

* convert-hf : save memory with lazy evaluation

* convert-hf : flake8 doesn't like lowercase L as a variable name

* convert-hf : remove einops requirement for InternLM2

* convert-hf : faster model parts loading

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.

* convert-hf : minor changes for consistency

* gguf-py : add tqdm as a dependency

It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
2024-05-08 18:16:38 -04:00
Johannes Gäßler
4e708b6d16 JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143) 2024-05-08 21:53:08 +02:00
Georgi Gerganov
93ad6ba476 Revert "llava : add support for moondream vision language model (#6899)"
This reverts commit 46e12c4692.
2024-05-08 22:14:39 +03:00
JohnnyB
eebaec0788 server : add themes + favicon (#6848)
* Added themes support with two sample themes and a favicon.

* Newline

* Newline

* Newline

* Trailing whitespace

* Increased opacity for contrast

* Increase opacity.

Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY

* Opacity action trigger.

Trying to re-trigger the cancelled action.

* One more opacity adjustment

This Actions pipeline is failing for random issues.

* Delete examples/server/themes/buttons_top/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Replaced underscore.
2024-05-08 22:12:06 +03:00
Dawid Potocki
212982e41d main : add --conversation / -cnv flag (#7108) 2024-05-08 17:32:32 +03:00