Commit Graph

831 Commits

Author SHA1 Message Date
Xuan Son Nguyen
4d5cc65823 server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available

* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
Georgi Gerganov
d23cd25dd0 imatrix : fix assert for src0 non-cont check 2024-01-18 21:45:51 +02:00
Georgi Gerganov
82b820d593 perplexity : fix winogrande N tasks option 2024-01-18 20:49:00 +02:00
Kawrakow
c7bb8e340a HellaSwag: speed up by parallelizing log-prob evaluation (#5020)
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 19:18:21 +02:00
Georgi Gerganov
f46c1ab2da perplexity : faster HellaSwag via batching (#5017)
* perplexity : faster HellaSwag

ggml-ci

* perplexity : clean-up

ggml-ci

* perplexity : no need for decode_helper

ggml-ci

* perplexity : add comments

* perplexity : option to specify max batched tasks via `n_parallel`

* perplexity : remove HellaSwag restruction for n_batch
2024-01-18 15:33:01 +02:00
Kawrakow
cc90bc53e3 Add Winogrande evaluation (#5015)
* winogrande: simple implementation

It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.

* winogrande: somewhat better

Score for Mistrali7-B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.

* winogrande: improving

Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 13:46:27 +02:00
Georgi Gerganov
2369070475 imatrix : offload to GPU support (#4957)
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* imatrix : offload to GPU support

* imatrix : fix ggml_mul_mat_id hanlding

ggml-ci

* ci : add imatrix test

ggml-ci

* ci : rearrange output

ggml-ci
2024-01-17 18:46:30 +02:00
Daniel Bevenius
0001889453 finetune : add training data file to log message (#4979)
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.

The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 19:54:24 +02:00
Maximilian Winter
09e6d92a82 examples : add complete parallel function calling example (#4974) 2024-01-16 19:41:42 +02:00
Georgi Gerganov
f985a114ca perplexity : fix kv cache handling for hellaswag (#4981)
ggml-ci
2024-01-16 19:34:54 +02:00
Neuman Vong
0641f5e160 android : introduce starter project example (#4926)
* Introduce starter project for Android

Based on examples/llama.swiftui.

* Add github workflow

* Set NDK version

* Only build arm64-v8a in CI

* Sync bench code

* Rename CI prop to skip-armeabi-v7a

* Remove unused tests
2024-01-16 15:47:34 +02:00
Maximilian Winter
c53c0e8d1a examples : fix and improv docs for the grammar generator (#4909)
* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.

* Fixed some issues and bugs of the grammar generator. Imporved Documentation

* Update pydantic_models_to_grammar.py
2024-01-16 14:10:48 +02:00
Daniel Bevenius
dd7b94d290 finetune : use LLAMA_FILE_MAGIC_GGLA (#4961)
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 13:14:19 +02:00
stduhpf
d55fdc7382 speculative : threading options (#4959)
* speculative: expose draft threading

* fix usage format

* accept -td and -tbd args

* speculative: revert default behavior when -td is unspecified

* fix trailing whitespace
2024-01-16 13:04:32 +02:00
Kawrakow
4629f415ed Add ability to use importance matrix for all k-quants (#4930)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 16:21:12 +02:00
Kawrakow
f6d2b332f6 2-bit quantizations (#4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
Georgi Gerganov
2329ddc378 metal : remove old API (#4919)
ggml-ci
2024-01-13 20:45:45 +02:00
Georgi Gerganov
8db1eb43ce server : fix prompt caching with system prompt (#4914) 2024-01-13 19:31:26 +02:00
David Friehs
9bee33a18c llama : minimize size used for state save/load (#4820)
* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
2024-01-13 18:29:43 +02:00
Yann Follet
cd34c8a010 main : add parameter --no-display-prompt (#4541)
* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens

* remove empty line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-13 18:09:08 +02:00
Ziad Ben Hadj-Alouane
a51d98e6ba server : fix deadlock that occurs in multi-prompt scenarios (#4905)
* * fix deadlock

* * dont ruint all whitespace
2024-01-13 16:20:46 +02:00
makomk
a4a5e25c95 server : fix crash with multimodal models without BOS token (#4904) 2024-01-13 16:16:11 +02:00
Maximilian Winter
9342050de3 examples : add pydantic models to GBNF grammar generator (#4883)
* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.
2024-01-12 21:46:45 +02:00
slaren
882a16a127 llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Daniel Bevenius
e55262208d export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 19:54:53 +02:00
Zay
9905daaaa3 llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
2024-01-12 14:48:00 +02:00
Kawrakow
b8e769ee21 Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
Georgi Gerganov
599288bf8a server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
Georgi Gerganov
d12d72c0a2 main : better name for variable n_print (#4874) 2024-01-11 22:46:26 +02:00
Georgi Gerganov
2e19c63956 main : disable token count by default (#4874) 2024-01-11 22:43:05 +02:00
Kawrakow
a9d5db805b llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Laura
026f72d14b server : implement credentialed CORS (#4514)
* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage
2024-01-11 20:02:48 +02:00
Michael Coppola
2fce0d62ba server : support for multiple api keys (#4864)
* server: added support for multiple api keys, added loading api keys from file

* minor: fix whitespace

* added file error handling to --api-key-file, changed code to better
reflect current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11 19:51:17 +02:00
Behnam M
1760ce4a1d server : add LOG_INFO when model is successfully loaded (#4881)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading
2024-01-11 19:41:39 +02:00
pudepiedj
948a870c14 main : print total token count and tokens consumed so far (#4874)
* Token count changes

* Add show token count

* Updating before PR

* Two requested changes

* Move param def posn
2024-01-11 18:14:52 +02:00
Isaac McFadyen
d08d46765f server : fix typo in model name (#4876) 2024-01-11 16:33:26 +02:00
Behnam M
751a33212c server : update readme to document the new /health endpoint (#4866)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too
2024-01-11 09:12:05 +02:00
Georgi Gerganov
a2e0602ac0 server : fix build + rename enums (#4870) 2024-01-11 09:10:34 +02:00
Behnam M
fe3d53f647 server : add a /health endpoint (#4860)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line
2024-01-10 21:56:05 +02:00
John
239c336728 clip : support more quantization types (#4846)
Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants.
This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations.

I ran a few tests, it doesn't break anything I could notice and a Q6_K ViT works almost as well as Q8_0 but 3 times the inference speed.
2024-01-10 15:37:09 +02:00
Justine Tunney
acc4879612 llava-cli : don't crash if --image flag is invalid (#4835)
This change fixes an issue where supplying `--image missing-file` would
result in a segfault due to a null pointer being dereferenced. This can
result in distracting info being printed if robust crash analysis tools
are being used.
2024-01-09 19:59:14 +02:00
Behnam M
b680381cfd server : update readme about token probs (#4777)
* updated server readme to reflect the gg/server-token-probs-4088 commit

added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.

* simplified the `completion_probabilities` JSON schema 

It's now easier to understand what the structure of `completion_probabilities` looks like.

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09 12:02:05 +02:00
Zsapi
2d0d38f5e0 server : add api-key flag to documentation (#4832)
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09 11:12:43 +02:00
Georgi Gerganov
3e86f86432 llama.swiftui : update readme 2024-01-08 15:57:36 +02:00
Georgi Gerganov
7ddf5857e7 main : add self-extend support (#4815)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div
2024-01-08 11:18:32 +02:00
Georgi Gerganov
a386b0dd63 examples : add passkey test (#3856)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
2024-01-08 11:14:04 +02:00
slaren
d513cfc4b5 llama-bench : add no-kv-offload parameter (#4812) 2024-01-07 17:59:01 +01:00
Alex Azarov
30df691a96 llama.swiftui : use llama.cpp as SPM package (#4804) 2024-01-07 10:20:50 +02:00
Alex Azarov
8c36aaf5a8 llama.swiftui : add visionOS target (#4805) 2024-01-07 09:46:55 +02:00
Georgi Gerganov
003f85d7ea server : fix n_predict check (#4798) 2024-01-07 08:45:26 +02:00