Commit Graph

1871 Commits

Author SHA1 Message Date
Georgi Gerganov
b9d3d4307d llama : check LLAMA_TRACE env for extra logging (#4929)
* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci
2024-01-14 13:26:53 +02:00
Georgi Gerganov
97e013bd09 scripts : sync-ggml-am.sh option to skip commits 2024-01-14 11:08:41 +02:00
Georgi Gerganov
ccd0d21f80 llama : use LLAMA_LOG_ macros for logging 2024-01-14 11:03:19 +02:00
Kawrakow
1fb783d4df Fix ffn_down quantization mix for MoE models (#4927)
* Fix ffn_down quantization mix for MoE models

In #4872 I did not consider the part where every third
tensor is quantized with more bits. Fir MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:53:39 +02:00
Alex Azarov
6c92cc5bb7 metal : correctly set SIMD support flags on iOS (#4923)
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
2024-01-14 10:44:39 +02:00
Karthik Kumar Viswanathan
e77c6d2b23 llama : support WinXP build with MinGW 8.1.0 (#3419) 2024-01-14 10:41:44 +02:00
Kawrakow
f6d2b332f6 2-bit quantizations (#4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
Kawrakow
b56c35d2ef Make Q3_K_S be the same as olf Q3_K_L for Mixtral-8x7B (#4906)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:44:30 +02:00
Georgi Gerganov
2e51f37f77 sync : ggml 2024-01-14 00:14:46 +02:00
Johannes Gäßler
f2426af97b ggml: cache sin/cos for RoPE (#4908) 2024-01-13 21:41:37 +01:00
Georgi Gerganov
2329ddc378 metal : remove old API (#4919)
ggml-ci
2024-01-13 20:45:45 +02:00
Georgi Gerganov
8db1eb43ce server : fix prompt caching with system prompt (#4914) 2024-01-13 19:31:26 +02:00
Georgi Gerganov
cc46b61450 llama : fix detokenization of non-special added-tokens (#4916)
Co-authored-by: goerch <jhr.walter@t-online.de>
2024-01-13 18:47:38 +02:00
Georgi Gerganov
1b05b8ef88 metal : disable log for loaded kernels (#4794) 2024-01-13 18:46:37 +02:00
David Friehs
9bee33a18c llama : minimize size used for state save/load (#4820)
* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
2024-01-13 18:29:43 +02:00
Someone
0930ffa61d workflows: unbreak nix-build-aarch64, and split it out (#4915)
The fix should be just the `sudo apt-get update`
2024-01-13 16:29:16 +00:00
Yann Follet
cd34c8a010 main : add parameter --no-display-prompt (#4541)
* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens

* remove empty line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-13 18:09:08 +02:00
texmex76
6bc73ad492 gguf : fix potential infinite for-loop (#4600)
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-13 18:06:20 +02:00
Georgi Gerganov
ac78d391cb metal : refactor kernel loading code (#4794)
* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineStatus calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci
2024-01-13 18:03:45 +02:00
Johannes Gäßler
7079775d2b compare-llama-bench: tweak output format (#4910) 2024-01-13 15:52:53 +01:00
Ziad Ben Hadj-Alouane
a51d98e6ba server : fix deadlock that occurs in multi-prompt scenarios (#4905)
* * fix deadlock

* * dont ruint all whitespace
2024-01-13 16:20:46 +02:00
makomk
a4a5e25c95 server : fix crash with multimodal models without BOS token (#4904) 2024-01-13 16:16:11 +02:00
Georgi Gerganov
cc5f0d2aeb convert : update phi-2 to latest HF repo (#4903)
* convert : update phi-2 to latest HF repo

ggml-ci

* py : try to fix flake stuff
2024-01-13 13:44:37 +02:00
Georgi Gerganov
211567ae1a sync : ggml 2024-01-12 22:02:43 +02:00
Georgi Gerganov
8a8a831cef ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)
* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix
2024-01-12 22:02:11 +02:00
slaren
b103cfe655 backend_sched : fix assignments
ggml-ci
2024-01-12 22:02:11 +02:00
Maximilian Winter
9342050de3 examples : add pydantic models to GBNF grammar generator (#4883)
* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.
2024-01-12 21:46:45 +02:00
Johannes Gäßler
50f828eab9 CUDA: faster q8_0 -> f16 dequantization (#4895) 2024-01-12 20:38:54 +01:00
slaren
882a16a127 llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Georgi Gerganov
ab969b80d4 llama : remove redundant assert for StableLM (#4901) 2024-01-12 20:54:12 +02:00
Daniel Bevenius
e55262208d export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 19:54:53 +02:00
Zay
9905daaaa3 llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
2024-01-12 14:48:00 +02:00
Georgi Gerganov
cda6c08cb5 gitignore : imatrix 2024-01-12 14:33:21 +02:00
Johannes Gäßler
d80c163451 CUDA: fix softmax compile for old CUDA versions (#4862) 2024-01-12 12:30:41 +01:00
Georgi Gerganov
060cb10825 llama : fix typo "imp_embd" -> "inp_embd" 2024-01-12 13:11:15 +02:00
howlger
55dcfe3b5a common : streamline the formatting of help (#4890)
* common : streamline the formatting of help

- Separate alternative parameters by a comma

- Do not indent `--version` differently

* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 13:05:32 +02:00
Georgi Gerganov
6cd8a0d214 py : fix lint (#4889) 2024-01-12 13:03:38 +02:00
Georgi Gerganov
9d4439a38d llama : fix llm_build_k_shift to use correct n_rot (#4889)
* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot
2024-01-12 13:01:56 +02:00
Kawrakow
b8e769ee21 Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
Georgi Gerganov
599288bf8a server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
Georgi Gerganov
d12d72c0a2 main : better name for variable n_print (#4874) 2024-01-11 22:46:26 +02:00
Georgi Gerganov
2e19c63956 main : disable token count by default (#4874) 2024-01-11 22:43:05 +02:00
Georgi Gerganov
6936a37c2b swift : track ggml release branch (#4867) 2024-01-11 21:58:28 +02:00
Kawrakow
a9d5db805b llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Kawrakow
2757620588 ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:39:39 +02:00
Georgi Gerganov
8b546126e4 swift : pin ggml commit + remove ggml.h from spm-headers (#4878)
ggml-ci
2024-01-11 21:31:31 +02:00
Laura
026f72d14b server : implement credentialed CORS (#4514)
* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage
2024-01-11 20:02:48 +02:00
Michael Coppola
2fce0d62ba server : support for multiple api keys (#4864)
* server: added support for multiple api keys, added loading api keys from file

* minor: fix whitespace

* added file error handling to --api-key-file, changed code to better
reflect current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11 19:51:17 +02:00
Behnam M
1760ce4a1d server : add LOG_INFO when model is successfully loaded (#4881)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading
2024-01-11 19:41:39 +02:00
Someone
855d77b3ce ci: nix-flake-update: new token with pr permissions (#4879)
* ci: nix-flake-update: new token with pr permissions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 17:22:34 +00:00