Commit Graph

2647 Commits

Author SHA1 Message Date
Daniel Bevenius
0a48a50efe gguf : add option to not check tensor data (#6582)
This commit adds an option to the gguf example to not check the tensor
data.

The motivation for this is that it can be nice to use the gguf tool to
read other .gguf files that were not created by the gguf tool.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-10 21:16:48 +03:00
Ralph Soika
f343ac422b minor layout improvements (#6572)
* minor layout improvements

* added missing file, run deps.sh locally
2024-04-10 19:18:25 +02:00
slaren
f5fab5862b llama : add model types for mixtral (#6589) 2024-04-10 17:24:14 +02:00
slaren
6dcf19070c convert.py : add consolidated.safetensors for mixtral 8x22b (#6587) 2024-04-10 15:23:12 +02:00
Pierrick Hymbert
7e0ca3abb2 docs : how to add a model (#6565)
* docs: how to add a model

* docs: model: typo and docs

* docs: model: add prevision on RoPE

* docs: model: rephrasing README.md

* docs: model: rephrasing README.md

* docs: model: README.md fix trailing spaces

* docs : some fixes

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-10 09:58:48 +03:00
Artem Zinnatullin
d885d8c317 readme : fix ROCm link (#6579) 2024-04-10 09:49:12 +03:00
sjxx
3a4635aa6f readme : update UI list (#6560) 2024-04-10 09:34:00 +03:00
Jiří Sejkora
6c55775ebb readme: fix typo in amdgpu target name (#6573) 2024-04-10 00:23:02 +02:00
Jared Van Bortel
ac727da885 BERT tokenizer fixes (#6498)
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
2024-04-09 13:44:08 -04:00
Georgi Gerganov
61e36e1532 sync : ggml 2024-04-09 20:29:06 +03:00
Ed Lee
f63aee5c16 server : detect search query to start webchat (#6554) 2024-04-09 10:31:47 +02:00
Carolinabanana
78aade9c48 llama : add Command R Plus support (#6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 11:16:13 +03:00
Georgi Gerganov
41c01483dd license : update copyright notice + add AUTHORS (#6405)
* license : add AUTHORS

* authors : update

* scipts : add LICENSE and gen-authors.sh to sync
2024-04-09 09:23:19 +03:00
Georgi Gerganov
fa18a55bd8 llama : fix attention layer count sanity check (#6550)
* llama : fix attention layer count sanity check

* llama : fix parentheses in attention layer count sanity check

There was otherwise a warning when compiling.

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-04-08 22:25:49 +03:00
kunnis
2324e00feb Comment explaining a decision (#6531) 2024-04-08 17:44:19 +02:00
Georgi Gerganov
f6d42b6366 quantize : fix precedence of cli args (#6541) 2024-04-08 16:23:01 +03:00
Rick G
d365f350b5 llama : support negative ith in llama_get_ API (#6519)
* llama_sampling_sample with default args is more naively usable

* Batches populated by either llama_batch_get_one or llama_batch_add work with default args
  * Previously get_one could use the default argument
  * Previously add should usually have used the last index where logits[idx] == true
* This hopefully encourages the use of llama_batch_add
  * By giving expected results when using default arguments.
* Adds "negative indexing" feature to llama_get_logits_ith and llama_get_embeddings_ith
* Believed to work with any currently well behaved program
  * Default arg now works for both cases (previously would give strange results for add case)
  * Any non-negative number is unaffected and behaves as previously
  * Negative arguments were previously invalid.
* Implemented as a special case of indexing as suggested by @compilade in https://github.com/ggerganov/llama.cpp/pull/6519

* Fixed mismatch type errors

* cited in macOS CI tests
* Missed in original updates based on PR feedback in https://github.com/ggerganov/llama.cpp/pull/6519
2024-04-08 16:02:30 +03:00
Jan Boon
c7959808aa llama : save and restore kv cache for single seq id (#6341)
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-08 15:43:30 +03:00
Abhilash Majumder
262b8ddd2c remove row=1 cond (#6532) 2024-04-08 16:26:01 +08:00
Firat
5417e85bbb Adding KodiBot to UI list (#6535)
KodiBot is free and open source ai chat app released under the GNU General Public License.
2024-04-08 09:48:29 +02:00
Mark Fairbairn
0b06becdd7 Change Windows AMD example to release build to make inference much faster. (#6525) 2024-04-07 20:52:19 +02:00
Georgi Gerganov
9bcd3315a1 flake.lock: Update (#6517)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
  → 'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d' (2024-04-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
  → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib' (2024-03-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
  → 'github:NixOS/nixpkgs/fd281bd6b7d3e32ddfa399853946f782553163b5' (2024-04-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-04-07 11:25:30 -07:00
DAN™
9b88711670 Add GritLM as supported models. (#6513) 2024-04-07 19:33:59 +02:00
Georgi Gerganov
2c80dc319b sync : ggml 2024-04-07 17:05:51 +03:00
Slava Primenko
0f3cd80c8f ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020)
`cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1

See this issue for more details:
https://github.com/ggerganov/examples/whisper/whisper.cpp/issues/2007
2024-04-07 17:05:40 +03:00
Georgi Gerganov
9638aa3727 scripts : sync ggml-cuda folder 2024-04-07 16:08:12 +03:00
limitedAtonement
16daf2d791 Run make to build the project (#6457) 2024-04-07 13:05:40 +02:00
Neo Zhang Jianyu
05a887dfdd support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (#6521) 2024-04-07 10:55:59 +08:00
Georgi Gerganov
0bcde06402 sync : ggml 2024-04-06 18:27:46 +03:00
Daniel Bevenius
7017659e90 backend : fix typo in scheduler documentation (ggml/781)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-06 17:42:26 +03:00
Clint Herron
5b69ee6ce5 Tests: Added integration tests for GBNF parser (#6472)
* Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements.

* Fixing whitespace errors and cleaning error message alert to be clearer.

* Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API.

* Comment cleanup.

* Reorganizing tests for readability.

* Cleaning up debug message to make a bit more sense.
2024-04-06 10:31:33 -04:00
Pierrick Hymbert
73e01781d4 ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as it is not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
2024-04-06 05:40:47 +02:00
Brian
8ab3d0da25 gguf.py : add licence and version to gguf writer (#6504) 2024-04-05 21:41:38 +03:00
Hoang Nguyen
2f6d0324a5 readme : update UI list (#6503)
* Add MindMac to UI list

* Update proprietary description

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-04-05 21:39:43 +03:00
Ting Sun
bcd1e3beaf bench : make n_batch and n_ubatch configurable in Batched bench (#6500)
* bench: make n_batch and n_ubatch configurable

* bench: update doc for batched bench
2024-04-05 21:34:53 +03:00
Ouadie EL FAROUKI
3c00380ea7 [SYCL] Fixed minor bug when enabling FP16 for non intel targets (#6464)
* moved INTEL_MKL guard from gemm_impl to gemm (wrapper)

* Update ggml-sycl.cpp

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

---------

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-05 19:05:06 +05:30
alexpinel
6285695777 readme : add Dot to UI list (#6487) 2024-04-04 13:22:50 -04:00
Jun Jie
cd082c6713 readme : fix typo (#6481) 2024-04-04 13:16:37 -04:00
Ed Lepedus
a57ef1110e server: add cURL support to server Dockerfiles (#6474)
* server: add cURL support to `full.Dockerfile`

* server: add cURL support to `full-cuda.Dockerfile` and `server-cuda.Dockerfile`

* server: add cURL support to `full-rocm.Dockerfile` and `server-rocm.Dockerfile`

* server: add cURL support to `server-intel.Dockerfile`

* server: add cURL support to `server-vulkan.Dockerfile`

* fix typo in `server-vulkan.Dockerfile`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-04 18:31:22 +02:00
Minsoo Cheong
016aa58b11 ci: exempt master branch workflows from getting cancelled (#6486)
* ci: exempt master branch workflows from getting cancelled

* apply to bench.yml
2024-04-04 18:30:53 +02:00
Ewout ter Hoeven
7d32d7d775 build CI: Name artifacts (#6482)
Name the artifacts in the build CI, so that they get uploaded with separate names, instead of all put into the same `artifact` ZIP.

It might be possible to further simplify the packing step (in future PRs).
2024-04-04 17:08:55 +02:00
Shakhar Dasgupta
deefac27f2 server: allow penalizing repetition of newlines on server webpage (#6431) 2024-04-04 17:03:00 +02:00
Pierrick Hymbert
fd66566ee1 ci: bench fix concurrency for workflow trigger dispatch with sha1 (#6478) 2024-04-04 16:59:04 +02:00
limitedAtonement
997c0854b4 Correct README link (#6458)
README is called README.md.
2024-04-04 16:30:02 +02:00
Pierrick Hymbert
4cf330fb1e ci: bench: add more ftype, fix triggers and bot comment (#6466)
* ci: bench: change trigger path to not spawn on each PR

* ci: bench: add more file type for phi-2: q8_0 and f16.
- do not show the comment by default

* ci: bench: add seed parameter in k6 script

* ci: bench: artefact name perf job

* Add iteration in the commit status, reduce again the autocomment

* ci: bench: add per slot metric in the commit status

* Fix trailing spaces
2024-04-04 12:57:58 +03:00
Daniel Bevenius
6d074393e8 common: remove duplicate check for curl (#6471)
This commit removes one of the two identical checks for curl being NULL
in llama_load_model_from_url.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-04 09:49:21 +02:00
Clint Herron
9179276b55 examples : add GBNF validator program (#5948)
* Revising GBNF validator program to be much simpler.

* Changing from streams to using cstdio

* Adding final newline character.
2024-04-04 10:44:28 +03:00
Georgi Gerganov
af0871e8a7 server : remove obsolete --memory-f32 option 2024-04-04 09:34:58 +03:00
Xiao-Yong Jin
abdfc39ec8 server : add option to disable KV offload (#6468) 2024-04-04 09:33:48 +03:00
Clint Herron
d7a430215b convert : fix for lint error complaining of bare except (#6470) 2024-04-04 09:32:53 +03:00