Nigel Bosch
ac6575162d
Handle null rope scaling value (#2793)
2023-08-26 14:11:17 +02:00
klosax
976b621020
Fix spm whitespaces (#2806)
* llama.cpp : fix spm whitespace escaping + clean up
* main.cpp : spm - add whitespace in front of prompt
* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
2023-08-26 13:45:53 +02:00
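The SentencePiece whitespace convention these spm commits deal with can be sketched as follows (an illustrative helper, not the llama.cpp code; the function name is hypothetical):

```python
SPM_SPACE = "\u2581"  # U+2581 LOWER ONE EIGHTH BLOCK, SentencePiece's space marker

def spm_escape(prompt: str) -> str:
    # SPM-style tokenizers expect a leading space in front of the prompt
    # and represent spaces internally as U+2581, so escape both here.
    return (" " + prompt).replace(" ", SPM_SPACE)
```

For example, `spm_escape("Hello world")` yields `"▁Hello▁world"`.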
lon
7ede4319fc
examples : skip unnecessary external lib in server README.md how-to (#2804)
2023-08-26 16:07:43 +08:00
Marcus Dunn
f19ed06ed0
llama : fix struct decl (#2790)
2023-08-25 19:17:15 +03:00
Kawrakow
198140c2aa
Faster perplexity computation (#2786)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-25 19:05:02 +03:00
Matt Pulver
3e0b38e027
llama : add llama_beam_search() (#2267)
* Add llama_beam_search().
* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().
* Add space around * pointers and & references.
* Add spaces around comparison and assignment operators.
* Prefer west const.
* Use llama_ prefix for structs in global namespace.
* Delete obsolete comment from an earlier revision.
* Change eos to eob in llama_beam and llama_beam_view structs.
2023-08-25 18:18:48 +03:00
Nigel Bosch
8ef2a0c9d3
convert.py : Get rope scale from HuggingFace models (#2772)
* Get rope scale from HF models
* Save rope scale only for linear scaling
* Rewrite for clarity
2023-08-25 16:41:52 +02:00
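The rope-scale handling in this convert.py commit (including the null case fixed in #2793 above) can be sketched roughly like this; the helper name is hypothetical, but the `rope_scaling` layout matches HuggingFace `config.json` files:

```python
import json

def get_rope_scale(config_path: str):
    # In a HuggingFace config.json, "rope_scaling" is absent, null, or a
    # dict such as {"type": "linear", "factor": 2.0}. The scale is saved
    # only for linear scaling; a null value must not be dereferenced.
    with open(config_path) as f:
        config = json.load(f)
    rope_scaling = config.get("rope_scaling")  # None when absent or null
    if rope_scaling is not None and rope_scaling.get("type") == "linear":
        return rope_scaling.get("factor")
    return None
```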
slaren
89e4a4461e
llama-bench : add model sizes (#2771)
* llama-bench : add model sizes
* more compact markdown output
* back to GiB
* adjust column sizes
2023-08-25 15:16:19 +02:00
slaren
e20b657ffb
convert.py : export rope freq_base when converting CodeLlama from an HF model (#2773)
2023-08-25 14:08:53 +02:00
Jhen-Jie Hong
b06380dcc0
server : display token probabilities in the UI (#2489)
* server : add n_probs param in chat UI
* server : keep message data array & show in probabilities component
* server : add simple popover component
* server : fix completion_probabilities undefined if n_probs not set
* server : implement Probabilities
* server : handle bytes
* server : make n_probs max 10 for easy scrolling
* server : adjust for dark/light mode
* server : fix regenerated prompt
* server : update index.html.hpp
* server : convert prob to percentage + show original value as div title
* server : fix Probabilities not used if empty str included
* server : skip byte pair in display probabilities
* server : remove array check of completion_probabilities in messages
* skip empty array or byte pair (> 1) in Probabilities
* generate index.html.hpp
* fix incorrect prob conversion if the str is already a known token
* use final response to show probabilities on stop
* revert unnecessary change
* correct probabilities usage
* remove unused function
* always send partial response to get correct probs of last to_send
* fix typo
* fix content of format_final_response
* refactor probs render & make pColor transparent if not found
* send empty string when stop_pos found in partial
* avoid unnecessary empty data event & send rest of partial tokens on stop
* use <br /> for new line
* skip -1 tok in loop to avoid sending '' at end
* trim last new lines on stop
* revert unnecessary change
2023-08-25 18:32:45 +08:00
Georgi Gerganov
6224f81799
ci : pip install gguf in editable mode (#2782)
ggml-ci
2023-08-25 13:03:25 +03:00
M. Yusuf Sarıgöz
08a1012230
gguf : export objects to user code (#2780)
* gguf export more objects to user code
* gguf export all objects to user code for now
* gguf : bump version
2023-08-25 12:43:41 +03:00
Henri Vasserman
984b7495ed
ROCm Port (#1087)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple AMD architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
---------
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
Georgi Gerganov
40c8c6dd6f
cuda : add RoPE kernel for mode == 2 (NeoX) (#2760)
* cuda : add RoPE kernel for mode == 2 (NeoX)
* falcon : do not offload the embeddings layer
2023-08-25 11:55:59 +03:00
M. Yusuf Sarıgöz
a67ec14fe9
gguf : make gguf pip-installable
* gitignore : add dist and rm pyproject.toml
* gguf: prepare as Pip package
* gguf: prepare as Pip package
* gguf : fix line endings
* requirements : add gguf
* gguf : update readme with build notes
* gguf : update readme with build notes
* gguf : add notes for tests
2023-08-25 09:26:05 +03:00
Shouzheng Liu
800ef93db4
ggml-alloc : enlarge size of parse_seq (#2776)
Since we also store barriers in this array, we need to double its size.
2023-08-25 08:58:00 +03:00
Marcus Dunn
1e8200c3d0
Added enum to llama_token_get_type return type (#2774)
2023-08-24 23:49:30 +02:00
slaren
506fd81d05
convert.py : try to determine n_ctx automatically for CodeLlama (#2770)
2023-08-24 21:10:39 +02:00
slaren
9818be3377
gguf : add rope_freq_base parameter for CodeLlama (#2769)
2023-08-24 21:04:05 +03:00
Georgi Gerganov
9042737101
falcon : write file type
2023-08-24 19:58:30 +03:00
Shouzheng Liu
dc51c17e4c
metal : bug-fix when enabling ggml-alloc (#2757)
* metal: better memory alloc w/ concurrency dispatch
The ggml-alloc should only free tensors at memory barriers.
* ggml-alloc: avoid returning silently
In certain cases, the allocate_node() function may silently return
without performing any memory allocation.
2023-08-24 19:27:25 +03:00
Georgi Gerganov
8b08abe24f
convert : auto-determine model name based on dir + scripts update
2023-08-24 19:26:47 +03:00
Kerfuffle
2a2645fd76
Fix for main example getting stuck when -n -2 and --interactive (#2767)
* Fix for main example getting stuck when -n -2 and --interactive
* Add a comment so future generations may suffer less.
2023-08-24 10:11:13 -06:00
slaren
3b743a5340
fix convert.py for codellama, add llama 34B to the list of recognized models (#2768)
2023-08-24 17:44:11 +02:00
DannyDaemonic
a74a205f64
Tag release with build number (#2732)
* Modified build.yml to use build number for release
* Add the short hash back into the tag
* Prefix the build number with b
2023-08-24 15:58:02 +02:00
Georgi Gerganov
25399c1197
metal : add Q8_0 support (#2763)
* metal : add dequantize_q8_0 kernel
* metal : add mul_mat_q8_0_f32 kernel
* metal : add Q8_0 mul_mm kernel
2023-08-24 16:19:57 +03:00
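For context, Q8_0 stores weights in blocks of 32 int8 values with one fp16 scale per block; a NumPy sketch of the dequantization the new Metal kernels perform (illustrative, not the Metal source):

```python
import numpy as np

QK8_0 = 32  # values per Q8_0 block

def dequantize_q8_0(scales: np.ndarray, quants: np.ndarray) -> np.ndarray:
    # scales: (n_blocks,) float16, one scale d per block
    # quants: (n_blocks, QK8_0) int8 quantized values
    # Each weight is reconstructed as x[i] = d * qs[i].
    return scales.astype(np.float32)[:, None] * quants.astype(np.float32)
```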
Georgi Gerganov
96e9fad81f
llama : escape all U+2581 in a string (#2750)
2023-08-24 12:26:01 +03:00
Evan Jones
f4102e260a
llama : fix grammar sometimes generating null char (#2756)
2023-08-24 07:07:13 +03:00
Georgi Gerganov
fc84c48240
readme : fix link
2023-08-23 23:44:19 +03:00
Georgi Gerganov
1fac3b2c0b
minor : fix trailing whitespace
2023-08-23 23:43:00 +03:00
Georgi Gerganov
eb5bf4480c
readme : update hot topics
2023-08-23 23:41:16 +03:00
Georgi Gerganov
5faba0e8a3
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739)
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp implementation
* llama : remove obsolete comment
ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
Georgi Gerganov
4e35149e03
minor : fix trailing whitespace
2023-08-23 22:37:39 +03:00
Olivier Chafik
7bdaf1f167
examples : restore the functionality to import llama2.c models (#2685)
* Fix import of llama2.c models that don't share weights between embedding layers
* llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv
* llama2.c: comment out legacy "load from ggml model" logic
* llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin
2023-08-23 22:33:05 +03:00
slaren
fb68e9c4e4
fix convert-lora-to-ggml.py (#2738)
2023-08-23 16:46:54 +02:00
klosax
ae180e1cec
main : insert bos if no tokens (#2727)
* main.cpp : insert bos if no tokens
* Update examples/main/main.cpp
* Update examples/main/main.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-23 16:46:03 +02:00
akawrykow
e2108e57c8
gitignore : fix for windows (#2729)
2023-08-23 17:31:34 +03:00
Cebtenzzre
557d5f9edf
chmod : make scripts executable (#2675)
2023-08-23 17:29:09 +03:00
JohnnyB
61c5da152b
devops : RPM Specs (#2723)
* Create llama-cpp.srpm
* Rename llama-cpp.srpm to llama-cpp.srpm.spec
Correcting extension.
* Tested spec success.
* Update llama-cpp.srpm.spec
* Create lamma-cpp-cublas.srpm.spec
* Create lamma-cpp-clblast.srpm.spec
* Update lamma-cpp-cublas.srpm.spec
Added BuildRequires
* Moved to devops dir
2023-08-23 17:28:22 +03:00
Kawrakow
6d9174f956
Fix values shown in the quantize tool help (#2735)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:57:12 +03:00
Kawrakow
0a7ab80b61
Strided perplexity (#2714)
* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00
IgnacioFDM
047c0403c4
Fix ggml to gguf conversion on Windows (#2733)
This fixes `RuntimeWarning: overflow encountered in long_scalars`
Credit: anon (not mine)
2023-08-23 03:31:09 -06:00
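The Windows failure comes from NumPy's default 32-bit integer scalars on that platform, so multiplying large shape values overflows. A sketch of the kind of guard that avoids it (illustrative only, not the actual patch):

```python
import numpy as np

def n_elements(shape) -> int:
    # np.int32 * np.int32 can overflow on Windows, where NumPy's default
    # integer is 32-bit; converting to Python ints (arbitrary precision)
    # before multiplying sidesteps the "overflow in long_scalars" warning.
    n = 1
    for dim in shape:
        n *= int(dim)
    return n
```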
Xiao-Yong Jin
58656653b1
server : allow json array in prompt or content for direct token input (#2306)
* server: allow json array in prompt or content
We accept an array of strings and numbers representing tokens,
in addition to the current string-valued prompt or content.
This allows direct token input: special tokens can be processed
and inserted at the frontend while the json data is constructed,
before it is sent to the server, so the server does not need to
know about or parse special tokens in textual input. With this,
we can use the EOS and BOS tokens expected by llama-2-chat models.
* server: use tokenizePrompt(json) and default "" if empty prompt
* server: fix prompt check
* server: tokenize endpoint no longer adds BOS
2023-08-23 15:12:12 +08:00
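A request using such a mixed prompt might look like the following (illustrative; assumes the server's completion endpoint and LLaMA's BOS token id of 1):

```python
import json

# The "prompt" array mixes raw token ids with text pieces: numbers pass
# through as tokens untouched, while strings are tokenized by the server.
payload = {
    "prompt": [1, "[INST] Hello [/INST]"],  # 1 = BOS for LLaMA models
    "n_predict": 16,
}
body = json.dumps(payload)  # POST this to the server's completion endpoint
```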
Evan Jones
943bf8930c
docs : add grammar docs (#2701)
* docs : add grammar docs
* tweaks to grammar guide
* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
Kerfuffle
0ef7086455
Improve handling of special tokens in GGML to GGUF converter (#2725)
* Improve UNK, BOS, EOS token handling when converting without metadata.
* Allow importing as a module.
* Remove some obsolete code and minor cleanups.
* Set default UNK token mapping from -1 to 0 in llama.cpp
* Try to handle overflow due to buggy Windows Python with a better error message
2023-08-22 17:39:39 -06:00
goerch
d916cb3d85
llama : fix whitespace escaping in tokenizer (#2724)
2023-08-23 00:10:42 +03:00
Johannes Gäßler
466a79f7b4
CUDA: use mul_mat_q kernels by default (#2683)
2023-08-22 22:47:05 +02:00
Alex Petenchea
c358145028
convert.py : clarifying error message (#2718)
2023-08-22 21:58:16 +03:00
Jiahao Li
946bf0ad96
Fix CUDA softmax by subtracting max value before exp (#2665)
2023-08-22 20:27:06 +02:00
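This fix is the standard max-subtraction trick: softmax is shift-invariant, so subtracting the row max before exponentiating prevents exp() overflow without changing the result. A NumPy sketch:

```python
import numpy as np

def softmax_stable(x):
    # softmax(x) == softmax(x - c) for any constant c; choosing c = max(x)
    # keeps every exponent <= 0, so exp() can never overflow.
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()
```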
Georgi Gerganov
45b45614c0
gguf : add ftype meta info to the model (#2710)
* llama : add ftype meta info to the model
ggml-ci
* convert.py : add ftype when converting (does not work)
* convert.py : fix Enum to IntEnum
ggml-ci
2023-08-22 20:05:59 +03:00