Commit Graph

999 Commits

Author SHA1 Message Date
staviq
380d2fca81 server : support for saving templates in browser LocalStorage (#2486)
* support for templates in browser LocalStorage

* sync accepted #2409 fix from upstream

* convert autosave invocation to useEffect

* Apply suggestions from code review

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

* Regen index.html.cpp, suggested from code review

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-18 07:34:01 +08:00
Johannes Gäßler
0fc9b45a9a README: fix LLAMA_CUDA_MMV_Y documentation (#2647) 2023-08-17 23:57:59 +02:00
Henri Vasserman
dfd2bf9187 [Zig] Fixing Zig build and improvements (#2554)
* Fix zig after console.o was split

* Better include and flag management

* Change LTO to option
2023-08-17 23:11:18 +03:00
Kerfuffle
aad3ce7fee Add --cfg-negative-prompt-file option for examples (#2591)
2023-08-17 07:29:44 -06:00
Georgi Gerganov
3340c5d564 llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
2023-08-17 10:47:09 +03:00
drbh
747ac45a45 tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0 terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu
783bffe0ba ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
2023-08-17 10:35:53 +03:00
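A minimal sketch of the best-fit policy described above (illustrative names, not the actual ggml-alloc internals):

    #include <stddef.h>

    struct free_block { size_t offset, size; };

    /* Best fit: pick the smallest free block that can hold `size`.
       During measurement the last block is effectively unbounded, so it
       loses against every finite block that fits; during evaluation it is
       finite and can win, which is the discrepancy this commit removes. */
    static int best_fit(const struct free_block *blocks, int n, size_t size) {
        int    best      = -1;
        size_t best_size = (size_t) -1;
        for (int i = 0; i < n; i++) {
            if (blocks[i].size >= size && blocks[i].size < best_size) {
                best      = i;
                best_size = blocks[i].size;
            }
        }
        return best;
    }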
Kolen Cheung
33a7556b2e cmake : install ggml-meta.metal if LLAMA_METAL (#2449) 2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
f5f80b4078 metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
Shouzheng Liu
3e347c09df metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16 23:08:28 +03:00
Shouzheng Liu
d47e62c9da metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. It also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integer division is slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
2023-08-16 23:07:04 +03:00
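An illustrative sketch of the 32-bit divide trick from the GQA fix above (hypothetical helper; the actual kernel code differs):

    #include <stdint.h>

    /* A 32-bit divide selects the matrix and the row within it; only the
       final offset arithmetic is 64-bit. This caps a single matrix at
       ~4 GB, the limitation noted in the commit message. */
    static uint64_t src_offset(uint32_t row, uint32_t rows_per_mat,
                               uint32_t row_stride, uint64_t mat_stride) {
        uint32_t mat   = row / rows_per_mat;       /* cheap 32-bit divide  */
        uint32_t local = row - mat * rows_per_mat; /* remainder, no divide */
        return (uint64_t) mat * mat_stride + (uint64_t) local * row_stride;
    }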
Georgi Gerganov
7c50c2c95a scripts : add helper script to get wikitext 2023-08-15 10:05:25 +03:00
Jhen-Jie Hong
87be3696f6 server : add missing /json-schema-to-grammar.mjs (#2616)
fixes #2611
2023-08-15 06:14:14 +08:00
Jhen-Jie Hong
1aa7b274c7 metal : return null instead of exit(1) (#2573) 2023-08-14 16:37:39 +03:00
Cheng Shao
bb04b1c694 server : add --numa support (#2524) 2023-08-14 16:36:42 +03:00
Kamil Tomšík
f638f57333 llama : add missing enum keyword in function signatures (#2610) 2023-08-14 16:35:16 +03:00
Johannes Gäßler
677c32921e CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) 2023-08-14 10:41:22 +02:00
Jhen-Jie Hong
16aaeef5b3 server : fix default grammar by using an empty string in the UI (#2604) 2023-08-14 16:20:17 +08:00
Jhen-Jie Hong
4fce312113 server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)
* server : implement json-schema-to-grammar.mjs by following the Python impl

* server : add grammar support in chat.mjs

* server : implement grammar param in the UI

* server : generate .hpp

* server : remove trailing whitespaces

* server : generate .hpp

* server : fix sort of prop pairs

* server : optimize regex & iteration
2023-08-14 15:16:54 +08:00
vxiiduu
67cd356f19 Enhance Windows 7 and below compatibility. (#2592)
* Enhance Windows 7 compatibility.
* Clean away unnecessary preprocessor conditional
2023-08-13 20:59:16 -07:00
drbh
56208c9293 test : add simple grammar parsing tests (#2594)
* adds simple grammar parsing tests

* adds cassert header
2023-08-13 17:00:48 +03:00
Johannes Gäßler
e1305c82c3 CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) 2023-08-13 00:24:45 +02:00
byte-6174
bb9ebb4394 Adding support for llama2.c models (#2559) 2023-08-12 01:17:25 +02:00
Equim
ae2377983b server: fixed wrong variable name in timing json (#2579)
* server: fixed wrong variable name in timing json

* remove redundant entry
2023-08-12 00:35:14 +02:00
DannyDaemonic
17defbc44e Handle ENABLE_VIRTUAL_TERMINAL_PROCESSING more gracefully on earlier versions of Windows. 2023-08-10 13:11:36 -07:00
Christian Demsar
c366b01720 Add --n-predict -2 for stopping generation on full context (#2565) 2023-08-10 16:28:27 +02:00
Martin Krasser
17d5b4bfbe Fix grammar-based sampling issue in server (#2566) 2023-08-10 13:16:38 +03:00
Sam Spilsbury
186283cb2d ggml-alloc: Don't try to re-use buffers of external tensors (#2562)
* ggml-alloc: Don't try to re-use buffers of external tensors

They might be weights that came from another context, so we
have no control over them (and they might be re-used elsewhere
so writing to them would be a bad idea).

* ggml-alloc: >= when checking for out-of-bounds

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-08-09 22:47:42 +02:00
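A rough sketch of the ownership check implied by the two fixes above (hypothetical names; the real ggml-alloc code differs):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The allocator only recycles memory it owns; external weights from
       another context fall outside [base, base + size) and are skipped. */
    static bool tensor_in_buffer(const uint8_t *base, size_t size,
                                 const uint8_t *data) {
        if (data < base) return false;
        /* >= rather than >: base + size is one past the end of the buffer,
           so a pointer exactly at the end is already out of bounds. */
        if (data >= base + size) return false;
        return true;
    }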
grahameth
70b6d930bd add log_callback to llama_context_params for custom logging. (#2234)
* add log_callback to llama_context_params for custom logging.

* Fix macro expansion on gcc

* Add struct llama_state for global variables and move log_callback there

* Turn log level into enum and some minor changes.

* Remove model_for_logging parameter (not needed anymore)

* Convert remaining fprintf(stderr, ...) calls to use new macros.

* Fix enum and initialize g_state

* Fix log calls after merge

* Fix missing static

* Add back all the new lines in the logging strings

* Add comment for llama_log_callback and replace remaining printf calls

---------

Co-authored-by: grahameth <->
Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
2023-08-09 22:46:40 +02:00
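A minimal sketch of the callback-based logging pattern this commit describes; the names, enum values, and signatures here are illustrative, not the exact llama.cpp API:

    #include <stddef.h>
    #include <stdio.h>

    enum log_level { LOG_LEVEL_INFO, LOG_LEVEL_WARN, LOG_LEVEL_ERROR };

    typedef void (*log_callback)(enum log_level level, const char *text,
                                 void *user_data);

    /* Default sink keeps the old behavior: write to stderr, exactly where
       the fprintf(stderr, ...) calls went before the refactor. */
    static void log_default(enum log_level level, const char *text,
                            void *user_data) {
        (void) level; (void) user_data;
        fputs(text, stderr);
    }

    /* Global state holding the callback, mirroring the llama_state idea. */
    static struct { log_callback cb; void *user_data; } g_log = { log_default, NULL };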
Johannes Gäßler
ade4f55531 CUDA: tuned mul_mat_q kernels (#2546) 2023-08-09 09:42:34 +02:00
Martin Krasser
220649a37f Allow passing grammar to completion endpoint (#2532)
2023-08-08 16:29:19 +03:00
Johannes Gäßler
fe11463cef CUDA: tighter VRAM scratch size for 65b/70b (#2551) 2023-08-08 14:38:16 +02:00
chaihahaha
932dac4a50 llm.vim : multiline autocompletion, get rid of "^@" (#2543) 2023-08-08 15:07:02 +03:00
Georgi Gerganov
74f72ea675 vim : bring back simple llm.vim example 2023-08-08 15:06:18 +03:00
AustinMroz
d517b2f07c vim : streaming and more (#2495)
* Update Vim plugin

* Remove getbufoneline usage, add input bind example.

getbufoneline() appears to be a recently added function and has been
replaced with getbufline for compatibility.

An additional example was added showing how to add a keybind that works
in insert mode.
2023-08-08 14:44:48 +03:00
klosax
fb17afa861 Add --rope-scale parameter (#2544)
* common.cpp : Add --rope-scale parameter
* README.md : Add info about using linear rope scaling
2023-08-07 19:07:19 +02:00
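For context, linear RoPE scaling divides the token position by the scale factor before the rotary angle is computed, letting a model address more positions than it was trained on; a sketch under that assumption (hypothetical helper, not the actual ggml code):

    #include <math.h>

    /* With --rope-scale S, position p behaves like p / S, so a model
       trained on 4096 positions can address roughly S * 4096 tokens. */
    static float rope_theta(int pos, int pair, int n_dims,
                            float freq_base, float rope_scale) {
        float p = (float) pos / rope_scale;
        return p * powf(freq_base, -2.0f * (float) pair / (float) n_dims);
    }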
Georgi Gerganov
0248ddcded ggml : mul mat tweaks (#2372)
* ggml : mul mat wip

ggml-ci

* ggml : alternative thread distribution for mul_mat

ggml-ci

* ggml : mul_mat block tiling attempt

* ggml : mul_mat threads yield

ggml-ci
2023-08-07 14:25:58 +03:00
Georgi Gerganov
0208ddf702 ggml : pad result of ggml_nbytes() 2023-08-07 14:24:42 +03:00
Georgi Gerganov
d922c47542 ggml : change params pointer (style change) (#2539)
ggml-ci
2023-08-07 13:55:18 +03:00
Georgi Gerganov
f3ee043e24 ggml : sync (custom ops) (#2537)
ggml-ci
2023-08-07 13:20:09 +03:00
Johannes Gäßler
19cf204509 Fixed mmap prefetch for GPU offloading (#2529) 2023-08-07 10:09:40 +02:00
Georgi Gerganov
0637bae56a metal : fix out-of-bounds access + inc concurrency nodes (#2416)
* metal : fix out-of-bounds access + style changes

* metal : increase concurrency nodes to 2*GGML_MAX_NODES
2023-08-07 10:52:57 +03:00
GiviMAD
c39e3477fd [Makefile] Move ARM CFLAGS before compilation (#2536) 2023-08-07 09:21:46 +03:00
Henri Vasserman
8cc4ff2f1a [Zig] Rewrite build for Zig 0.11 (#2514)
* zig build fixes

* Disable LTO on Windows.
2023-08-07 08:35:53 +03:00
DannyDaemonic
28fe11b65e console : fix issue related to Windows 11 PowerShell console mode persistence (#2521) 2023-08-06 09:49:34 +03:00
Keiichi Tabata
49fe9844e7 convert.py : add missing abstract methods for quantized data (#2491) 2023-08-06 09:34:05 +03:00
Johannes Gäßler
d5060f4590 CUDA: faster k-quant mul_mat_q kernels (#2525) 2023-08-05 18:20:44 +02:00
Jonas Wunderlich
e45dbc8951 fix firefox autoscroll (#2519) 2023-08-04 22:16:11 +02:00
Cebtenzzre
6ae47d1968 server: regenerate completion.js.hpp (#2515) 2023-08-04 21:00:57 +02:00
Cebtenzzre
bcf5f2e1d8 CUDA: use min compute capability of GPUs actually used (#2506) 2023-08-04 17:35:22 +02:00