Commit Graph

999 Commits

Author SHA1 Message Date
staviq
380d2fca81 server : support for saving templates in browser LocalStorage (#2486)
* support for templates in browser LocalStorage

* sync accepted #2409 fix from upstream

* convert autosave invocation to useEffect

* Apply suggestions from code review

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

* Regen index.html.cpp, suggested from code review

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-18 07:34:01 +08:00
Johannes Gäßler
0fc9b45a9a README: fix LLAMA_CUDA_MMV_Y documentation (#2647) 2023-08-17 23:57:59 +02:00
Henri Vasserman
dfd2bf9187 [Zig] Fixing Zig build and improvements (#2554)
* Fix zig after console.o was split

* Better include and flag management

* Change LTO to option
2023-08-17 23:11:18 +03:00
Kerfuffle
aad3ce7fee Add --cfg-negative-prompt-file option for examples (#2591)
2023-08-17 07:29:44 -06:00
Georgi Gerganov
3340c5d564 llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
2023-08-17 10:47:09 +03:00
drbh
747ac45a45 tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0 terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu
783bffe0ba ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
2023-08-17 10:35:53 +03:00
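A minimal sketch of the best-fit policy described above (illustrative names, not the actual ggml-alloc internals):

    #include <stddef.h>

    struct free_block { size_t offset, size; };

    /* Best fit: pick the smallest free block that can hold `size`.
       During measurement the last block is effectively unbounded, so it
       loses against every finite block that fits; during evaluation it is
       finite and can win, which is the discrepancy this commit removes. */
    static int best_fit(const struct free_block *blocks, int n, size_t size) {
        int    best      = -1;
        size_t best_size = (size_t) -1;
        for (int i = 0; i < n; i++) {
            if (blocks[i].size >= size && blocks[i].size < best_size) {
                best      = i;
                best_size = blocks[i].size;
            }
        }
        return best;
    }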
Kolen Cheung
33a7556b2e cmake : install ggml-meta.metal if LLAMA_METAL (#2449) 2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
f5f80b4078 metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
Shouzheng Liu
3e347c09df metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16 23:08:28 +03:00
Shouzheng Liu
d47e62c9da metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. It also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integer division is slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
2023-08-16 23:07:04 +03:00
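An illustrative sketch of the 32-bit divide trick from the GQA fix above (hypothetical helper; the actual kernel code differs):

    #include <stdint.h>

    /* A 32-bit divide selects the matrix and the row within it; only the
       final offset arithmetic is 64-bit. This caps a single matrix at
       ~4 GB, the limitation noted in the commit message. */
    static uint64_t src_offset(uint32_t row, uint32_t rows_per_mat,
                               uint32_t row_stride, uint64_t mat_stride) {
        uint32_t mat   = row / rows_per_mat;       /* cheap 32-bit divide  */
        uint32_t local = row - mat * rows_per_mat; /* remainder, no divide */
        return (uint64_t) mat * mat_stride + (uint64_t) local * row_stride;
    }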
Georgi Gerganov
7c50c2c95a scripts : add helper script to get wikitext 2023-08-15 10:05:25 +03:00
Jhen-Jie Hong
87be3696f6 server : add missing /json-schema-to-grammar.mjs (#2616)
fixes #2611
2023-08-15 06:14:14 +08:00
Jhen-Jie Hong
1aa7b274c7 metal : return null instead of exit(1) (#2573) 2023-08-14 16:37:39 +03:00
Cheng Shao
bb04b1c694 server : add --numa support (#2524) 2023-08-14 16:36:42 +03:00
Kamil Tomšík
f638f57333 llama : add missing enum keyword in function signatures (#2610) 2023-08-14 16:35:16 +03:00
Johannes Gäßler
677c32921e CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) 2023-08-14 10:41:22 +02:00
Jhen-Jie Hong
16aaeef5b3 server : fix default grammar by using an empty string in the UI (#2604) 2023-08-14 16:20:17 +08:00
Jhen-Jie Hong
4fce312113 server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)
* server : implement json-schema-to-grammar.mjs by following the Python impl

* server : add grammar support in chat.mjs

* server : implement grammar param in the UI

* server : generate .hpp

* server : remove trailing whitespaces

* server : generate .hpp

* server : fix sort of prop pairs

* server : optimize regex & iteration
2023-08-14 15:16:54 +08:00
vxiiduu
67cd356f19 Enhance Windows 7 and below compatibility. (#2592)
* Enhance Windows 7 compatibility.
* Clean away unnecessary preprocessor conditional
2023-08-13 20:59:16 -07:00
drbh
56208c9293 test : add simple grammar parsing tests (#2594)
* adds simple grammar parsing tests

* adds cassert header
2023-08-13 17:00:48 +03:00
Johannes Gäßler
e1305c82c3 CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) 2023-08-13 00:24:45 +02:00
byte-6174
bb9ebb4394 Adding support for llama2.c models (#2559) 2023-08-12 01:17:25 +02:00
Equim
ae2377983b server: fixed wrong variable name in timing json (#2579)
* server: fixed wrong variable name in timing json

* remove redundant entry
2023-08-12 00:35:14 +02:00
DannyDaemonic
17defbc44e Handle ENABLE_VIRTUAL_TERMINAL_PROCESSING more gracefully on earlier versions of Windows. 2023-08-10 13:11:36 -07:00
Christian Demsar
c366b01720 Add --n-predict -2 for stopping generation on full context (#2565) 2023-08-10 16:28:27 +02:00
Martin Krasser
17d5b4bfbe Fix grammar-based sampling issue in server (#2566) 2023-08-10 13:16:38 +03:00
Sam Spilsbury
186283cb2d ggml-alloc: Don't try to re-use buffers of external tensors (#2562)
* ggml-alloc: Don't try to re-use buffers of external tensors

They might be weights that came from another context, so we
have no control over them (and they might be re-used elsewhere
so writing to them would be a bad idea).

* ggml-alloc: >= when checking for out-of-bounds

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-08-09 22:47:42 +02:00
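A rough sketch of the ownership check implied by the two fixes above (hypothetical names; the real ggml-alloc code differs):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The allocator only recycles memory it owns; external weights from
       another context fall outside [base, base + size) and are skipped. */
    static bool tensor_in_buffer(const uint8_t *base, size_t size,
                                 const uint8_t *data) {
        if (data < base) return false;
        /* >= rather than >: base + size is one past the end of the buffer,
           so a pointer exactly at the end is already out of bounds. */
        if (data >= base + size) return false;
        return true;
    }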
grahameth
70b6d930bd add log_callback to llama_context_params for custom logging. (#2234)
* add log_callback to llama_context_params for custom logging.

* Fix macro expansion on gcc

* Add struct llama_state for global variables and move log_callback there

* Turn log level into enum and some minor changes.

* Remove model_for_logging parameter (not needed anymore)

* Convert remaining fprintf(stderr, ...) calls to use new macros.

* Fix enum and initialize g_state

* Fix log calls after merge

* Fix missing static

* Add back all the new lines in the logging strings

* Add comment for llama_log_callback and replace remaining printf calls

---------

Co-authored-by: grahameth <->
Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
2023-08-09 22:46:40 +02:00
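A minimal sketch of the callback-based logging pattern this commit describes; the names, enum values, and signatures here are illustrative, not the exact llama.cpp API:

    #include <stddef.h>
    #include <stdio.h>

    enum log_level { LOG_LEVEL_INFO, LOG_LEVEL_WARN, LOG_LEVEL_ERROR };

    typedef void (*log_callback)(enum log_level level, const char *text,
                                 void *user_data);

    /* Default sink keeps the old behavior: write to stderr, exactly where
       the fprintf(stderr, ...) calls went before the refactor. */
    static void log_default(enum log_level level, const char *text,
                            void *user_data) {
        (void) level; (void) user_data;
        fputs(text, stderr);
    }

    /* Global state holding the callback, mirroring the llama_state idea. */
    static struct { log_callback cb; void *user_data; } g_log = { log_default, NULL };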
Johannes Gäßler
ade4f55531 CUDA: tuned mul_mat_q kernels (#2546) 2023-08-09 09:42:34 +02:00
Martin Krasser
220649a37f Allow passing grammar to completion endpoint (#2532)
2023-08-08 16:29:19 +03:00
Johannes Gäßler
fe11463cef CUDA: tighter VRAM scratch size for 65b/70b (#2551) 2023-08-08 14:38:16 +02:00
chaihahaha
932dac4a50 llm.vim : multiline autocompletion, get rid of "^@" (#2543) 2023-08-08 15:07:02 +03:00
Georgi Gerganov
74f72ea675 vim : bring back simple llm.vim example 2023-08-08 15:06:18 +03:00
AustinMroz
d517b2f07c vim : streaming and more (#2495)
* Update Vim plugin

* Remove getbufoneline usage, add input bind example.

getbufoneline() appears to be a recently added function and has been
replaced with getbufline for compatibility.

An additional example was added showing how to add a keybind that works
in insert mode.
2023-08-08 14:44:48 +03:00
klosax
fb17afa861 Add --rope-scale parameter (#2544)
* common.cpp : Add --rope-scale parameter
* README.md : Add info about using linear rope scaling
2023-08-07 19:07:19 +02:00
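For context, linear RoPE scaling divides the token position by the scale factor before the rotary angle is computed, letting a model address more positions than it was trained on; a sketch under that assumption (hypothetical helper, not the actual ggml code):

    #include <math.h>

    /* With --rope-scale S, position p behaves like p / S, so a model
       trained on 4096 positions can address roughly S * 4096 tokens. */
    static float rope_theta(int pos, int pair, int n_dims,
                            float freq_base, float rope_scale) {
        float p = (float) pos / rope_scale;
        return p * powf(freq_base, -2.0f * (float) pair / (float) n_dims);
    }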
Georgi Gerganov
0248ddcded ggml : mul mat tweaks (#2372)
* ggml : mul mat wip

ggml-ci

* ggml : alternative thread distribution for mul_mat

ggml-ci

* ggml : mul_mat block tiling attempt

* ggml : mul_mat threads yield

ggml-ci
2023-08-07 14:25:58 +03:00
Georgi Gerganov
0208ddf702 ggml : pad result of ggml_nbytes() 2023-08-07 14:24:42 +03:00
Georgi Gerganov
d922c47542 ggml : change params pointer (style change) (#2539)
ggml-ci
2023-08-07 13:55:18 +03:00
Georgi Gerganov
f3ee043e24 ggml : sync (custom ops) (#2537)
ggml-ci
2023-08-07 13:20:09 +03:00
Johannes Gäßler
19cf204509 Fixed mmap prefetch for GPU offloading (#2529) 2023-08-07 10:09:40 +02:00
Georgi Gerganov
0637bae56a metal : fix out-of-bounds access + inc concurrency nodes (#2416)
* metal : fix out-of-bounds access + style changes

* metal : increase concurrency nodes to 2*GGML_MAX_NODES
2023-08-07 10:52:57 +03:00
GiviMAD
c39e3477fd [Makefile] Move ARM CFLAGS before compilation (#2536) 2023-08-07 09:21:46 +03:00
Henri Vasserman
8cc4ff2f1a [Zig] Rewrite build for Zig 0.11 (#2514)
* zig build fixes

* Disable LTO on Windows.
2023-08-07 08:35:53 +03:00
DannyDaemonic
28fe11b65e console : fix issue related to Windows 11 PowerShell console mode persistence (#2521) 2023-08-06 09:49:34 +03:00
Keiichi Tabata
49fe9844e7 convert.py : add missing abstract methods for quantized data (#2491) 2023-08-06 09:34:05 +03:00
Johannes Gäßler
d5060f4590 CUDA: faster k-quant mul_mat_q kernels (#2525) 2023-08-05 18:20:44 +02:00
Jonas Wunderlich
e45dbc8951 fix firefox autoscroll (#2519) 2023-08-04 22:16:11 +02:00
Cebtenzzre
6ae47d1968 server: regenerate completion.js.hpp (#2515) 2023-08-04 21:00:57 +02:00
Cebtenzzre
bcf5f2e1d8 CUDA: use min compute capability of GPUs actually used (#2506) 2023-08-04 17:35:22 +02:00