Commit Graph

2458 Commits

Author SHA1 Message Date
Felix
41eb9f687e clip : fix memory leak (#6138) 2024-03-18 17:40:22 +02:00
slaren
d06ac1d0c1 backend : set max split inputs to GGML_MAX_SRC (#6137) 2024-03-18 16:33:44 +01:00
Georgi Gerganov
c31ea012cc ci : disable stale issue messages (#6126) 2024-03-18 13:45:38 +02:00
Georgi Gerganov
440d6a226c ci : temporary disable sanitizer builds (#6128) 2024-03-18 13:45:27 +02:00
slaren
d7d3ffdecf backend : offload large batches to GPU (#6083)
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs (see the sketch after this entry)
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-03-18 11:03:04 +01:00
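The split rule above (together with the GGML_MAX_SRC limit set in commit d06ac1d0c1 above) can be illustrated with a minimal Python sketch; MAX_SPLIT_INPUTS and inputs_of are illustrative stand-ins, not the actual ggml-backend scheduler:

    # Hedged sketch: close the current split when adding a node would push it
    # past the input limit (GGML_MAX_SRC in the real code).
    MAX_SPLIT_INPUTS = 4  # illustrative value

    def plan_splits(nodes, inputs_of):
        splits, current, seen_inputs = [], [], set()
        for node in nodes:
            candidate = seen_inputs | set(inputs_of(node))
            if current and len(candidate) > MAX_SPLIT_INPUTS:
                splits.append(current)            # too many inputs: new split
                current = []
                candidate = set(inputs_of(node))
            current.append(node)
            seen_inputs = candidate
        if current:
            splits.append(current)
        return splits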
DAN™
4c689ebd66 common : tidy-up argument parsing (#6105)
* Tidy-up argument parsing.

* Missing ref.

* common : minor

* common : add static classifier

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-18 10:27:44 +02:00
Thérence
3dd6805dcf convert : add support for CamembertModel architecture (#6119)
Adds support for the CamembertModel architecture used by:
https://huggingface.co/dangvantuan/sentence-camembert-large
2024-03-18 10:17:00 +02:00
Romain D
8295e15e30 convert : use f32 outtype for bf16 tensors (#6106)
The old behaviour was to use f16, but bf16 to f16 is not a lossless conversion.
Changing the outtype to f32 makes the default conversion lossless.
2024-03-18 10:04:41 +02:00
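Why the bf16 -> f16 path is lossy: bf16 keeps f32's 8 exponent bits while f16 has only 5, so large-magnitude bf16 values overflow in f16. A minimal numpy demonstration (illustrative, not part of the convert script):

    # bf16 is the top 16 bits of an IEEE-754 f32, so it has f32's full exponent
    # range; f16 caps out near 65504 and overflows to inf beyond that.
    import numpy as np

    def bf16_bits_to_f32(bits):
        return np.frombuffer(np.uint32(bits << 16).tobytes(), dtype=np.float32)[0]

    x = bf16_bits_to_f32(0x7E80)   # a valid bf16 value, about 8.5e37
    print(np.float32(x))           # 8.507059e+37 -- preserved exactly in f32
    print(np.float16(x))           # inf          -- lost in f16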
Pierrick Hymbert
cd0c187f9a common: llama_load_model_from_url using --model-url (#6098)
* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-17 19:12:37 +01:00
Georgi Gerganov
1816cecf00 ci : close all stale issues at once (#6115) 2024-03-17 18:51:57 +01:00
GainLee
9a6ac8119a ggml : fix error finding the transfer queue family index (#6094)
Co-authored-by: GainLee <ligen@meizu.com>
2024-03-17 18:12:22 +01:00
AmirAli Mirian
be9deffbe1 ggml : add AVX512F SIMD (#6088) 2024-03-16 17:52:02 +02:00
Daniel Bevenius
ad084cb949 gritlm : add initial README.md (#6086)
* gritlm: add initial README.md to examples/gritlm

This commit adds a suggestion for an initial README.md for the gritlm
example.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! gritlm: add initial README.md to examples/gritlm

Use the `scripts/hf.sh` script to download the model file.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! gritlm: add initial README.md to examples/gritlm

Fix editorconfig-checker error in examples/gritlm/README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-03-16 17:46:29 +02:00
Xuan Son Nguyen
077033ecff readme : add wllama as a wasm binding (#6100) 2024-03-16 17:42:08 +02:00
DAN™
3e799df80a common : refactor nested if causing error C1061 on MSVC (#6101)
* Refactor nested if causing error C1061 on MSVC.

* Revert and remove the else branches.

* Add flag to track found arguments.
2024-03-16 17:39:15 +02:00
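C1061 is MSVC's "blocks nested too deeply" limit, which a long else-if chain can trip. A hedged Python sketch of the flag pattern the last bullet describes (illustrative, not the actual common.cpp code):

    # Each check stands alone instead of chaining with else-if; a flag records
    # whether any branch matched, replacing the deeply nested structure.
    def parse_arg(arg):
        found = False
        if arg == "--help":
            found = True
            # ... handle help ...
        if arg == "--version":
            found = True
            # ... handle version ...
        if not found:
            raise ValueError(f"unknown argument: {arg}")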
Pierrick Hymbert
2226be39d7 ci : close inactive issue with workflow (#6053)
* issues: ci - close inactive issue with workflow

* ci: close issue, change workflow schedule time
2024-03-16 14:20:53 +02:00
slaren
eb9ea6d425 llama : fix Baichuan2 13B (#6092) 2024-03-15 23:14:16 +02:00
Theia Vogel
0b21f1b9bc llama : add support for control vectors (#5970)
* control vector api and implementation

* control-vectors : minor code style updates

* disable control vector when data == nullptr

use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-15 22:43:02 +02:00
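The idea behind a control vector: add a learned direction to the hidden state of selected layers at inference time. A minimal numpy sketch of the concept (names are illustrative, not the llama.cpp C API):

    import numpy as np

    def apply_control_vector(hidden, direction, strength, layer, lo, hi):
        # lo/hi bound the layer range the vector applies to; per the commit,
        # -1 marks a disabled range (keeping layer 0 free for embeddings).
        if lo <= layer <= hi:
            return hidden + strength * direction
        return hidden

    h = np.zeros(4)
    v = np.array([1.0, 0.0, 0.0, 0.0])
    print(apply_control_vector(h, v, strength=0.8, layer=10, lo=5, hi=20))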
Andrew Canis
b13c2a285a llama : add Command-R support (#6033)
Information about the Command-R 35B model (128k context) can be found at:
	https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyper parameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel (see the sketch after this entry).
   There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
	https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
	git lfs install
	git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
	python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
2024-03-15 22:41:22 +02:00
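A minimal numpy sketch of the layer layout in points 2-3 (attn and ffn are placeholder callables standing in for the real sublayers; the learnable LayerNorm scale/bias parameters are omitted for brevity):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu  = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)   # LayerNorm, not RMSNorm

    def command_r_block(x, attn, ffn):
        h = layer_norm(x)            # one shared LayerNorm feeds both branches
        return x + attn(h) + ffn(h)  # parallel attention + FFN, no post-attn norm

    x = np.random.randn(2, 4)
    y = command_r_block(x, attn=lambda h: 0.1 * h, ffn=lambda h: 0.2 * h)

    # At the output head, logits are scaled by logit_scale (point 1):
    #   logits = logit_scale * (h_final @ W_output)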
Ting Lou
4e3f9788ba llava : change API to pure C style for Rust FFI bindgen (#6079)
Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>
2024-03-15 16:31:05 +02:00
slaren
1c80ddb229 cuda : disable unused cudaLaunchHostFunc code (#6078) 2024-03-15 14:24:03 +02:00
Neo Zhang Jianyu
58d73e4f6f fix error when setting the main GPU (#6073) 2024-03-15 18:53:53 +08:00
Georgi Gerganov
0dba2910a1 make : ggml-metal.o depends on ggml.h 2024-03-15 11:38:40 +02:00
AidanBeltonS
16e7faad23 [SYCL] Fix non-intel device selection (#6042)
* Fix non-intel device selection

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-03-15 14:56:20 +05:30
Ondřej Čertík
9c4886ee7e gguf : add support for I64 and F64 arrays (#6062)
* gguf : add support for I64 and F64 arrays

GGML currently does not support I64 or F64 arrays, and they are rarely used
in machine learning. However, adding them now keeps these types next to the
other integer types (I8, I16, I32) in the enums and reserves their type
numbers in case the need arises in the future.

Furthermore, with this addition the GGUF format becomes very usable for
most computational applications of NumPy (being compatible with the most
common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster,
and more versatile alternative to the `npz` format, and a simpler
alternative to the `hdf5` format.

The change in this PR seems small, not significantly increasing the
maintenance burden. I tested this from Python using GGUFWriter/Reader
and `gguf-dump`, as well as from C, everything seems to work.

* Fix compiler warnings
2024-03-15 10:46:51 +02:00
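A rough usage sketch of what this enables from Python (hedged: these gguf-py calls follow the API of this era and may differ in later versions):

    import numpy as np
    import gguf

    w = gguf.GGUFWriter("data.gguf", "example")
    w.add_tensor("timestamps", np.arange(8, dtype=np.int64))             # I64
    w.add_tensor("samples", np.linspace(0.0, 1.0, 8, dtype=np.float64))  # F64
    w.write_header_to_file()
    w.write_kv_data_to_file()
    w.write_tensors_to_file()
    w.close()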
Xuan Son Nguyen
e67a2e3d40 llama : add Orion chat template (#6066) 2024-03-15 10:44:57 +02:00
slaren
878ec8fef5 llama-bench : use random tokens to improve accuracy with mixtral (#6069) 2024-03-15 10:22:24 +02:00
Georgi Gerganov
7c4cffe96d llama : fix integer overflow during quantization (#6063) 2024-03-14 22:58:41 +02:00
Steve Grubb
e290985462 gguf : fix resource leaks (#6061)
There are several places where a gguf context is allocated, but a call to
gguf_free is missing in some error paths. Also, on Linux, llama-bench was
missing an fclose.
2024-03-14 20:29:32 +02:00
Ondřej Čertík
4dd654dc82 gguf-py : bump version to 0.8.0 (#6060) 2024-03-14 19:57:31 +02:00
Michael Podvitskiy
a88d7966b5 llama : support models without vocabulary (#5798)
* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
2024-03-14 18:21:56 +02:00
Georgi Gerganov
af8deb1449 embedding : add EOS token if not present (#899) 2024-03-14 15:14:14 +02:00
Georgi Gerganov
0a538d4614 gguf-py : fix dtype check (#6045) 2024-03-14 13:32:14 +02:00
Jian Liao
01cd238406 readme : improve readme for Llava-1.6 example (#6044)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14 13:18:23 +02:00
Pierrick Hymbert
0c0c0276af server: disable debug release type sanitizer, simplify trigger (#6047)
- increase timeout for server
- do not fail fast
2024-03-14 13:15:39 +02:00
Georgi Gerganov
c4253c1ef9 llama : fix typo 2024-03-14 13:13:06 +02:00
Michael Podvitskiy
4390300db7 llama : optimize defrag moves + fix fragmentation calculation (#6037)
* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14 12:56:48 +02:00
Ondřej Čertík
cb903ba055 gguf-py : add support for I8, I16 and I32 (#6045)
* Refactor dtype handling to be extensible

This code is equivalent as before, but now it is prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
2024-03-14 12:40:14 +02:00
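A matching read-side sketch for the new integer dtypes (again hedged against the gguf-py API of this era):

    from gguf import GGUFReader

    r = GGUFReader("data.gguf")
    for t in r.tensors:
        # each of I8/I16/I32 maps onto the corresponding NumPy dtype
        print(t.name, t.tensor_type, t.data.dtype)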
Georgi Gerganov
fdef39cfd9 ggml : designate enum vals for integer types (#6050) 2024-03-14 12:38:37 +02:00
Georgi Gerganov
286e9cb050 embedding : print all resulting embeddings (#899) 2024-03-14 12:37:20 +02:00
Georgi Gerganov
faa83237eb metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Georgi Gerganov
0d197c3a0d embedding : print cosine similarity (#899) 2024-03-14 10:12:29 +02:00
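The similarity printed is the standard normalized dot product between two embedding vectors; a numpy equivalent for reference (illustrative, not the example's C++ code):

    import numpy as np

    def cosine_sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))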
Linwei Wang
d2694e37ea readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
847ed47b30 readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Clint Herron
4a6d25766b grammar : handle missing "root" node (#6004) 2024-03-13 20:10:40 +02:00
slaren
f88d2005a4 llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
increase default to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
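A hedged sketch of the n_batch / n_ubatch relationship this commit introduces: a logical batch of up to n_batch tokens is evaluated in micro-batches of at most n_ubatch tokens, which gives multiple GPUs overlapping work (illustrative Python, not the llama.cpp scheduler; 512 is assumed as the default n_ubatch):

    def split_into_ubatches(tokens, n_ubatch):
        for i in range(0, len(tokens), n_ubatch):
            yield tokens[i:i + n_ubatch]

    n_batch, n_ubatch = 2048, 512   # 2048 per the last bullet; 512 assumed
    batch = list(range(n_batch))
    print([len(u) for u in split_into_ubatches(batch, n_ubatch)])
    # -> [512, 512, 512, 512]: four micro-batches can overlap across GPUs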
slaren
dc1bd94e29 test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
AidanBeltonS
714c607f32 Update get version (#6025) 2024-03-13 18:47:54 +05:30
Xuan Son Nguyen
2534910086 Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
slaren
f91923eccf ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00