ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-24 00:19:19 +00:00

Author	SHA1	Message	Date
Ed Lepedus	a57ef1110e	server: add cURL support to server Dockerfiles (#6474 ) * server: add cURL support to `full.Dockerfile` * server: add cURL support to `full-cuda.Dockerfile` and `server-cuda.Dockerfile` * server: add cURL support to `full-rocm.Dockerfile` and `server-rocm.Dockerfile` * server: add cURL support to `server-intel.Dockerfile` * server: add cURL support to `server-vulkan.Dockerfile` * fix typo in `server-vulkan.Dockerfile` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-04 18:31:22 +02:00
Minsoo Cheong	016aa58b11	ci: exempt master branch workflows from getting cancelled (#6486 ) * ci: exempt master branch workflows from getting cancelled * apply to bench.yml	2024-04-04 18:30:53 +02:00
Ewout ter Hoeven	7d32d7d775	build CI: Name artifacts (#6482 ) Name the artifacts in the build CI, so that they get uploaded with separate names, instead of all put into the same `artifact` ZIP. It might be possible to further simplify the packing step (in future PRs).	2024-04-04 17:08:55 +02:00
Shakhar Dasgupta	deefac27f2	server: allow penalizing repetition of newlines on server webpage (#6431 )	2024-04-04 17:03:00 +02:00
Pierrick Hymbert	fd66566ee1	ci: bench fix concurrency for workflow trigger dispatch with sha1 (#6478 )	2024-04-04 16:59:04 +02:00
limitedAtonement	997c0854b4	Correct README link (#6458 ) README is called README.md.	2024-04-04 16:30:02 +02:00
Pierrick Hymbert	4cf330fb1e	ci: bench: add more ftype, fix triggers and bot comment (#6466 ) * ci: bench: change trigger path to not spawn on each PR * ci: bench: add more file type for phi-2: q8_0 and f16. - do not show the comment by default * ci: bench: add seed parameter in k6 script * ci: bench: artefact name perf job * Add iteration in the commit status, reduce again the autocomment * ci: bench: add per slot metric in the commit status * Fix trailing spaces	2024-04-04 12:57:58 +03:00
Daniel Bevenius	6d074393e8	common: remove duplicate check for curl (#6471 ) This commit removes one of the two identical checks for curl being NULL in llama_load_model_from_url. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-04-04 09:49:21 +02:00
Clint Herron	9179276b55	examples : add GBNF validator program (#5948 ) * Revising GBNF validator program to be much simpler. * Changing from streams to using cstdio * Adding final newline character.	2024-04-04 10:44:28 +03:00
Georgi Gerganov	af0871e8a7	server : remove obsolete --memory-f32 option	2024-04-04 09:34:58 +03:00
Xiao-Yong Jin	abdfc39ec8	server : add option to disable KV offload (#6468 )	2024-04-04 09:33:48 +03:00
Clint Herron	d7a430215b	convert : fix for lint error complaining of bare except (#6470 )	2024-04-04 09:32:53 +03:00
Fattire	7ee41c6e35	A few small fixes to server's README docs (#6428 ) * Typo fix to server's README.md Fix minor typo ("tonen") in server README. * server readme grammar/style fixes. Quickly went through this file to look for inconsistencies in presentation of defaults, flag options, and looked for typos and grammar issues. Not perfect, but hopefully improved. * Update README.md Remove an extra space before newline.	2024-04-03 22:22:57 +02:00
JH23X	0f4e3af782	server : handle exception on wrong type in request (#6452 ) Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>	2024-04-03 21:09:52 +03:00
bryanSwk	919a8a7a7c	llama : add SEA-LION support (#6448 ) * initial commit for sealion support * add sealion support * minor fix * q/k ln and pos_embd only if required * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : clear whitespaces --------- Co-authored-by: bryan <bryansiow@aisingapore.org> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-03 21:05:10 +03:00
Ewout ter Hoeven	e9cd978f7e	ci : update checkout, setup-python and upload-artifact to latest (#6456 ) * CI: Update actions/checkout to v4 * CI: Update actions/setup-python to v5 * CI: Update actions/upload-artifact to v4	2024-04-03 21:01:13 +03:00
Ed Lepedus	be830a3c64	server: add cURL support to `server.Dockerfile` (#6461 )	2024-04-03 19:56:37 +02:00
Francisco Melo	840675869e	readme : add feature-rich rust bindings (#6465 )	2024-04-03 20:53:37 +03:00
Joyce	d504285ccb	security : create policy (#6354 ) * Create SECURITY.md Signed-off-by: Joyce <joycebrum@google.com> * Fix: link on SECURITY.md Signed-off-by: Joyce <joycebrum@google.com> * Fix: link on SECURITY.md Signed-off-by: Joyce <joycebrum@google.com> * minor * fix * fix --------- Signed-off-by: Joyce <joycebrum@google.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-03 20:48:07 +03:00
Abhishek Gopinath K	20b9b65e75	Missing tokenizer.model error during gguf conversion (#6443 ) Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-04-03 11:42:52 -04:00
kaizau	17d75e9340	Add OpenChat, Alpaca, Vicuna chat templates (#6397 ) * Add openchat chat template * Add chat template test for openchat * Add chat template for vicuna * Add chat template for orca-vicuna * Add EOS for vicuna templates * Combine vicuna chat templates * Add tests for openchat and vicuna chat templates * Add chat template for alpaca * Add separate template name for vicuna-orca * Remove alpaca, match deepseek with jinja output * Regenerate chat template test with add_generation_prompt * Separate deepseek bos from system message * Match openchat template with jinja output * Remove BOS token from templates, unprefix openchat	2024-04-03 17:24:31 +02:00
Georgi Gerganov	4fd21ec901	readme : update hot topics	2024-04-03 16:11:15 +03:00
slaren	5d3839837b	ggml : mul_mat_id use the same tensor for all the experts (#6387 ) * ggml : update mul_mat_id to use the same tensor for all the experts * update cuda * minor * update metal * update test-backend-ops * fix cuda * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update convert.py * update convert-hf-to-gguf.py * update convert.py for mixtral hf models * Update convert-hf-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * cuda : support non-pow-2 number of experts * allow quantize to work for split and merged experts models in the same way * cleanup + disable mmap automatically with split tensors models * update imatrix * test-backend-ops : test qwen argsort * update grok model loading * llama : add merged experts tensors to the grok tensor map * minor * gguf : bump version * fix quantizing of merged experts * convert-hf-to-gguf.py : update grok (untested) * make linter happy * cuda/argsort : use shared memory instead of pool memory * convert : fix grok tensor names * metal : add support for non-pow-2 argsort * llama : more loader cleanup, better error checking * cuda : fix warning * llama : still use mmap for loading old models, but copy the data to a host buffer * add review note * llama : remove ffn tensor counting + add sanity check ggml-ci * convert : fix handling of n_experts == None ggml-ci * imatrix : fix ncall counters * llama : produce error if imatrix size does not match * quantize : terminate on errors + trace logs ggml-ci * metal : pad shared memory to 16 bytes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-03 16:07:05 +03:00
Meng, Hengyu	d29450b4ca	[SYCL] Disable iqx on windows as WA (#6435 ) * disable iqx on windows as WA * array instead of global_memory	2024-04-03 10:34:40 +08:00
Georgi Gerganov	fd20ccef2e	flake.lock: Update (#6402 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23) → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-04-01 09:05:57 -07:00
Johannes Gäßler	5a5d9cbbe2	compare-llama-bench.py: fix long hexsha args (#6424 )	2024-04-01 13:30:43 +02:00
Pierrick Hymbert	5f580568dd	ci: server: verify deps are coherent with the commit (#6409 ) * ci: server: verify deps are coherent with the commit * ci: server: change the ref to build as now it's a pull event target	2024-04-01 12:36:40 +02:00
Georgi Gerganov	47038dcea2	readme : update hot topics	2024-03-31 11:56:30 +03:00
Pierrick Hymbert	f0be4bd555	ci: bench: fix Resource not accessible by integration on PR event (#6393 )	2024-03-30 12:36:07 +02:00
Mohammadreza Hendiani	8dbe6f877d	Fedora build update (#6388 ) * fixed deprecated address * fixed deprecated address * fixed deprecated address * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * reverted back to only the MIT license	2024-03-29 22:59:56 +01:00
Xuan Son Nguyen	75b580db0a	split: allow --split-max-size option (#6343 ) * split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size	2024-03-29 22:34:44 +01:00
0cc4m	134e314654	Vulkan k-quant mmq and ggml-backend offload functionality (#6155 ) * Fix Vulkan no kv offload incoherence * Add k-quant mul mat mat shaders * Rework working buffer allocation, reduces vram use noticeably Clean up cpu assist code, replaced with ggml-backend offload function * Default to all dedicated GPUs * Add fallback for integrated GPUs if no dedicated GPUs are found * Add debug info which device is allocating memory * Fix Intel dequant issue Fix validation issue * Fix Vulkan GGML_OP_GET_ROWS implementation * Clean up merge artifacts * Remove Vulkan warning	2024-03-29 17:29:21 +01:00
Georgi Gerganov	537fc022b8	sync : ggml (#6351 ) * sync : ggml ggml-ci * cuda : move GGML_CUDA_DMMV constants to dmmv.cuh --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-29 17:45:46 +02:00
hxer7963	7861298830	[Model] Add support for xverse (#6301 ) * Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>	2024-03-29 14:37:03 +01:00
Georgi Gerganov	71ba2b4748	ci : fix BGE wget (#6383 ) ggml-ci	2024-03-29 14:34:28 +02:00
zhouwg	2f09ee47af	readme : add project (#6356 ) * readme: add Android UI binding * Update README.md	2024-03-29 09:33:46 +02:00
Matt Clayton	8ae35eee39	cmake : add explicit metal version options (#6370 ) * cmake: add explicit metal version options * Update CMakeLists.txt --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-29 09:27:42 +02:00
Daniel Bevenius	87ef849926	llama : remove redundant reshape in build_kv_store (#6369 ) * llama: remove redundant reshape in build_kv_store This commit removes the reshape of the V matrix in the build_kv_store. The motivation for this is that V matrix has the shape: ```console (gdb) p v_cur $46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = { 0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0, name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"} ``` And after reshaping this tensor we get: ```console gdb) p ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens) $44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_RESHAPE, op_params = { 0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0, name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"} ``` I noticed that the `src` and `view_src` fields are different but that the dimensions are the same. From the code comment it seems like the reshape call is not needed and perhaps the above can motivate the removal of the reshape call. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llama : add assert --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-29 09:23:22 +02:00
Pedro Cuenca	88b3d00023	convert : allow conversion of Mistral HF models (#6144 ) * Allow conversion of Mistral HF models * Homogenize Llama, Mistral, Mixtral under the same entry. * Fix tokenizer, permute tensors * Use sentencepiece tokenizer, or fall back to hfft. * convert-hf : small fix for mypy * convert-hf : fix duplicated block_count * convert-hf : add vocab size to metadata --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-03-29 09:15:00 +02:00
Georgi Gerganov	18c218bbc8	readme : add notice for UI list	2024-03-28 22:56:03 +02:00
Ouadie EL FAROUKI	2d5c7313cf	[SYCL] Revisited & updated SYCL build documentation (#6141 ) * Revisited & updated SYCL build documentation * removed outdated comment * Addressed PR comments * Trimed white spaces * added new end line	2024-03-28 16:01:47 +00:00
Jared Van Bortel	df064a721d	convert : refactor vocab selection logic (#6355 )	2024-03-28 11:44:36 -04:00
Ziang Wu	dbf509a459	llava : fix MobileVLM (#6364 ) * fix empty bug * Update MobileVLM-README.md added more results on devices * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update MobileVLM-README.md remove gguf links --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-28 16:33:10 +02:00
compilade	af949dc9d7	llama : fix command-r inference when omitting outputs (#6367 )	2024-03-28 14:05:54 +02:00
Pierrick Hymbert	e685e83d37	ci: bench: fix master not schedule, fix commit status failed on external repo (#6365 )	2024-03-28 11:27:56 +01:00
Ting Sun	49c535c478	doc: fix outdated default value of batch size (#6336 ) * doc: fix outdated default value of batch size * doc: add doc for ubatch-size	2024-03-28 09:51:06 +01:00
Eric Zhang	7d68981aa1	server : stop gracefully on SIGTERM (#6348 )	2024-03-28 09:50:48 +01:00
hutli	c31404122b	nix: removed unnessesary indentation	2024-03-28 07:48:27 +00:00
hutli	307f7a2c76	nix: moved blas availability check to package inputs so it is still overridable	2024-03-28 07:48:27 +00:00
hutli	1adb187009	using blas.meta.available to check host platform	2024-03-28 07:48:27 +00:00

2609 Commits