ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-22 07:29:23 +00:00

Author	SHA1	Message	Date
crasm	b5ad7196d9	python : add check-requirements.sh and GitHub workflow (#4585 ) * python: add check-requirements.sh and GitHub workflow This script and workflow forces package versions to remain compatible across all convert.py scripts, while allowing secondary convert scripts to import dependencies not wanted in convert.py. Move requirements into ./requirements * Fail on "==" being used for package requirements (but can be suppressed) * Enforce "compatible release" syntax instead of == * Update workflow * Add upper version bound for transformers and protobuf * improve check-requirements.sh * small syntax change * don't remove venvs if nocleanup is passed * See if this fixes docker workflow * Move check-requirements.sh into ./scripts/ --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2023-12-29 16:50:29 +02:00
Philip Taron	97afbf4d29	flake.nix : rewrite (#4605 ) * flake.lock: update to hotfix CUDA::cuda_driver Required to support https://github.com/ggerganov/llama.cpp/pull/4606 * flake.nix: rewrite 1. Split into separate files per output. 2. Added overlays, so that this flake can be integrated into others. The names in the overlay are `llama-cpp`, `llama-cpp-opencl`, `llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs). 3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/) rather than `with pkgs;` so that there's dependency injection rather than dependency lookup. 4. Add a description and meta information for each package. The description includes a bit about what's trying to accelerate each one. 5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge. 6. Format with `serokell/nixfmt` for a consistent style. 7. Update `flake.lock` with the latest goods. 
* flake.nix: use finalPackage instead of passing it manually * nix: unclutter darwin support * nix: pass most darwin frameworks unconditionally ...for simplicity * .nix: nixfmt nix shell github:piegamesde/nixfmt/rfc101-style --command \ nixfmt flake.nix .devops/nix/.nix * flake.nix: add maintainers * nix: move meta down to follow Nixpkgs style more closely * nix: add missing meta attributes nix: clarify the interpretation of meta.maintainers nix: clarify the meaning of "broken" and "badPlatforms" nix: passthru: expose the use* flags for inspection E.g.: ``` ❯ nix eval .#cuda.useCuda true ``` * flake.nix: avoid re-evaluating nixpkgs too many times * flake.nix: use flake-parts * nix: migrate to pname+version * flake.nix: overlay: expose both the namespace and the default attribute * ci: add the (Nix) flakestry workflow * nix: cmakeFlags: explicit OFF bools * nix: cuda: reduce runtime closure * nix: fewer rebuilds * nix: respect config.cudaCapabilities * nix: add the impure driver's location to the DT_RUNPATHs * nix: clean sources more thoroughly ...this way outPaths change less frequently, and so there are fewer rebuilds * nix: explicit mpi support * nix: explicit jetson support * flake.nix: darwin: only expose the default --------- Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>	2023-12-29 16:42:26 +02:00
Cuong Trinh Manh	5f8aa28f03	cmake : fix ld warning duplicate libraries libllama.a (#4671 ) * fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.	2023-12-29 16:39:15 +02:00
Justine Tunney	4e00adf486	llava-cli : refactor to use sampling library (#4669 ) This change makes it possible to use flags like `--grammar` when using the `llava-cli` program. The rest is just code cleanup deleting a long standing TODO comment. This change also ensures that logging information is emitted to stderr which helps the `llava-cli` command be more friendly to shell scripts. See Mozilla-Ocho/llamafile@1cd334f	2023-12-29 16:38:38 +02:00
Justine Tunney	a2a1f7333e	server : replace sleep with condition variables (#4673 ) The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b	2023-12-29 16:24:12 +02:00
SakuraUmi	bb6f9cfce2	server : fix OpenAI server sampling w.r.t. penalty. (#4675 )	2023-12-29 16:22:44 +02:00
Karthik Sethuraman	be677135fb	server : allow to generate multimodal embeddings (#4681 )	2023-12-29 16:22:10 +02:00
andrijdavid	fe6e204f91	main-cmake-pkg : fix build issue (#4665 ) * Fix main-cmake-pkg compilation * Use glob to load common files * cmake : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-29 16:18:20 +02:00
Peter Sugihara	0f60ba09ce	llama.swiftui : fix infinite loop, output timings, buff UI (#4674 ) * fix infinite loop * slight UI simplification, clearer UX * clearer UI text, add timings to completion log	2023-12-29 15:58:56 +02:00
Georgi Gerganov	2753b503bc	scripts : print list of sync commits	2023-12-29 15:12:35 +02:00
Tamotsu Takahashi	0cb249f0e7	ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576) * Build with CLBlast * Declare GGML_API After rebasing, examples/talk-llama failed: "D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) -> "D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) -> (Link target) -> llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context ,void (__cdecl)(float,void ),void ,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]	2023-12-29 15:11:53 +02:00
Georgi Gerganov	ff7ec2ba2c	sync : ggml	2023-12-29 14:56:41 +02:00
bssrdf	3060718bfe	ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669) * fixed mul-mat error for old GPUs * style fixes * add mul mat src1 f16 test cases, fix more cases ggml-ci --------- Co-authored-by: bssrdf <bssrdf@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-29 14:54:19 +02:00
Georgi Gerganov	8c08d65631	scripts : do not sync commits from this repo	2023-12-29 14:54:05 +02:00
Justine Tunney	ca7d2aabab	Fix OpenAI server sampling w.r.t. temp and seed (#4668 ) The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29	2023-12-28 15:20:00 -04:00
manikbhandari	d818687d5a	gpt2 : Add gpt2 architecture integration (#4555 )	2023-12-28 15:03:57 +01:00
Nam D. Tran	28c5cf95d5	llama : add AWQ for llama, llama2, mpt, and mistral models (#4593 ) * update: awq support llama-7b model * update: change order * update: benchmark results for llama2-7b * update: mistral 7b v1 benchmark * update: support 4 models * fix: Readme * update: ready for PR * update: readme * fix: readme * update: change order import * black * format code * update: work for bot mpt and awqmpt * update: readme * Rename to llm_build_ffn_mpt_awq * Formatted other files * Fixed params count * fix: remove code * update: more detail for mpt * fix: readme * fix: readme * update: change folder architecture * fix: common.cpp * fix: readme * fix: remove ggml_repeat * update: cicd * update: cicd * uppdate: remove use_awq arg * update: readme * llama : adapt plamo to new ffn ggml-ci --------- Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io> Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-27 17:39:45 +02:00
Daniel Bevenius	766ccb2615	finetune : fix output formatting in print_params (#4653 ) This commit fixes the output formatting in the print_params function which currently looks like this: ```console print_params: n_vocab: 32000 print_params: n_ctx: 128 print_params: n_embd: 4096 print_params: n_ff: 11008 print_params: n_head: 32 print_params: n_head_kv: 32 print_params: n_layer: 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` With this commit the output will look like this: ```console print_params: n_vocab : 32000 print_params: n_ctx : 128 print_params: n_embd : 4096 print_params: n_ff : 11008 print_params: n_head : 32 print_params: n_head_kv : 32 print_params: n_layer : 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2023-12-27 16:16:55 +02:00
Georgi Gerganov	95dad80615	scripts : add sync-ggml-am.sh	2023-12-27 11:44:22 +02:00
Georgi Gerganov	352ef0145a	ggml : fix dot product for ARM (#4630 ) ggml-ci	2023-12-27 11:02:13 +02:00
wonjun Jang	e99ba145fd	Add byte token type when tokenizer.model does not exist (#4641 ) * Add byte token type to hf format * remove unused variable	2023-12-27 17:37:25 +09:00
slaren	aa98a787da	cuda : fix vmm pool with multi GPU (#4620 ) * cuda : fix vmm pool with multi GPU * hip * use recommended granularity instead of minimum * better error checking * fix mixtral * use cudaMemcpy3DPeerAsync * use cuda_pool_alloc in ggml_cuda_op_mul_mat * consolidate error checking in ggml_cuda_set_device * remove unnecessary inlines ggml-ci * style fixes * only use vmm for the main device * fix scratch buffer size, re-enable vmm pool for all devices * remove unnecessary check id != g_main_device	2023-12-26 21:23:59 +01:00
WillCorticesAI	676177b3ba	Update comment for AdamW implementation reference. (#4604 ) Co-authored-by: Will Findley <findley@gmail.com>	2023-12-26 11:42:08 +01:00
FantasyGmm	d7b195a340	Fix new CUDA10 compilation errors (#4635 )	2023-12-26 11:38:36 +01:00
Paul Tsochantaris	fe79201273	Adding Emeltal reference to UI list (#4629 )	2023-12-25 18:09:53 +02:00
slaren	c108921638	simplify bug issue template (#4623 )	2023-12-24 22:01:12 +02:00
Shintarou Okada	659fe6b867	llama : add PLaMo model (#3557 ) * add plamo mock * add tensor loading * plamo convert * update norm * able to compile * fix norm_rms_eps hparam * runnable * use inp_pos * seems ok * update kqv code * remove develop code * update README * shuffle attn_q.weight and attn_output.weight for broadcasting * remove plamo_llm_build_kqv and use llm_build_kqv * fix style * update * llama : remove obsolete KQ_scale * plamo : fix tensor names for correct GPU offload --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-24 15:35:49 +02:00
slaren	556d3ccdb7	cuda : improve cuda pool efficiency using virtual memory (#4606 ) * cuda : improve cuda pool efficiency using virtual memory * fix mixtral * fix cmake build * check for vmm support, disable for hip ggml-ci * fix hip build * clarify granularity * move all caps to g_device_caps * refactor error checking * add cuda_pool_alloc, refactor most pool allocations ggml-ci * fix hip build * CUBLAS_TF32_TENSOR_OP_MATH is not a macro * more hip crap * llama : fix msvc warnings * ggml : fix msvc warnings * minor * minor * cuda : fallback to CPU on host buffer alloc fail * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ensure allocations are always aligned * act_size -> actual_size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2023-12-24 14:34:22 +01:00
slaren	55ddd2af64	fallback to CPU buffer if host buffer alloc fails (#4610 )	2023-12-23 16:10:51 +01:00
Samuel Maynard	c2b5b7b128	ci(docker): fix tags in "Build and push docker image (tagged)" (#4603 )	2023-12-23 11:35:55 +02:00
Alexey Parfenov	593a2e1be5	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
kalomaze	57c6c251f8	grammar : check the full vocab only if necessary (opt) (#4306 ) * Check the full vocab for grammar only if necessary * Fix missing logit restoration step (?) Does this matter, actually? * Fix whitespace / formatting * Adjust comment * Didn't mean to push test gbnf * Split sampling into the helper function (?) And also revert the changes made to the header * common : fix final newline --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-23 11:27:07 +02:00
Johannes Gäßler	075c1781f6	CUDA: fixed row rounding for 0 tensor splits (#4594 )	2023-12-23 09:16:33 +01:00
LeonEricsson	4f3f1b832f	lookup : add prompt lookup decoding example (#4484 ) * initial commit, going through initializations * main loop finished, starting to debug * BUG: generates gibberish/repeating tokens after a while * kv_cache management * Added colors to distinguish drafted tokens (--color). Updated README * lookup : fix token positions in the draft batch * lookup : use n_draft from CLI params * lookup : final touches --------- Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 18:05:56 +02:00
Georgi Gerganov	f47ce05128	sync : ggml (fix im2col) (#4591 ) * cuda : fix im2col_f32_f16 (ggml/#658) ggml-ci * ggml-alloc : fix ggml_tallocr_is_own --------- Co-authored-by: leejet <leejet714@gmail.com>	2023-12-22 17:53:43 +02:00
FantasyGmm	ce2c5517e6	cuda : fix jetson compile error (#4560 ) * fix old jetson compile error * Update Makefile * update jetson detect and cuda version detect * update cuda macro define * update makefile and cuda, fix some issues * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update Makefile * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 17:11:12 +02:00
Henrik Forstén	e7ab935755	Fix CudaMemcpy direction (#4599 )	2023-12-22 14:34:05 +01:00
slaren	770b4b6cc6	llama : fix platforms without mmap (#4578 ) * llama : fix platforms without mmap * win32 : limit prefetch size to the file size * fix win32 error clobber, unnecessary std::string in std::runtime_error	2023-12-22 13:12:53 +02:00
Herman Semenov	03ef3ea477	ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203 )	2023-12-22 11:26:49 +02:00
Michael Kesper	10c9ac210e	make : add LLAMA_HIP_UMA option (#4587 ) NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA	2023-12-22 10:03:25 +02:00
rhuddleston	0d1176daa9	ci : tag docker image with build number (#4584 )	2023-12-22 08:56:34 +02:00
Deins	55c7355ee0	readme : add zig bindings (#4581 )	2023-12-22 08:49:54 +02:00
bobqianic	99f545dc77	ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579 )	2023-12-22 08:47:01 +02:00
crasm	2766ad3491	llama : add ability to cancel model loading (#4462 ) * llama : Add ability to cancel model load Updated llama_progress_callback so that if it returns false, the model loading is aborted. * llama : Add test for model load cancellation * Fix bool return in llama_model_load, remove std::ignore use * Update llama.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Fail test if model file is missing * Revert "Fail test if model file is missing" This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d. * Add test-model-load-cancel to Makefile * Revert "Revert "Fail test if model file is missing"" This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b. * Simplify .gitignore for tests, clang-tidy fixes * Label all ctest tests * ci : ctest uses -L main * Attempt at writing ctest_with_model * ci : get ci/run.sh working with test-model-load-cancel * ci : restrict .github/workflows/build.yml ctest to -L main * update requirements.txt * Disable test-model-load-cancel in make * Remove venv before creation * Restructure requirements.txt Top-level now imports the specific additional requirements for each python file. Using `pip install -r requirements.txt` will fail if versions become mismatched in the per-file requirements. * Make per-python-script requirements work alone This doesn't break the main requirements.txt. * Add comment * Add convert-persimmon-to-gguf.py to new requirements.txt scheme * Add check-requirements.sh script and GitHub workflow * Remove shellcheck installation step from workflow * Add nocleanup special arg * Fix merge see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573 * reset to upstream/master * Redo changes for cancelling model load --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-12-22 08:19:36 +02:00
Georgi Gerganov	f330ea5c2e	ggml : change ggml_scale to take a float instead of tensor (#4573 ) * ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci	2023-12-21 23:20:49 +02:00
Georgi Gerganov	b737eca163	gguf-py : fix broken link	2023-12-21 23:20:36 +02:00
Georgi Gerganov	4c919cc3e8	gguf : simplify example dependencies	2023-12-21 23:08:14 +02:00
Samuel Maynard	a4f5d18157	ci : add `jlumbroso/free-disk-space` to docker workflow (#4150 ) * [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo * [github][workflows][docker]: adds `jlumbroso/free-disk-space`	2023-12-21 22:36:26 +02:00
slaren	f5aeec7ecc	llama : initial ggml-backend integration (#4520 ) * llama : initial ggml-backend integration * add ggml-metal * cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST access all tensor data with ggml_backend_tensor_get/set * add ggml_backend_buffer_clear zero-init KV cache buffer * add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data * disable gpu backends with ngl 0 * more accurate mlock * unmap offloaded part of the model * use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap * update quantize and lora * update session copy/set to use ggml-backend ggml-ci * use posix_fadvise instead of posix_fadvise64 * ggml_backend_alloc_ctx_tensors_from_buft : remove old print * llama_mmap::align_offset : use pointers instead of references for out parameters * restore progress_callback behavior * move final progress_callback call to load_all_data * cuda : fix fprintf format string (minor) * do not offload scales * llama_mmap : avoid unmapping the same fragments again in the destructor * remove unnecessary unmap * metal : add default log function that prints to stderr, cleanup code ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 21:07:46 +01:00
Marcus Dunn	77618b25cb	llama : allow getting n_batch from llama_context in c api (#4540 ) * allowed getting n_batch from llama_context in c api * changed to use `uint32_t` instead of `int` * changed to use `uint32_t` instead of `int` in `llama_n_ctx` * Update llama.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 21:57:48 +02:00


1724 Commits