Commit Graph

813 Commits

Evan Miller
7282a23a5e mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
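The recv/send refactor described above lends itself to a pipeline pattern: each rank computes its slice of the layers and forwards the activations to the next rank. A minimal sketch in C under those assumptions (toy buffer size, single tag; not the actual ggml-mpi code):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N_EMBD = 8 };            // toy activation size
    float embd[N_EMBD] = {0};

    if (size > 1) {
        if (rank == 0) {
            // first stage: compute local layers, then forward activations
            for (int i = 0; i < N_EMBD; i++) embd[i] = (float) i;
            MPI_Send(embd, N_EMBD, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else {
            // later stages: receive from the previous rank, compute, forward
            MPI_Recv(embd, N_EMBD, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (rank + 1 < size) {
                MPI_Send(embd, N_EMBD, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
            } else {
                printf("last rank %d got embd[0] = %.1f\n", rank, embd[0]);
            }
        }
    }

    MPI_Finalize();
    return 0;
}
```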
oobabooga
645219e901 llama : remove "first token must be BOS" restriction (#2153) 2023-07-09 11:59:53 +03:00
Nigel Bosch
daecfe4059 main : escape prompt prefix/suffix (#2151) 2023-07-09 11:56:18 +03:00
JackJollimore
a9f858bae7 readme : update Termux instructions (#2147)
File paths matter when running models inside Termux on Android devices: llama.cpp performs better when the .bin model is loaded from the $HOME directory.
2023-07-09 11:20:43 +03:00
clyang
80192cec3a ggml : fix building with Intel MKL that still asks for "cblas.h" (#2104) (#2115)
* Fix building with Intel MKL that still asks for "cblas.h"

* Use angle brackets to indicate the system library
2023-07-09 11:12:20 +03:00
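The angle-bracket change concerns include search order: the quoted form searches the project directory before the system paths, while angle brackets go straight to the system include paths where MKL installs its headers. Illustration in C (assuming MKL's `mkl_cblas.h`; the exact header depends on the MKL setup):

```c
// #include "cblas.h"     // quoted: project-local lookup first, may miss MKL
#include <mkl_cblas.h>    // angle brackets: system library header
```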
rankaiyx
e62b9c1c04 readme : add more docs indexes (#2127)
* Update README.md to add more docs indexes

* Update README.md to add more docs indexes
2023-07-09 10:38:42 +03:00
Johannes Gäßler
36054e5b18 Fixed OpenLLaMA 3b CUDA mul_mat_vec_q (#2144) 2023-07-08 20:01:44 +02:00
Johannes Gäßler
305b61f7a2 CUDA: add __restrict__ to mul mat vec kernels (#2140) 2023-07-08 00:25:15 +02:00
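For context, `__restrict__` promises the compiler that the annotated pointers never alias, so loads can be hoisted and the accumulator kept in a register. A scalar C analogue of a mul-mat-vec kernel using C99 `restrict` (the CUDA device code uses the `__restrict__` spelling):

```c
// mat, vec, and out are guaranteed not to alias, so the compiler may
// vectorize the inner loop without re-loading after each store to out.
void mul_mat_vec(const float * restrict mat, const float * restrict vec,
                 float * restrict out, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++) {
            sum += mat[(long) r * cols + c] * vec[c];
        }
        out[r] = sum;
    }
}
```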
dylan
7466f4edad docker : add support for CUDA in docker (#1461)
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 21:25:25 +03:00
Georgi Gerganov
d94ea1d98a ci : switch threads to 1 (#2138) 2023-07-07 21:23:57 +03:00
Qingyou Meng
d259d42719 ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287

* rewrite: no longer consider backward compatibility; plan and make_plan

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00
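Going by the names in the messages above, the rework separates planning from execution so that the caller, not a ggml_context, owns the work buffer, while ggml_graph_compute_with_ctx() preserves the old one-call behaviour. A hedged sketch of the resulting flow (exact signatures may differ between ggml versions):

```c
#include <stdlib.h>
#include "ggml.h"

void compute_graph(struct ggml_context * ctx, struct ggml_cgraph * graph) {
    // plan first: reports how much scratch memory the worker threads need
    struct ggml_cplan plan = ggml_graph_plan(graph, /*n_threads=*/4);
    if (plan.work_size > 0) {
        plan.work_data = malloc(plan.work_size);   // caller owns the buffer
    }
    ggml_graph_compute(graph, &plan);
    free(plan.work_data);

    // or keep the pre-#1999 one-liner via the compatibility helper:
    ggml_graph_compute_with_ctx(ctx, graph, /*n_threads=*/4);
}
```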
Georgi Gerganov
77f39d5b4f ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134) 2023-07-07 18:37:10 +03:00
Aarni Koskela
11cda014ac convert.py: add mapping for safetensors bf16 (#1598)
Fixes #1473
2023-07-07 09:12:49 -04:00
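bf16 keeps float32's sign bit and 8-bit exponent and truncates the mantissa to 7 bits, so mapping it back to f32 is just a 16-bit shift. A self-contained C helper showing the conversion the mapping relies on (convert.py works with the same bit layout from Python):

```c
#include <stdint.h>
#include <string.h>

// bf16 is the top 16 bits of an IEEE-754 binary32 value.
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   // bit-cast without aliasing UB
    return f;
}
```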
Howard Su
b29cd6ba6d Fix OpenCL by wrapping #if-else-endif with \n (#2086) 2023-07-07 05:34:18 +02:00
Georgi Gerganov
0949e52280 ggml : fix restrict usage 2023-07-06 19:41:31 +03:00
Judd
56e5233320 convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivatives are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy
6815028f35 alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML format. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` now requires GGML V3, and those model files are named `*ggmlv3*.bin`. The example should point to a model file that actually works, so that it runs out of the box for more people and fewer people waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Tobias Lütke
2978dd92ec Expose generation timings from server & update completions.js (#2116)
* use JavaScript generators as a much cleaner API

Also add ways to access the completion as a Promise or an EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson
8a3fecd99b Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Georgi Gerganov
bd32bac50c ggml : fix bug introduced in #1237 2023-07-05 20:44:11 +03:00
Georgi Gerganov
5ac2fe8ccf tests : fix test-grad0 2023-07-05 20:20:25 +03:00
Stephan Walter
e0a5b08cdc ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
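Generalizing quantize_fns means FP16 goes through the same per-type dispatch table as the quantized formats instead of being special-cased. A hypothetical sketch of the shape of such a table (names are illustrative, not ggml's actual API):

```c
// One entry per tensor type: convert to/from f32, and dot product
// computed directly in the native representation.
typedef void  (*to_f32_fn)  (const void * x, float * y, int n);
typedef void  (*from_f32_fn)(const float * x, void * y, int n);
typedef float (*vec_dot_fn) (int n, const void * x, const void * y);

struct type_traits {
    const char * name;
    to_f32_fn    to_f32;      // dequantize, or f16 -> f32 widen
    from_f32_fn  from_f32;    // quantize,   or f32 -> f16 round
    vec_dot_fn   vec_dot;     // kernels dispatch through this
};
```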
Jesse Jojo Johnson
2dc0a82cd0 Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Johannes Gäßler
aa060f64de Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
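The point of quantized dot products is to accumulate the int8 quants in integer arithmetic and apply the per-block scales once per block, rather than dequantizing every element to float first. A scalar C sketch of the idea with an assumed 8-bit block layout (the CUDA kernels do the same per thread/warp):

```c
#include <stdint.h>

#define QK 32                        // values per block (illustrative)

typedef struct {
    float  d;                        // per-block scale
    int8_t qs[QK];                   // quantized values
} block_q8;

static float vec_dot_q8(int nblocks, const block_q8 * x, const block_q8 * y) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t s = 0;               // integer accumulation within the block
        for (int j = 0; j < QK; j++) {
            s += (int32_t) x[i].qs[j] * y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) s;   // rescale once per block
    }
    return sum;
}
```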
Howard Su
928a2061d8 llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Johannes Gäßler
bfbd12322e Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
fdc61bd755 embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
Georgi Gerganov
d9c33ac749 readme : add link to web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
40c6a525ea ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
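For reference, a plain-C reading of the new unary/reduction ops in scalar form (the ggml versions operate elementwise on tensors, with ggml_argmax reducing along a row; ELU shown with the default alpha = 1):

```c
#include <math.h>

float elu(float x) { return x >= 0.0f ? x : expf(x) - 1.0f; }

// tanh comes from <math.h>; argmax returns the index of the largest value
int argmax(const float * v, int n) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (v[i] > v[best]) best = i;
    }
    return best;
}
```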
jwj7140
d7d540c04d Add an API example using server.cpp, similar to the OpenAI API. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
111f1b47ab Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama, we just take a mutex when it's invoked.

* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
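The threading note above boils down to "many HTTP threads, one model": static files and 404s are served concurrently, while anything that touches the llama context serializes on a mutex. A C sketch of that pattern (pthreads; not the actual server code, which is C++):

```c
#include <pthread.h>

static pthread_mutex_t g_model_mutex = PTHREAD_MUTEX_INITIALIZER;

// called from any HTTP worker thread
void handle_completion(void) {
    pthread_mutex_lock(&g_model_mutex);    // one inference at a time
    /* ... run inference against the single llama context ... */
    pthread_mutex_unlock(&g_model_mutex);
}
```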
Henri Vasserman
f1533445e7 Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
7342fb172e Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Erik Scholz
bdf2652c83 CI: make the brew update temporarily optional. (#2092)
Until the brew installation in the macOS runners is fixed upstream;
see the open issues, e.g. https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
1d97415a70 [ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
Henri Vasserman
8e9801d5b0 fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
Howard Su
b4acd69f66 Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
Howard Su
1a58c36ca8 [llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
WangHaoranRobin
552a05c6dc server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
Georgi Gerganov
8fe97102e5 ggml : fix build with OpenBLAS (close #2066) 2023-07-02 09:46:46 +03:00
Johannes Gäßler
1e9814bf70 Better CUDA synchronization logic (#2057) 2023-07-01 21:49:44 +02:00
Johannes Gäßler
2764eaeaf7 Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
Daniel Drake
abbce9e3f9 cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is no specific CPU-tuning target for aarch64, then -mcpu
should be omitted completely. I think that makes sense; there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
Aaron Miller
b67aea635d metal : release buffers when freeing metal context (#2062) 2023-07-01 21:14:59 +03:00
Judd
94e1a0ab7d convert : add support for baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Georgi Gerganov
fcce8eb52b llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
Rand Xie
f44d5638b8 llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedback to avoid using exceptions
2023-07-01 19:02:58 +03:00
Georgi Gerganov
0f42fa7e17 embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
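The commit title names a classic C/C++ bug: returning the address of storage whose lifetime ends when the function returns. A generic illustration (not the actual embd-input code):

```c
#include <stdio.h>

/* The bug class:
 *
 *   const char * get_label(void) {
 *       char buf[16];
 *       snprintf(buf, sizeof buf, "embd");
 *       return buf;           // dangling: buf's lifetime ends here
 *   }
 *
 * One fix is to let the caller own the storage: */
void get_label(char * out, size_t out_size) {
    snprintf(out, out_size, "embd");
}
```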
Georgi Gerganov
080d870d99 train : fix compile warning 2023-07-01 18:45:44 +03:00
Qingyou Meng
6fcaeb4790 ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00