ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-29 10:51:51 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	e05036aac6	metal : improve dequantize precision to match CPU (#4836) ggml-ci	2024-01-09 19:37:08 +02:00
Georgi Gerganov	67c950b390	scripts : improve get-pg.sh (#4838)	2024-01-09 19:21:13 +02:00
iohub	a1392fc017	readme : add 3rd party collama reference to UI list (#4840) Add a VSCode extension for llama.cpp reference to UI list	2024-01-09 18:45:54 +02:00
Georgi Gerganov	377e2df071	scripts : script to get Paul Graham essays in txt format (#4838)	2024-01-09 16:23:05 +02:00
Behnam M	b680381cfd	server : update readme about token probs (#4777) * updated server readme to reflect the gg/server-token-probs-4088 commit added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`. * simplified the `completion_probabilities` JSON schema It's now easier to understand what the structure of `completion_probabilities` looks like. * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-09 12:02:05 +02:00
Zsapi	2d0d38f5e0	server : add api-key flag to documentation (#4832) Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441	2024-01-09 11:12:43 +02:00
Georgi Gerganov	d39e6e0cad	ggml : fix vld1q_s8_x4 32-bit compat (#4828) * ggml : fix vld1q_s8_x4 32-bit compat ggml-ci * ggml : fix 32-bit ARM compat (cont) ggml-ci	2024-01-09 10:42:06 +02:00
Johannes Gäßler	635bddbf8b	CUDA: faster softmax via shared memory + fp16 math (#4742)	2024-01-09 08:58:55 +01:00
howlger	cfbe33b956	common : fix the short form of `--grp-attn-w`, not `-gat` (#4825) See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57	2024-01-08 21:05:53 +02:00
Georgi Gerganov	ac412c4ec1	readme : add link to SOTA models	2024-01-08 20:25:17 +02:00
Kawrakow	7daaac5c6d	SOTA 2-bit quants (#4773 ) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slighty faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slighty faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-08 16:02:32 +01:00
Georgi Gerganov	0fe35cdf38	swift : exclude ggml-metal.metal from the package (#4822)	2024-01-08 16:40:51 +02:00
Georgi Gerganov	3e86f86432	llama.swiftui : update readme	2024-01-08 15:57:36 +02:00
Georgi Gerganov	7ddf5857e7	main : add self-extend support (#4815) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * llama : "self-extend"-like context extension * passkey : add comment * main : add Self-Extend support * llama : add comment about llama_kv_cache_seq_div	2024-01-08 11:18:32 +02:00
Georgi Gerganov	a386b0dd63	examples : add passkey test (#3856) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * make : add passkey target * passkey : add "self-extend"-like context extension (#4810) * llama : "self-extend"-like context extension * passkey : add comment * passkey : add readme	2024-01-08 11:14:04 +02:00
Lars Grammel	9e96d6076a	readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)	2024-01-07 22:24:11 +02:00
slaren	d513cfc4b5	llama-bench : add no-kv-offload parameter (#4812)	2024-01-07 17:59:01 +01:00
Johannes Gäßler	770ec541f9	CUDA: fixed redundant value dequantization (#4809)	2024-01-07 17:24:08 +01:00
Georgi Gerganov	ec08b3e86f	llama : remove unused vars (#4796)	2024-01-07 14:29:36 +02:00
Georgi Gerganov	3a96073b59	llama : remove redundant GQA check (#4796)	2024-01-07 11:21:53 +02:00
Alex Azarov	30df691a96	llama.swiftui : use llama.cpp as SPM package (#4804)	2024-01-07 10:20:50 +02:00
Georgi Gerganov	52b664aece	llama : print tensor meta for debugging	2024-01-07 09:51:12 +02:00
Alex Azarov	8c36aaf5a8	llama.swiftui : add visionOS target (#4805)	2024-01-07 09:46:55 +02:00
Konstantin Zhuravlyov	5391345fcc	ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787)	2024-01-07 08:52:42 +02:00
Georgi Gerganov	003f85d7ea	server : fix n_predict check (#4798)	2024-01-07 08:45:26 +02:00
Daniel Illescas Romero	34d18eff4c	llama.swiftui : use correct pointer for llama_token_eos (#4797)	2024-01-06 17:12:59 +02:00
Georgi Gerganov	33c9d849fd	examples : improve base-translate.sh script (#4783)	2024-01-06 11:40:24 +02:00
a-n-n-a-l-e-e	b52357162d	cmake : check for openblas64 (#4134) openblas v0.3.22 64-bit pkg-config file is named openblas64.pc https://github.com/OpenMathLib/OpenBLAS/issues/3790	2024-01-05 18:04:40 +02:00
Ikko Eltociear Ashimine	f4ee045ad0	flake.nix : fix typo (#4700) betwen -> between	2024-01-05 18:02:44 +02:00
Georgi Gerganov	7e27e37f26	metal : switch back to default.metallib (ggml/681) ggml-ci	2024-01-05 18:02:06 +02:00
Georgi Gerganov	d6ec7cfc70	ggml : fix q2_k bpw in comments (ggml/680)	2024-01-05 18:02:06 +02:00
Finn Voorhees	0630261a48	ggml : add error handling to graph_compute (whisper/1714)	2024-01-05 18:02:06 +02:00
Georgi Gerganov	5ffddb870b	ggml : do not sched_yield when calling BLAS (#4761) * ggml : do not sched_yield when calling BLAS ggml-ci * ggml : fix do_yield logic ggml-ci * ggml : simplify do_yield logic ggml-ci	2024-01-05 15:18:21 +02:00
Georgi Gerganov	41ced5ce3c	examples : add few-shot translation example (#4783)	2024-01-05 15:11:10 +02:00
Daniel Bevenius	0c4cb7138c	finetune : remove unused includes (#4756) This commit removes unused includes from finetune.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-04 21:45:37 +02:00
Georgi Gerganov	82e82f484d	server : send token probs for "stream == false" (#4714)	2024-01-04 19:56:33 +02:00
Johannes Gäßler	b0a9bb90f9	Print backend name on test-backend-ops failure (#4751)	2024-01-04 09:43:23 +01:00
singularity	2d08e99f47	llama.swiftui : support loading custom model from file picker (#4767) * swiftui: support load model from file picker * swiftui: remove trailing whitespace	2024-01-04 10:22:38 +02:00
Michael Coppola	85648efa9e	server : fix options in README.md (#4765) * fix examples/server/README.md * minor : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-04 10:17:09 +02:00
Georgi Gerganov	7967c42ffb	ggml : include stdlib.h before intrin.h (#4736)	2024-01-04 10:12:26 +02:00
singularity	c399a87c6b	llama.swiftui : fix build of ggml.metallib (#4754) * metal: fix metal backend init failure in swiftui * metal: build ggml.metallib instead of copy src * llama.swift : remove debug flags from metallib build --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-04 09:58:16 +02:00
Daniel Bevenius	41a287de3c	train : fix typo in overlapping-samples help msg (#4758) This commit fixes a typo in the help message for the --overlapping-samples option. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-03 19:53:40 +02:00
Ashraful Islam	59092ff962	swift : update Package.swift to use ggml as dependency (#4691) * updates the package.swift to use ggml as dependency * changes the ggml package url src to ggerganov	2024-01-03 19:30:02 +02:00
Georgi Gerganov	f2001ff46d	cuda : simplify expression Co-authored-by: slaren <slarengh@gmail.com>	2024-01-03 14:38:38 +02:00
Georgi Gerganov	09d890cb54	cuda : mark I16 and I32 ops as unsupported ggml-ci	2024-01-03 14:38:38 +02:00
Georgi Gerganov	4ebea0bdce	sync : ggml ggml-ci	2024-01-03 14:38:38 +02:00
Georgi Gerganov	514561978d	metal : add kernel_get_rows_i32 ggml-ci	2024-01-03 14:38:38 +02:00
Georgi Gerganov	74b4d9c1ed	scripts : fix sync order + metal sed	2024-01-03 14:38:38 +02:00
Guillaume Wenzek	b2cfdd2ea3	ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639) * add more int ops * ggml_compute_forward_dup_bytes * add tests * PR comments * tests : minor indentations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-03 14:38:38 +02:00
Justin Parker	5b56760f5c	server : throw an error when `slot unavailable` (#4741)	2024-01-03 10:43:19 +02:00

2002 Commits