k.h.lai
9e51f2e934
vulkan: add workaround for iterator boundary check to fix clang-cl debug build (#7426)
2024-05-22 14:53:21 +02:00
Justine Tunney
6aed746d28
llama : add missing model type names (#7445)
2024-05-22 14:08:18 +03:00
Georgi Gerganov
5b113062d5
cuda : fix compile warning (#7454)
2024-05-22 12:36:37 +03:00
Johannes Gäßler
71bf04b8bd
CUDA: remove incorrect precision check (#7454)
2024-05-22 10:24:29 +02:00
Georgi Gerganov
300da6320d
cuda : fix rope + add tests (#7452)
...
* cuda : fix rope pos data
ggml-ci
* ggml : drop mode & 1 == 1 support for ggml_rope
ggml-ci
* ggml : support freq_factors for f16 rope (CPU)
ggml-ci
* tests : add rope tests using frequency factors
ggml-ci
2024-05-22 11:01:35 +03:00
liuwei-git
c1a6ad7577
llama : add phi3 128K model support (#7225)
...
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Georgi Gerganov
58ca88c1f3
metal : handle F16 inf values, fix FA partial offload (#7434)
...
ggml-ci
2024-05-21 23:03:42 +03:00
Olivier Chafik
287fa980b8
grammars: fix resampling logic regression (#7424)
2024-05-21 20:40:00 +01:00
Johannes Gäßler
1f2bce9bc2
CUDA: fix unused warning in mmq.cu (#7442)
2024-05-21 20:27:12 +03:00
Georgi Gerganov
61ab7a8eb1
tests : test-tokenizer-0.sh print more info (#7402)
2024-05-21 19:53:48 +03:00
Amir
e205f11bbc
examples: cache hf model when --model not provided (#7353)
...
* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
Johannes Gäßler
260949cad5
CUDA: deduplicate mmq code (#7397)
2024-05-21 16:02:12 +02:00
jaime-m-p
0dbe001317
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
...
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
49a32c0167
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
...
* Update brute force test: special tokens
* Fix added tokens
- Try to read 'added_tokens.json'.
- Try to read 'tokenizer_config.json'.
- Try to read 'tokenizer.json'.
* Fix special tokens rtrim
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Georgi Gerganov
60faeefff0
llama : remove Persimmon (#7408)
...
* llama : remove Persimmon
* requirements : remove
2024-05-21 02:35:28 +10:00
Johannes Gäßler
a2a24aec6f
perplexity: update README FP16 results [no ci] (#7413)
2024-05-20 18:15:38 +02:00
Radoslav Gerganov
9b6d4a568c
rpc : track allocated buffers (#7411)
...
* rpc : track allocated buffers
ref: #7407
* rpc : pack rpc_tensor tightly
2024-05-20 16:36:55 +03:00
Georgi Gerganov
31b2d6e05b
server : fix temperature + disable some tests (#7409)
...
* server : fix temperature
* server : disable tests relying on parallel determinism
* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
AidanBeltonS
6de307daa8
[SYCL] Update SYCL upscale operation (#7321)
...
* Update SYCL upscale operation
* Formatting
* Remove messages
2024-05-20 16:38:23 +05:30
Bingan
65fca291a0
Update README.md (#7410)
2024-05-20 11:55:34 +02:00
Herman Semenov
a00e636fc5
ggml-opencl, llama: use reserve() when count is already known (#7272)
2024-05-20 10:33:21 +03:00
junchao-loongson
0ad2755e84
ggml : add loongarch lsx and lasx support (#6454)
...
* add loongarch lsx and lasx optimized code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2
---------
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20 10:19:21 +03:00
Georgi Gerganov
c930c28bec
server : tuning tests (#7388)
...
* server : don't pass temperature as string
* server : increase timeout
* tests : fix the fix 0.8f -> 0.8
ggml-ci
* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
9cc3a7c871
server : return error on too large embedding input (#7389)
2024-05-20 08:56:05 +03:00
Georgi Gerganov
8a5e27cbd7
tests : fix --keep_split -> --keep-split (#7374)
2024-05-20 08:55:09 +03:00
Srihari-mcw
2f4cf4d13a
Add Windows support for BF16 code, including a CMake provision for enabling AVX512_BF16 (#7258)
2024-05-20 12:18:39 +10:00
slaren
cb9cf0fb9b
llama : remove MPI backend (#7395)
2024-05-20 01:17:03 +02:00
Fred Douglas
f43b1eb190
quantize : fix --keep-split check (#7374)
2024-05-19 19:37:04 +03:00
0cc4m
924913a1b7
Vulkan Embedding Fix (#7360)
...
* Fix empty Vulkan host buffers
Add fp32 fp16 matmul shader
Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
2024-05-19 17:19:53 +02:00
slaren
3265340345
ggml : fix another case of quants nans (#7387)
2024-05-19 17:08:46 +02:00
Johannes Gäßler
59a38d4847
ggml: implement quantized KV cache for FA (#7372)
2024-05-19 16:46:13 +02:00
Johannes Gäßler
a742a54fd0
server: add test for token probs (#7347)
2024-05-19 16:26:02 +02:00
Johannes Gäßler
9ae757d0b5
server: fix seed being reported back (#7382)
2024-05-19 17:06:33 +03:00
Anas Ahouzi
753bb58afa
Add StableLM2 pre-tokenizer (#7349)
...
* Add StableLM pre-tokenizer
* Fix space
* Fix trailing whitespace
2024-05-19 22:46:46 +10:00
slaren
802b614cd9
cuda : clear error after buffer allocation failure (#7376)
2024-05-19 14:19:37 +02:00
Brian
a846498a4a
labeler.yml: Use settings from ggerganov/llama.cpp [no ci] (#7363)
...
https://github.com/actions/labeler#using-configuration-path-input-together-with-the-actionscheckout-action
recommends using the checkout action so that the correct repo context is
available when applying settings for PR labels, e.g.:
steps:
- uses: actions/checkout@v4 # Uploads repository content to the runner
  with:
    repository: "owner/repositoryName" # One of the available inputs; visit https://github.com/actions/checkout#readme to find more
- uses: actions/labeler@v5
  with:
    configuration-path: 'path/to/the/uploaded/configuration/file'
2024-05-19 20:51:03 +10:00
Georgi Gerganov
beb87b0aed
cmake : update android comments (#7341)
2024-05-19 11:01:01 +03:00
fraxy-v
64ae46a41c
Capture CUDA logging output (#7298)
...
* logging: output capture in cuda module
* fix compile error
* fix: vsnprintf terminates with 0, string use not correct
* post review
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-05-19 00:44:42 +02:00
Georgi Gerganov
cc5796c0ec
ci : re-enable sanitizer runs (#7358)
...
* Revert "ci : temporary disable sanitizer builds (#6128)"
This reverts commit 4f6d1337ca.
* ci : trigger
2024-05-18 18:55:54 +03:00
Georgi Gerganov
ae3045fe3f
android : use "ci-android" branch for CI (#7341)
...
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
6fe8769d65
CUDA: deduplicate FlashAttention code (#7352)
2024-05-18 12:36:25 +02:00
Johannes Gäßler
97cd158809
server: correct --threads documentation [no ci] (#7362)
2024-05-18 11:10:47 +02:00
Engininja2
faf3777e1c
cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263)
2024-05-18 10:05:17 +02:00
Steffen Röcker
1e9bede474
llama : add support for larger Granite Code Models (20B, 34B) (#7324)
...
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116
A few things still remain to be fixed.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
2024-05-18 11:04:55 +03:00
strawberrymelonpanda
048941c1ee
perplexity : ndot progress and show stats with < 100 tasks (#7348)
...
Fix a floating point error in ndot printing, and allow end stats with fewer tasks when running multiple-choice tasks.
2024-05-18 10:57:08 +03:00
0cc4m
1c30af8886
Update and fix Vulkan soft_max and argsort implementations (#7237)
...
* Update and fix Vulkan softmax implementation
* Update and fix Vulkan argsort implementation
2024-05-18 08:10:58 +02:00
Brian
85733c54b1
github-actions-labeler: initial commit (#7330)
...
* github-actions-labeler: initial commit [no ci]
* github actions: remove priority auto labeling [no ci]
2024-05-18 16:04:23 +10:00
Georgi Gerganov
59f9af2239
convert : fix set_vocab_sentencepiece (#6866)
...
* convert : fix set_vocab_sentencepiece
* Update convert-hf-to-gguf.py
2024-05-18 08:46:20 +03:00
slaren
1e854ead0c
ggml : fix quants nans when all the group weights are very close to zero (#7313)
2024-05-18 02:39:54 +02:00
Engininja2
d11ac19068
cmake : fix typo in AMDGPU_TARGETS (#7356)
2024-05-18 02:39:25 +02:00