ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	ccdb948329	Offload Bitnet token embeddings to the GPU - the right way OK, I should have checked how it was done for Gemma and do the same for Bitnet. But better late than never.	2024-07-26 13:50:41 +03:00
Kawrakow	94b5916319	Offload Bitnet token embeddings to the GPU (#1 ) * bitnet: put token embeddings on the GPU * Update README with the new CUDA/Meat performance --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-26 09:41:04 +02:00
Iwan Kawrakow	99119ec29c	When tokenizer info is missing in the model, use llama3 by default	2024-07-19 12:29:01 +03:00
Iwan Kawrakow	71725a918f	bitnet: fold V scale into rms_norm	2024-06-26 12:05:57 +02:00
Iwan Kawrakow	767bce7caf	Typo	2024-06-25 19:17:14 +03:00
Iwan Kawrakow	707d087927	Bitnet: tiny bity faster 1.625 bpw variant on Metal We get 70.7 t/s for TG-128 vs 69.5 t/s before.	2024-06-24 16:42:30 +02:00
Iwan Kawrakow	8c936e3d65	bitnet: replace ggml_mul with ggml_scale to apply the scales Also save one scale operation in the ffn network by adjusting rms_eps. We gain up to 3% in performance by doing this, but it is a bit of a hack (we store the tensor scales in op_params while loading the model).	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	36374ab37d	bitnet(scale in a separate tensor): mul -> scale on the CPU	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	d08ff0df43	Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale" This reverts commit f83381371b61e0863b55c60e5f5df139126a496d. When using CUDA, the tensor contents have not been loaded yet, so we crash when trying to access the scale when building the graph. There must be a better way.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	ad60fb3567	bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale This recovers part of the performance loss. On Metal TG-128 is now 92 t/s, still short of the ~100 t/s with scales applied on the fly.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	58d9e8f1d2	bitnet: put the scale in a separate tensor and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Commiting so I can go and test on othe platforms.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	f6863cfa1b	bitnet: add 2 bpw quantization The scalar dot product already chieves 37 t/s for TG!	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	0f53bc30bb	bitnet: CUDA, scalar, AVX2	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	f20b28558b	bitnet: python + llama	2024-06-22 12:02:51 +03:00
Georgi Gerganov	a927b0f3dd	llama : optimize long word tokenization with WPM (#8034 ) ggml-ci	2024-06-21 08:51:28 +03:00
Douglas Hanley	80ea089d77	llama : allow pooled embeddings on any model (#7477 ) * create append_pooling operation; allow to specify attention_type; add last token pooling; update examples * find result_norm/result_embd tensors properly; update output allocation logic * only use embd output for pooling_type NONE * get rid of old causal_attn accessor * take out attention_type; add in llama_set_embeddings * bypass logits when doing non-NONE pooling	2024-06-21 08:38:22 +03:00
jaime-m-p	37bef89433	tokenizer : BPE fixes (#7530 ) * Random test: add_bos_token, add_eos_token * Random test: add BPE models for testing * Custom regex split fails with codepoint 0 * Fix falcon punctuation regex * Refactor llm_tokenizer_bpe: move code to constructor * Move 'add_special_bos/eos' logic to llm_tokenizer_bpe * Move tokenizer flags to vocab structure. * Default values for special_add_bos/eos * Build vocab.special_tokens_cache using vocab token types * Generalize 'jina-v2' per token attributes * Fix unicode whitespaces (deepseek-coder, deepseek-llm) * Skip missing byte tokens (falcon) * Better unicode data generation * Replace char32_t with uint32_t	2024-06-18 18:40:52 +02:00
Ștefan-Gabriel Muscalu	a94e6ff877	update: support Qwen2-57B-A14B (#7835 ) * update: convert-hf-to-gguf.py to support Qwen2-57B-A14B * fix: QWEN2MOE support for expert_feed_forward_length previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH n_ff_exp and n_ff_shared_exp are now properly calculated * update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM * fix: QWEN2MOE support for expert_feed_forward_length previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH n_ff_exp and n_ff_shexp are now properly calculated	2024-06-17 21:08:46 +02:00
Georgi Gerganov	7c26775adb	llama : disable FA if KV head size do not match (#7982 )	2024-06-17 19:40:01 +03:00
Frank Mai	c637fcd34d	fix: divide 0 exception in mamba (#7932 ) Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-06-17 16:11:08 +02:00
Markus Tavenrath	6a2f0b3474	Implement non-mapped async IO for CUDA on Windows. (#7896 ) * Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive. * Free resources except for backend. * Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA. * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * Fix editorconfig and unused variable * Fix issues with Windows build --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-17 16:10:15 +02:00
Georgi Gerganov	52399254b3	unicode : avoid char32_t (#7957 ) ggml-ci	2024-06-16 14:51:40 +03:00
Meng, Hengyu	7b2f4a7d19	[SYCL] remove global variables (#7710 ) * separate DPCT helpers outside * replace global variables with context * remove useless extra * update mul_mat condition * remove duplicate buft initialization * remove duplicate extra and global work group size * remove useless backend check * remove duplicated extras * use macro for group_size and remove cuda-related	2024-06-15 14:05:10 +08:00
Sigbjørn Skjæret	6fcd1331ef	llama : more checks before assuming FIM tokens (#7644 ) * More checks before assuming FIM tokens for Llama arch * extensive token check	2024-06-14 13:20:04 +03:00
Elaine	41b9260f18	convert : add Poro-34B-chat tokenizer support (#7713 ) * support for Poro chat pre-tokenizer * add support for Poro pre-tokenizer * Update convert-hf-to-gguf-update.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Change Poro-34B-chat to poro-chat * Change Poro-34B-chat to poro-chat * Update convert-hf-to-gguf-update.py * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-14 13:16:49 +03:00
slaren	f578b86b21	move BLAS to a separate backend (#6210 ) * move BLAS to a separate backend * rename GGML_USE_OPENBLAS to GGML_USE_BLAS * alloc : reuse same buffer when the same buffer type if used multiple times * set number of threads automatically for openblas and blis * sched : print assignments when GGML_SCHED_DEBUG env variable is set * sched : allow ops with weights on an incompatible buffer type This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-13 03:11:35 +02:00
slaren	c9ee7118d5	check for nans in imatrix and quantize (#7807 ) * imatrix : detect nan/inf values * quantize : check imatrix for nan/inf values	2024-06-07 09:01:29 +03:00
Clint Herron	ad675e1c67	Added support for . (any character) token in grammar engine. (#6467 ) * Added support for . (any characer) token in grammar engine. * Add integration tests for any-character symbol.	2024-06-06 06:08:52 -07:00
Joan Fontanals	f5d7b268ec	llama : add jina v2 base code (#7596 ) * feat: add changes to handle jina v2 base code * fix: do not complicate things * fix: fix the usage of the code model * fix: fix comments * fix: fix linting issues * fix: remove ollama patches * style : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-06 10:22:41 +03:00
Georgi Gerganov	2b3389677a	ggml : refactor rope norm/neox (#7634 ) * ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci	2024-06-05 11:29:20 +03:00
Georgi Gerganov	1442677f92	common : refactor cli arg parsing (#7675 ) * common : gpt_params_parse do not print usage * common : rework usage print (wip) * common : valign * common : rework print_usage * infill : remove cfg support * common : reorder args * server : deduplicate parameters ggml-ci * common : add missing header ggml-ci * common : remote --random-prompt usages ggml-ci * examples : migrate to gpt_params ggml-ci * batched-bench : migrate to gpt_params * retrieval : migrate to gpt_params * common : change defaults for escape and n_ctx * common : remove chatml and instruct params ggml-ci * common : passkey use gpt_params	2024-06-04 21:23:39 +03:00
Georgi Gerganov	554c247caf	ggml : remove OpenCL (#7735 ) ggml-ci	2024-06-04 21:23:20 +03:00
Georgi Gerganov	0cd6bd3483	llama : remove beam search (#7736 )	2024-06-04 21:23:05 +03:00
jaime-m-p	3b38d48609	Per token attributes (#7685 ) * Add per token attributes enum * Using phi-3 for testing 'rstrip' * Using jina-v2 for testing 'lstrip' * Brute force test for 'lstrip' and 'rstrip' * Implement 'rstrip' and 'lstrip' * Update phi-3 GGUF file (obsolete since `917dc8c`) * Replace llama_token_type with llama_token_attribs	2024-06-04 09:17:17 +02:00
Radoslav Gerganov	bde7cd3cd9	llama : offload to RPC in addition to other backends (#7640 ) * llama : offload to RPC in addition to other backends * - fix copy_tensor being called on the src buffer instead of the dst buffer - always initialize views in the view_src buffer - add RPC backend to Makefile build - add endpoint to all RPC object names * add rpc-server to Makefile * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-03 20:03:26 +03:00
0cc4m	3d7ebf6312	Vulkan Mixture of Experts (MoE) support (#7628 ) * Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU	2024-06-03 10:59:14 +02:00
zhangkaihuo	6f28a333c1	llama : MiniCPM support tied embeddings (#7664 ) * support lm_head * remove the code block --------- Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>	2024-06-03 10:49:30 +03:00
Georgi Gerganov	549279d804	llama : avoid double token-to-piece cache (#7654 ) ggml-ci	2024-06-03 08:34:43 +03:00
Johannes Gäßler	9b596417af	CUDA: quantized KV support for FA vec (#7527 ) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-01 08:44:14 +02:00
Georgi Gerganov	5921b8f089	llama : cache llama_token_to_piece (#7587 ) * llama : cache llama_token_to_piece ggml-ci * llama : use vectors and avoid has_cache ggml-ci * llama : throw on unknown tokenizer types ggml-ci * llama : print a log of the total cache size	2024-05-31 02:01:41 +10:00
Georgi Gerganov	fb76ec31a9	ggml : fix YARN + add tests + add asserts (#7617 ) * tests : add rope tests ggml-ci * ggml : fixes (hopefully) ggml-ci * tests : add non-cont tests ggml-ci * cuda : add asserts for rope/norm + fix DS2 ggml-ci * ggml : assert contiguousness * tests : reduce RoPE tests ggml-ci	2024-05-29 20:17:31 +03:00
jaime-m-p	02c1ecad07	Tokenizer WPM fixes (#7500 ) * Update random test: add_bos_token. * Update random test: add WPM models for testing. * Build vocab.special_tokens_cache using vocab token types. * Fix and improve WPM preprocessing. - Fix unicode edge case combinations. - Split by whitspace in the same pass. * Discard all tokens when no matching found.	2024-05-28 21:46:34 +02:00
Giuseppe Scrivano	5442939fcc	llama : support small Granite models (#7481 ) * Add optional MLP bias for Granite models Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/llama.cpp/issues/7116 Still needs some more changes to properly support Granite. * llama: honor add_space_prefix from the model configuration propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * llama: add support for small granite models it works only for the small models 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Co-authored-by: Steffen Roecker <sroecker@redhat.com>	2024-05-28 21:49:49 +03:00
fairydreaming	ee3dff6b8e	Add support for DeepseekV2ForCausalLM (#7519 ) * common : increase max number of experts to 160 * common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture * common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier * convert-hf : add model conversion support for DeepseekV2ForCausalLM * llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models * llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor) * llama : add inference support for LLM_ARCH_DEEPSEEK2 --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-28 17:07:05 +02:00
Georgi Gerganov	8b99e2aa66	llama : handle unknown utf8 bytes (#7588 )	2024-05-28 13:55:35 +03:00
Bartowski	c429b33beb	llama : add Smaug 70B support (#7402 )	2024-05-26 15:28:35 +03:00
Justine Tunney	00c6390793	main : don't print special tokens with --grammar (#6923 ) * main : don't print special tokens with --grammar The CLI interface was recently changed to print special control tokens like the </s> stop message one. This token shouldn't be printed if the grammar flag was passed, unless the grammar specifies it, because that breaks shell-scriptability. * main: use seperate stream for control characters * main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out * main: dprintf isn't part of the IEEE POSIX standard. Just use write(). * main: remove --ctrl-token-fd-out in favor for fcntl() based detection * common.cpp: accidentally removed --interactive-first * main: only merge stdout and control token if not in conversation or grammar mode * main: rejig control token descriptor handling * main: must check pipe status on very top of program * main: renamed --no-special from --ctrl-token-no-out and other refactoring * main: refactor ctrl_token_no_out --> no_special * llama: rename llama_token_is_control_token() to llama_token_is_control() * main: remove special token file descriptor feature (#5) --------- Co-authored-by: Brian <mofosyne@gmail.com>	2024-05-25 19:04:03 +10:00
Masaya, Kato	faa0e6979a	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433 ) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-05-25 11:42:31 +03:00
fairydreaming	fbca2f27fc	Add support for ArcticForCausalLM (#7020 ) * common : increase max number of experts to 128 * common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn * gguf-py : add architecture-specific block mappings that override selected general block mappings * convert-hf : add model conversion support for ArcticForCausalLM * convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM) * llama : add inference support for LLM_ARCH_ARCTIC --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-24 14:31:13 +02:00
Tristan Druyen	007489e895	Fix phi3 chat template confusion with zephyr (#7449 ) * Fix phi3 template matching vs zephyr * Add regression test for new phi3 chat template * Implement review suggestions * Fix phi3 jinja test templates & match by <\|end\|> * Apply suggestion Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Add all phi3 template variants in tests * Remove unneeded message trimming Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Fix tests to not expect trimmed messages --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-05-23 16:15:15 +02:00

1 2 3 4 5 ...

669 Commits