ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-27 01:49:28 +00:00

Author	SHA1	Message	Date
Kawrakow	4bcfb40711	This is slightly better, but the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler into creating two splits that can be done in parallel, and then have a graph split where the results get combined.	2025-11-30 18:05:13 +00:00
Kawrakow	97143330a1	This works, but it is slow	2025-11-30 18:05:13 +00:00
Kawrakow	32c6df015b	WIP	2025-11-30 18:05:12 +00:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00
Kawrakow	668c37d4cf	DeepSeek: enable option to merge Q and K tensors (#941) * Merge Q and K for DeepSeek * Formatting --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 08:23:04 +02:00
firecoperana	e15a215e6b	model : Port Minimax M2 from mainline (#907) Co-authored-by: firecoperana <firecoperana>	2025-11-06 18:09:24 +02:00
Kawrakow	cb30f8e057	Merge Q and K into a single tensor (#892) * Merge Q and K into a single tensor * Make V mul mat follow QK mul mat so they can be fused, which gives a slightly better TG performance. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:54:36 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - SYCL and Metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	37c4d19021	Compiler warning	2025-10-31 14:58:00 +02:00
firecoperana	a3bd0158f7	Disable pipeline parallel for tensor override or allocation failed (#879) * disable pipeline parallelism when tensor override present * disable pipeline parallel if allocation failed --------- Co-authored-by: firecoperana <firecoperana>	2025-10-31 14:20:48 +02:00
firecoperana	904e994bfb	Support --device and --device-draft parameter (#866) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
Kawrakow	f7adde1043	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
Kawrakow	ba9fefb73d	gpt-oss: duplicate experts biases when necessary (#829) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-14 14:38:40 +03:00
Kawrakow	4daff01b39	Refactor file llama.cpp (#823) * llama_model and llama_hparams * llama_build_context Surprisingly small reduction in llama.cpp compile time given the reduction in LOCs (22k -> 14k) * LLM_TN llama.cpp compilation: 50 s -> 33 s * llama_quantize * arch names * All graph building is now in llm-build-context.cpp * hparams loading llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile. * We are now at 6 seconds to build the src folder * load -> create We are not actually loading the tensors, but just creating them. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 11:35:20 +03:00

14 Commits