ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-24 15:14:10 +00:00

Author	SHA1	Message	Date
Kawrakow	22863cf9c9	Be able to set a max. number of GPUs to be used in split mode graph (#1051) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:22:53 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result. But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better, but the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler to create two splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler. This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split. Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload, but no tensor overrides yet, just ngl < num_layers. 
* Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works. PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug. It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disappears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides. Runs with wrong results, don't see where the issue could be. * This works but is slow. Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guard against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn. Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00
Kawrakow	668c37d4cf	DeepSeek: enable option to merge Q and K tensors (#941) * Merge Q and K for DeepSeek * Formatting --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-14 08:23:04 +02:00
firecoperana	e15a215e6b	model : Port Minimax M2 from mainline (#907) Co-authored-by: firecoperana <firecoperana>	2025-11-06 18:09:24 +02:00
Kawrakow	cb30f8e057	Merge Q and K into a single tensor (#892) * Merge Q and K into a single tensor * Make V mul mat follow QK mul mat so they can be fused, which gives a slightly better TG performance. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:54:36 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - sycl and metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix: n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	37c4d19021	Compiler warning	2025-10-31 14:58:00 +02:00
firecoperana	a3bd0158f7	Disable pipeline parallel for tensor override or when allocation fails (#879) * disable pipeline parallelism when tensor override present * disable pipeline parallel if allocation failed --------- Co-authored-by: firecoperana <firecoperana>	2025-10-31 14:20:48 +02:00
firecoperana	904e994bfb	Support --device and --device-draft parameter (#866) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
Kawrakow	f7adde1043	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
Kawrakow	ba9fefb73d	gpt-oss: duplicate expert biases when necessary (#829) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-14 14:38:40 +03:00
Kawrakow	4daff01b39	Refactor file llama.cpp (#823) * llama_model and llama_hparams * llama_build_context. Surprisingly small reduction in llama.cpp compile time given the reduction in LOCs (22k -> 14k) * LLM_TN. llama.cpp compilation: 50 s -> 33 s * llama_quantize * arch names * All graph building is now in llm-build-context.cpp * hparams loading. llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile. * We are now at 6 seconds to build the src folder * load -> create. We are not actually loading the tensors, but just creating them. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 11:35:20 +03:00

13 Commits