ik_llama.cpp/github-data/pull_requests/193 - RPC sync.md
2025-07-23 13:31:53 +02:00

🔀 #193 - RPC sync

Author saood06
State Closed
Created 2025-02-08
Updated 2025-06-15

Description

I grabbed all of the changes needed for llama.cpp/pull/11047, which in turn consisted of https://github.com/ggerganov/llama.cpp/pull/9912 and https://github.com/ggerganov/llama.cpp/pull/9040

This compiles, but has not been tested yet.


💬 Conversation

👤 ikawrakow commented on 2025-02-08 at 13:23:08:

I never use RPC and have never looked into the RPC code, so I'll have to rely on you for self-review and testing.


👤 saood06 commented on 2025-02-10 at 16:40:34:

@jukofyork

> I strongly suspect something funky is going on

There is, see this comment: https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2625090660

This fork has much faster PP (prompt processing) speeds and DeepSeek MLA support behind a flag (-mla); this PR should allow RPC to work, and I'm working on porting over the option to override model tensor buffers.
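
For reference, RPC in upstream llama.cpp is driven by a standalone rpc-server process plus a --rpc flag on the client. A minimal sketch of that usage, assuming the binary names, build option, and port follow upstream (none of this is verified against this fork; -mla is the fork-specific flag mentioned above):

```bash
# On the machine hosting the remote backend (assumes an RPC-enabled build,
# e.g. cmake -DGGML_RPC=ON; the exact option name may differ on this tree):
./rpc-server -p 50052

# On the client, route offloaded layers to the RPC server; --rpc takes
# comma-separated host:port pairs. -mla is the fork's MLA flag mentioned
# above (check --help for its exact syntax).
./llama-cli -m model.gguf --rpc 192.168.1.10:50052 -ngl 99 -mla
```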


👤 saood06 commented on 2025-02-27 at 23:11:54:

This has been tested, and it does not currently work. I'm not sure why, as the errors I'm getting don't seem to have been encountered by anyone on llama.cpp.


👤 saood06 submitted a review on 2025-02-27 at 23:14:23: 💬 COMMENTED


👤 saood06 commented during a code review on 2025-02-27 at 23:14:23 on ggml/src/ggml-rpc.cpp:

The RPC client crashes here, which happens when the RPC server hits an issue.


👤 saood06 submitted a review on 2025-02-27 at 23:17:32: 💬 COMMENTED


👤 saood06 commented during a code review on 2025-02-27 at 23:17:32 on ggml/src/ggml-rpc.cpp:

I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message because I never ran with GGML_DEBUG enabled.
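
For anyone trying to reproduce the crash message: GGML_DEBUG is a compile-time macro in ggml rather than a runtime switch, so the debug prints only exist if the library is rebuilt with it defined. A minimal sketch, assuming a standard CMake build of this tree (option names and macro handling not verified here):

```bash
# Rebuild with the GGML_DEBUG macro defined so ggml's debug-level prints
# (including the one referenced above) are compiled in.
cmake -B build -DGGML_RPC=ON \
      -DCMAKE_C_FLAGS="-DGGML_DEBUG=1" \
      -DCMAKE_CXX_FLAGS="-DGGML_DEBUG=1"
cmake --build build --config Release
```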


👤 saood06 commented on 2025-04-12 at 04:39:37:

> @saood06
>
> I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inferencing.
>
> I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.

Thanks for the link, it is interesting. I think it would work for dense models, but not as well for MoE, because as far as I can tell it doesn't handle -ot (this commit looks relevant). I'd also need Windows support, which is on the roadmap (though I might try building it on my machine to see what the issue is and whether I can fix it), since the GPU machine has to run Windows (my big RAM box runs Clear Linux, and I have other servers that run FreeBSD and Proxmox).


👤 saood06 commented on 2025-06-15 at 11:26:50:

Closed as superseded by #480 / #506