ik_llama.cpp: llama.cpp fork with better CPU performance
TL;DR
This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new state-of-the-art (SOTA) quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations, tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, and more.
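For example, a hybrid GPU/CPU run of a large MoE model might keep attention weights on the GPU while routing the big expert tensors to CPU RAM via a tensor-override regex, with fused MoE ops and run-time repacking enabled. The sketch below is illustrative only: the model path and regex are made up, and flag spellings can vary between builds, so verify against `llama-server --help`.

```bash
# Hybrid GPU/CPU sketch (illustrative model path and regex):
#  -ngl 99 : offload all layers that fit to the GPU
#  -ot     : tensor override; tensors matching the regex stay in CPU RAM
#  -fmoe   : fused MoE operations
#  -rtr    : run-time repacking into row-interleaved quants for CPU-held tensors
./llama-server -m DeepSeek-R1-IQ4_XS.gguf -ngl 99 \
    -ot "\.ffn_.*_exps\.=CPU" -fmoe -rtr
```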
Latest News
- May 12 2025: Users can now control whether and which operations with tensors held in RAM are offloaded to the GPU. See PR 405
- May 12 2025: Compatibility issues with mainline `llama.cpp` GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using `llama.cpp`-style MLA GGUFs was recovered in PR 409.
- May 11 2025: 🚀 Slightly faster flash attention for DeepSeek models on CUDA, along with extended compatibility to Turing or newer GPUs. See PR 408
- May 9 2025: Support for LlaMA-3-Nemotron models added, see PR 377
- May 7 2025: 🚀 Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
- May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR 370
- April 29 2025: Qwen3 support added, see PR 355
- April 26 2025: GLM-4 support added, see PR 344
- April 26 2025: Command-A support added, see PR 341
- April 22 2025: Support for the latest Microsoft Bitnet model added, see PR 337
- April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux), see PR 336
- April 17 2025: 🚀 Better CPU Flash Attention token generation performance, see PR 332
- April 13 2025: `IQ1_M` quantization improvements, see PR 327
- April 10 2025: LLaMA-4 support added, see PR 321. The PR also provides some custom quantization recipes for L4-Scout.
- April 7 2025: `IQ2_XS` quantization improvements, see PR 312
- April 3 2025: 🚀 Much faster MoE implementation on Metal, see PR 307
- April 1 2025: Quantization improvements for `Q2_K`, `Q4_K`, `Q5_K`, `Q4_1`, `Q5_1`, see PR 302
- March 28 2025: Quantization improvements for `Q4_0`, `Q5_0`, `Q6_0`, `Q3_K`, `Q6_K`, `IQ4_XS`, `IQ4_NL`, see PR 295
- March 25 2025: 🚀 Better MoE performance on CUDA
- March 23 2025: 🚀 Better batched processing speed for DeepSeek models
- March 22 2025: Gemma3 support added
- March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models
- March 18 2025: Reduce compute buffer size
- March 17 2025: 🚀 FlashMLA-2 performance improvements
- March 12 2025: Allow `Q8_0` KV cache with FlashMLA-2 on CUDA (a combined usage sketch follows this list)
- March 10 2025: 🚀 Better TG performance for MoE models on CUDA
- March 9 2025: 🚀 FlashMLA on CUDA
- March 8 2025: 🚀 Faster FlashMLA CPU implementation
- March 7 2025: Custom quantization mixes using regular expressions
- March 5 2025: 🚀 FlashMLA on CUDA
- March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention
- March 1 2025: Smart Expert Reduction for faster DeepSeek inference
- Feb 27 2025: MLA without transposed cache
- Feb 25 2025: Tensor overrides for better control where model weights are stored (GPU or CPU)
- Feb 23 2025: 🚀 Fused FFN ops for faster MoE inference
- Feb 23 2025: `sweep-bench` - better performance benchmarking
- Feb 20 2025: 🚀 Fast GEMM/GEMV for `IQ1_S`
- Feb 19 2025: `Q8_KV` - new type for 8-bit KV-cache quantization
- Feb 13 2025: Allow `Q8_0` quantized cache with MLA
- Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
- Feb 9 2025: 🚀 MLA for DeepSeek models
- Jan 23 2025: DeepSeek-V3 support added
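Several of the entries above (MLA, FlashMLA, Flash Attention, quantized KV cache) are enabled through run-time flags. The sketch below shows one plausible way to combine them for a DeepSeek model; the flag names follow the PRs referenced above, but defaults and spellings may have changed since, so treat this as an assumption and check `--help`.

```bash
# DeepSeek sketch combining MLA, Flash Attention, and a Q8_0 KV cache
# (flag names as referenced in the PRs above; verify with --help):
#  -mla 3    : MLA / FlashMLA-3 attention path
#  -fa       : Flash Attention
#  -ctk q8_0 : Q8_0-quantized K cache
./llama-server -m deepseek-v3.gguf -mla 3 -fa -ctk q8_0
```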
Resources
There is no single point of reference describing all new ik_llama.cpp features. Pull requests usually contain detailed information, so browsing them is often the best way to learn about new features and how to use them. In addition:
- The Wiki page has performance comparisons to mainline `llama.cpp`
- This guide is a good place to start if you came here because of DeepSeek models
- This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
- This discussion describes the new quantization types available in `ik_llama.cpp`
Contributing
Contributions in the form of pull requests, issue submissions (bug reports, feature requests), and general discussions are welcome.
License
MIT
Languages
C++ 55.4% · C 16.4% · Cuda 14% · Python 5.5% · Metal 3% · Other 5.6%