Commit Graph

3773 Commits

Author SHA1 Message Date
Iwan Kawrakow
409bfe6648 Remove what appears to be unnecessary asserts in ggml_cuda_cpy 2025-06-26 20:27:50 +03:00
Kawrakow
5236c98b41 CUDA: MMQ for iqX_r4 quants (#557)
* cuda: MMQ for iq2_k_r4

* cuda: MMQ for iq3_k_r4

* cuda: MMQ for iq4_k_r4

* cuda: MMQ for iq5_k_r4

* iqk_r4 quants: use MMQ only for batches < 1024 tokens

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-26 08:50:49 +02:00
Kawrakow
8e5106b20f Add Falcon-Edge support (#555)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-26 08:48:52 +02:00
Kawrakow
b5f2f00106 Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON (#553)
* iq1_s

66.3 t/s -> 168.8 t/s.

* iq1_m

19 t/s -> 163 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-24 14:21:37 +02:00
Kawrakow
64f6c2dead Much faster prompt processing for k-quants (ARM_NEON) (#552)
* iq2_xxs

55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.

* iq2_xs

46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.

* iq2_s

42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.

* iq3_xxs

51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.

* iq3_s

46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s

* q2_k

85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.

* q3_K

45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.

* q6_k

47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.

* q4_k

58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.

As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).

* q5_k

54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.

* iq4_xs

71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-24 13:05:01 +02:00
Kawrakow
ddda4d9e64 Much faster prompt processing for I-quants (ARM_NEON) (#550)
* iq2_xxs

55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.

* iq2_xs

46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.

* iq2_s

42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.

* iq3_xxs

51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.

* iq3_s

46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-23 15:50:24 +02:00
Kawrakow
4776dd2809 Much faster prompt processing for IQK quants (ARM_NEON) (#549)
* Faster GEMM for iq2_ks, iq4_ks

* iq5_ks

63.8 t/s -> 166 t/s. iq5_ks_r4 is at 107.4 t/s.
But: iq5_ks_r4 TG performance is quite a bit better:
21.7 t/s vs 17.7 t/s for iq5_ks.

* iq6_k

44 t/s -> 164.3 t/s. There is no iq6_k_r4

* iq5_k

46 t/s -> 167 t/s. iq5_k_r4 is at 99.5 t/s.

* iq4_k

46.4 -> 167.2 t/s. iq4_k_r4 is at 115 t/s.

* iq3_k

47.3 t/s -> 166.5 t/s. iq3_k_r4 is at 96.5 t/s.

* iq2_k

47.4 t/s -> 167 t/s. iq2_k_r4 is at 113.3 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-23 11:55:50 +02:00
Iwan Kawrakow
cac763fc20 To use GGML_ABORT we need to include ggml-impl.h. 2025-06-22 17:49:32 +03:00
Iwan Kawrakow
22d6817d1e Abort if IQK_IMPLEMENT is not defined 2025-06-22 16:49:38 +03:00
Kawrakow
4f97409b80 Faster ARM_NEON GEMM implementation for legacy quants (#546)
* iq2_kt and iq3_kt work with new int trellis

Much slower than the fp16 based trellis. I guess, Apple doesn't
have int8_t SIMD on the M2-Max GPU.

* q4_0

83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s

* q5_0

74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.

* q6_0

74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.

* q8_0

84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.

* iq4_nl

84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s

* q4_1

74.4 -> 115.4 t/s. There is no repacked variant

* q5_1

64.2 t/s -> 114.9 t/s. There is no repacked variant.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-21 16:35:08 +02:00
Kawrakow
a98b7678a3 Perhaps slightly faster trellis quants (#541)
* This seems slightly faster for IQ2_KT, IQ3_KT TG

* This looks better for iq4_kt TG

* WIP

* Cleanup

* With fancy simd also set func16

* Enable next_128() also on AVX2

Despite having just 16 vector registers it is still faster.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-21 16:32:16 +02:00
Kawrakow
1843ed22c5 New integer trellis on ARM_NEON (#544)
* Adapt iq3_kt to new trellis on NEON

* iq3_kt is now working on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-20 09:26:36 +03:00
Kawrakow
144ee1c4c6 Fix NEON build (#542)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-19 18:37:22 +03:00
firecoperana
3f111ad7bb add dry sampler (#513)
* add dry sampler

* use vocab instead of model in dry_init function

* fix compile error for build test

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-19 10:24:53 +03:00
saood06
c5368148cf Minor readme update (#535)
* Condense CUDA implementations

* move thing

* move thing

* move thing fix
2025-06-19 10:18:39 +03:00
Anton Sokolchenko
39e17589a2 Update CMakeLists.txt to fix NDEBUG handling (#537)
without my change

ggml_backend_cuda_graph_compute: disabling CUDA graphs due to mul_mat_id
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   54.433 |   150.50 |  414.061 |     4.95 |
|  8192 |   2048 |   8192 |   64.162 |   127.68 |  428.767 |     4.78 |

after my change to CMakeLists.txt

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   58.363 |   140.36 |  405.040 |     5.06 |
|  8192 |   2048 |   8192 |   63.752 |   128.50 |  423.548 |     4.84 |
|  8192 |   2048 |  16384 |   69.712 |   117.51 |  431.367 |     4.75 |
2025-06-19 10:18:21 +03:00
Kawrakow
c6166b4020 Fix missed block_q8_x2 bf16 -> i16 change (#540)
Closes #538

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-19 09:35:36 +03:00
Louie Helm
0ade534305 Fix KT Neon / ARM typo (#536)
Removes errant ";" in front of 0xCBAC1FED in non-x86 code

```
error: expected primary-expression before ';' token
     constexpr static uint32_t ka = ;0xCBAC1FED;
                                    ^
error: expected unqualified-id before numeric constant
     constexpr static uint32_t ka = ;0xCBAC1FED;
                                    ^
```
2025-06-18 19:55:02 +03:00
Iwan Kawrakow
7479c2a3e5 Fix MSVC compilation error 2025-06-18 16:48:36 +03:00
Kawrakow
d85c64428e New IQ2_KT, IQ3_KT and IQ4_KT, V2 (#529)
* New iq4_kt trellis

The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.

* Something is not working with the AVX2 dot product

* New iq4_kt: CUDA MMVQ

* New iq4_kt: CUDA MMQ

* For now have only iq4_kt use the new trellis

* Fix iq2_kt that got broken along the way

* New iq4_kt: AVX2 dot product finally works

We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.

* New iq4_kt: fix vanilla AVX2

* New iq4_kt: NEON implementation

We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.

* New iq4_kt: slightly faster NEON

* New iq4_kt: slightly faster NEON

* New iq4_kt: faster NEON

We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.

* Minor

* New iq4_kt trellis: not working Metal implementation

* Remove the extra 4 bytes of row meta data that is no longer used

* Cleanup

* Adding forgotten file

* Switching iq2_kt to new trellis - CUDA MMQ

* New iq2_kt: CUDA GEMV

* New iq2_kt: AVX2 dequantize

* New iq2_kt: AVX2 GEMM/GEMV

* Adding forgotten file

* New iq2_kt: NEON GEMM/GEMV

* New iq2_kt: slightly faster NEON GEMM

* New iq2_kt: Metal - very slow.

It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.

* Add missing break

* Trying @louiehelm's multiplier

* CPU

* iq3_kt: use integer trellis + CUDA dequantize and MMVQ

* iq3_kt: MMQ

* iq3_kt: AVX2 GEMM

* iq3_kt: AVX2 GEMV

* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 16:20:54 +03:00
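
The integer trellis formula quoted in d85c64428e can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the repository's code: the mask is taken to be 0x3f3f3f3f (the low 6 bits of each byte), which is consistent with the offset 126 = 4 * 63 / 2; `ka` is borrowed from the compile error quoted in the 0ade534305 entry; `kb_hypothetical` is a made-up placeholder, and the function names are invented for this example.

```cpp
#include <cstdint>

constexpr uint32_t ka = 0xCBAC1FEDu;            // multiplier quoted in the 0ade534305 entry
constexpr uint32_t kb_hypothetical = 0x12345u;  // placeholder, NOT the repository's constant

// Add up the four bytes of a 32-bit word.
// After masking each byte to 6 bits, every byte is in [0, 63] and the sum is in [0, 252].
constexpr int sum_as_uint8(uint32_t x) {
    return int(x & 0xffu) + int((x >> 8) & 0xffu) + int((x >> 16) & 0xffu) + int(x >> 24);
}

// Map a trellis index to a signed 8-bit value in [-126, 126].
// The multiplication wraps modulo 2^32, which is well defined for unsigned types.
constexpr int8_t trellis_value(uint32_t idx) {
    const uint32_t x = (ka * idx + kb_hypothetical) & 0x3f3f3f3fu;
    return int8_t(sum_as_uint8(x) - 126);
}
```

Summing four 6-bit values and re-centring gives a roughly bell-shaped distribution of int8_t values, which is presumably why "quickly add 4 8-bit ints" matters for the Metal port mentioned in the same commit.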
Kawrakow
c410cc72bb Much faster CPU prompt processing (part 3) (#534)
* Repack q4_0 and q8_0 to q8_0_R8

q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.

* Change q8_2_x4 to store int16_t sums

With that q4_0 now works.
I need to check all quants that use q8_2_x4!

* q5_0 and use a dequantizing template

* q6_0

129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.

* iq4_nl

137 t/s -> 293 t/s. iq4_nl_r4 is at 251 t/s.

* q4_1: 135 t/s -> 262 t/s

* q5_1: 125 t/s -> 253 t/s

* iq3_xs

178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.

* q2_K

202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 15:30:56 +03:00
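
One possible reading of why storing int16_t sums in q8_2_x4 fixed q4_0 in c410cc72bb: bf16 keeps only 8 significant bits, so an integer block sum above 256 is generally not representable exactly, while int16_t holds it exactly. The sketch below only demonstrates that rounding effect. The helper names are hypothetical, the log does not say whether the stored sums are raw or scaled, and the block size of 32 quants is assumed by analogy with q8_0.

```cpp
#include <cstdint>
#include <cstring>

// Convert a float to raw bfloat16 bits using round-to-nearest-even.
// NaN handling is omitted; the sketch only feeds in small integers.
inline uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return uint16_t(bits >> 16);
}

// Expand raw bfloat16 bits back to a float (the low 16 mantissa bits become zero).
inline float bf16_to_float(uint16_t h) {
    const uint32_t bits = uint32_t(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Store an integer block sum as bf16 and read it back.
inline int roundtrip_bf16(int sum) {
    return int(bf16_to_float(float_to_bf16(float(sum))));
}

// Store the same sum as int16_t and read it back.
// With 32 int8_t quants per block (assumed) the sum lies in [-4096, 4064], so it always fits.
inline int roundtrip_i16(int sum) {
    return int(int16_t(sum));
}
```

For example, a sum of 1001 comes back as 1000 after the bf16 round trip but is unchanged through int16_t.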
Kawrakow
dc96820ddb Much faster CPU prompt processing (part 2) (#533)
* iq4_ks

203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.

* iq4_k

175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.

PPL is actually lower!

* iq5_ks

180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.

PPL is actually lower - 7.4160 vs 7.4494 for Llama-3.1-8B-Instruct

* iq5_k - accuracy loss is too big

* iq5_k - there was a bug with the shifts

...and that's why PPL was so high. It is also high on main.
This fixes it.

* iq6_k

148 t/s -> 350 t/s. There is no iq6_k_r4

PPL is actually lower because we have a bug in the existing
implementation!

* iq3_k

169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.

* iq2_k

190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.

* iq2_ks

200 t/s -> 367 t/s. There is no iq2_ks_r4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 07:29:33 +03:00
Kawrakow
8b3002bba2 Send [DONE] for OAI compatibility (#470)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-17 10:32:53 +03:00
Kawrakow
0f8f8b32e2 Much faster CPU prompt processing (part 1) (#531)
* q6_K dequantizing GEMM

* Much easier: just use different vec_dot types!

* WIP

* Finally q6_K x q8_2_x4 dot product works

* Very slightly better

* We don't need the changes in ggml.c

* Fix AVX2

* iq2_xs

* Fix AVX2

* iq2_s

* q3_K

* Fix q8_k_r8 on Zen4

* q3_K: repack to q8_k_r8 instead of q8_0_r8

With that we hit 360 t/s for Llama-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.

* q3_K: don't scale when all quants in a block are <= 127 when repacking

* iq2_s: repack to q8_k_r8 instead of q8_0_r8

* iq2_xs: repack to q8_k_r8

* WIP

* iq2_xs: repack to q8_k_r8

* iq3_xxs: repack to q8_k_r8

* iq3_s: use q8_k_r8

* iq1_s: repack to q8_k_r8

* iq1_m: repack to q8_k_r8

* iq1_m: slightly faster

* Slightly faster

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-17 07:12:48 +03:00
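
The "~7%" repacking cost quoted for q3_K in 0f8f8b32e2 can be recovered from the two throughputs in that message, assuming prompt-processing time is dominated by the GEMMs and that repacking is the only difference between the two runs:

```latex
\frac{t_{\text{repack}}}{t_{\text{GEMM}}}
  = \frac{\frac{1}{360} - \frac{1}{386}}{\frac{1}{386}}
  = \frac{386}{360} - 1
  \approx 0.072
```

Here $1/360$ and $1/386$ are the per-token times (in seconds) of the repacked q3_K path and of native q8_k_r8, respectively.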
Kawrakow
6fc5bbb657 Call iqk_convert_repack in MoE GEMM (#528)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-14 05:52:46 +03:00
Kawrakow
066ed4fd11 Faster CPU prompt processing for Q4_K and Q5_K (#525)
* q4_K: dequantize to q8_1_r8 for batch >= 32

We get 268 t/s, up from 186 t/s.

* q4_K: GEMM with q8_2_X4

* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8

* Remove the scales, they are not needed

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-13 07:58:15 +03:00
saood06
f72983f7fe Update News section of readme (#510)
* Convert existing News to new format

* Update with new ones

* Add more links and minor fix

* more minor fixes

* requested changes

* Add old PRs

* Add more old PRs

* Add all IQK quants
2025-06-13 07:56:40 +03:00
Kawrakow
7a882f0b63 Perhaps a slightly better version for IQ2_XXS, IQ3_XXS, IQ3_S GEMV (#524)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-13 07:55:57 +03:00
Kawrakow
b57bd8658b Better strategy for GPU offload (#520)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-12 19:25:11 +03:00
firecoperana
7b1a3eece7 Add top n sigma sampler and other webui fix (#512)
Co-authored-by: firecoperana <firecoperana>
2025-06-12 08:19:26 +03:00
Kawrakow
4fc3cb4a47 iq3_s: much faster GEMM via repacking to q8_0_r8 (#518)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-12 08:16:12 +03:00
Kawrakow
3f54b49786 Faster iq1_s GEMM via repacking to Q8_0_R8 (#517)
TG is slightly faster too - 24.4 vs 23.1 t/s on the
Ryzen-5975WX

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 15:01:34 +03:00
Kawrakow
69af3f5990 Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2) (#516)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 13:05:26 +03:00
Kawrakow
e56061fa12 IQ2_XXS: much faster CPU prompt processing (#515)
* Much faster iq2_xxs GEMM

PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.

* iq2_xxs: q8_2_x4 GEMM

* iq2_xxs: use template for q8_2_x4 GEMM

* Fix AVX2

* Cleanup

* NEON is not working yet, so still use Q8_K GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 11:12:30 +03:00
Gaolingx
3c1f2c68fd Fix Compile error (C2668) (#508)
* cmake: force MSVC compiler charset to utf-8

* build: apply MSVC /bigobj option to c/cpp files only

* Update CMakeLists.txt

* Fix Compile error (C2668)

* revert hsum_float_8x8
2025-06-10 08:30:17 +03:00
saood06
fa90a9864a Docs update (#509)
* use npm as deps manager and vite as bundler

* update XTC docs

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-06-09 05:32:03 -05:00
firecoperana
58f08e4385 Fix non rpc build error (#506)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tcp no delay for rpc

* add back webui

* fix rpc function not found error

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 17:27:00 +03:00
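
The "Bounds Checks" item in the #13069 message above (error return instead of `GGML_ABORT` when data/offset parameters are out of buffer bounds) can be sketched as follows. This is only an illustration of an overflow-safe range check: `buffer_view` and `region_in_bounds` are hypothetical names, not ggml or rpc-server identifiers, and the actual handlers work on ggml tensors and buffers.

```cpp
#include <cstdint>

// Hypothetical stand-in for a server-side buffer; not a ggml type.
struct buffer_view {
    uint64_t base;  // start address of the buffer
    uint64_t size;  // size of the buffer in bytes
};

// Return true if [data + offset, data + offset + nbytes) lies inside buf.
// Each comparison is arranged as a subtraction from a known-larger value,
// so client-supplied numbers can never make an addition wrap around.
inline bool region_in_bounds(const buffer_view & buf, uint64_t data,
                             uint64_t offset, uint64_t nbytes) {
    if (data < buf.base) return false;
    const uint64_t rel = data - buf.base;    // where the tensor data starts inside the buffer
    if (rel > buf.size) return false;
    const uint64_t avail = buf.size - rel;   // bytes left from the tensor start to the buffer end
    if (offset > avail) return false;
    return nbytes <= avail - offset;
}
```

A handler written in this style would log the problem and return `false` when the check fails, leaving the server process running.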
Iwan Kawrakow
1eabdb420b Revert "Rpc improvement (#480)"
This reverts commit 8a5f8573ae.
2025-06-08 14:49:50 +03:00
firecoperana
8a5f8573ae Rpc improvement (#480)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tcp no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 14:43:21 +03:00
Kawrakow
63ef0a392b Update AUTHORS 2025-06-08 14:41:17 +03:00
firecoperana
df170c83a5 Webui improvement (#481)
* update webui

* add token/s in webui

* add webui files

* fix webui first message disappear in some browser

* add missing html files

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-08 14:38:47 +03:00
saood06
9e567e385a Add an endpoint that lists all the saved prompt caches to server (#502) 2025-06-07 00:22:56 -05:00
Kawrakow
8c1d5a2033 Fix #499 (#501)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-07 08:19:18 +03:00
saood06
ffd87f282e Make prompt cache saving and restoring MLA aware (#497)
* Remove kv_l, kvt_l and just use k_l and v_l

* Hopefully take care of missing V cache (MLA)

* Fix save and restore when there is no V cache

* Fix double print

* Update write_kv_cache_data and read_kv_cache_data to be MLA aware

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-06 11:33:47 +03:00
Kawrakow
eded4e20d4 IQ1_M_R4 CUDA implementation (#494)
* iq1_m_r4: CUDA dequantize

* iq1_m_r4: CUDA dequantize

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 19:13:51 +03:00
Kawrakow
8ffad187ab MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493)
* MMQ for iq4_ks_r4

* MMQ for iq5_ks_r4

* Add forgotten file

* Another forgotten file

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:31:20 +03:00
Kawrakow
0b10f7418f Faster CPU prompt processing for Trellis quants and MoE models (#488)
* Also do the dequantize approach for mul_mat_id

* Also do the dequantize approach for iqk_moe_fused_up_gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:30:35 +03:00
Kawrakow
7e79665a31 CUDA implementation for IQ1_S_R4 (#492)
* iq1_s_r4: CUDA dequantize

* iq1_s_r4: CUDA GEMV

* iq1_s_r4: MMQ on CUDA

Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 07:24:31 +03:00
Kawrakow
f6d5fbdc57 Adding top-n-sigma sampler (#489)
* Adding top-n-sigma sampler

* Fix typos in XTC PR

* Update README.md for main and server

* More README

* More README

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 17:35:09 +03:00
Kawrakow
ccb265c016 Adding the XTC sampler (#486)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 11:32:03 +03:00