518 Commits

Author SHA1 Message Date
Kawrakow
1d6e143ecb Fix NEON build (#542)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-19 18:37:22 +03:00
Anton Sokolchenko
d595dfaa2f Update CMakeLists.txt to fix NDEBUG handling (#537)
Without my change:

ggml_backend_cuda_graph_compute: disabling CUDA graphs due to mul_mat_id
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   54.433 |   150.50 |  414.061 |     4.95 |
|  8192 |   2048 |   8192 |   64.162 |   127.68 |  428.767 |     4.78 |

After my change to CMakeLists.txt:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   58.363 |   140.36 |  405.040 |     5.06 |
|  8192 |   2048 |   8192 |   63.752 |   128.50 |  423.548 |     4.84 |
|  8192 |   2048 |  16384 |   69.712 |   117.51 |  431.367 |     4.75 |
2025-06-19 10:18:21 +03:00
Kawrakow
08d2100c07 Fix missed block_q8_x2 bf16 -> i16 change (#540)
Closes #538

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-19 09:35:36 +03:00
Louie Helm
4fa4aeec5d Fix KT Neon / ARM typo (#536)
Removes errant ";" in front of 0xCBAC1FED in non-x86 code

```
error: expected primary-expression before ';' token
     constexpr static uint32_t ka = ;0xCBAC1FED;
                                    ^
error: expected unqualified-id before numeric constant
     constexpr static uint32_t ka = ;0xCBAC1FED;
                                    ^
```
2025-06-18 19:55:02 +03:00
Kawrakow
87105fc3b4 Fix MSVC compilation error 2025-06-18 16:48:36 +03:00
Kawrakow
d345a15a84 New IQ2_KT, IQ3_KT and IQ4_KT, V2 (#529)
* New iq4_kt trellis

The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126
(a sketch of this generator appears at the end of this commit message).
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than the original QTIP trellis.

* Something is not working with the AVX2 dot product

* New iq4_kt: CUDA MMVQ

* New iq4_kt: CUDA MMQ

* For now have only iq4_kt use the new trellis

* Fix iq2_kt that got broken along the way

* New iq4_kt: AVX2 dot product finally works

We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.

* New iq4_kt: fix vanilla AVX2

* New iq4_kt: NEON implementation

We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.

* New iq4_kt: slightly faster NEON

* New iq4_kt: slightly faster NEON

* New iq4_kt: faster NEON

We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.

* Minor

* New iq4_kt trellis: non-working Metal implementation

* Remove the extra 4 bytes of row meta data that is no longer used

* Cleanup

* Adding forgotten file

* Switching iq2_kt to new trellis - CUDA MMQ

* New iq2_kt: CUDA GEMV

* New iq2_kt: AVX2 dequantize

* New iq2_kt: AVX2 GEMM/GEMV

* Adding forgotten file

* New iq2_kt: NEON GEMM/GEMV

* New iq2_kt: slightly faster NEON GEMM

* New iq2_kt: Metal - very slow.

It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.

* Add missing break

* Trying @louiehelm's multiplier

* CPU

* iq3_kt: use integer trellis + CUDA dequantize and MMVQ

* iq3_kt: MMQ

* iq3_kt: AVX2 GEMM

* iq3_kt: AVX2 GEMV

* The trellis quants now need super-blocks of 256, so we need a check
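
For illustration, a minimal sketch of the integer trellis generator described in the first item above. The multiplier ka is the 0xCBAC1FED constant referenced in #536; kb is a made-up placeholder, not the repository's value.

```cpp
#include <cstdint>

// Sketch of the generator described above: the value for index idx is the sum
// of the four bytes of (ka * idx + kb) & 0x3f3f3f3f, shifted by -126.  Each
// masked byte lies in [0, 63], so the result lies in [-126, 126], i.e. int8_t
// range.  kb below is a hypothetical constant chosen only for illustration.
static inline int8_t trellis_value(uint32_t idx) {
    constexpr uint32_t ka = 0xCBAC1FED;   // multiplier referenced in #536
    constexpr uint32_t kb = 0x12345678;   // hypothetical additive constant
    const uint32_t x = (ka * idx + kb) & 0x3f3f3f3f;
    const int sum = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + (x >> 24);
    return (int8_t)(sum - 126);
}
```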

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 16:20:54 +03:00
Kawrakow
129b58b150 Much faster CPU prompt processing (part 3) (#534)
* Repack q4_0 and q8_0 to q8_0_R8

q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.

* Change q8_2_x4 to store int16_t sums (see the layout sketch at the end of this commit message)

With that q4_0 now works.
I need to check all quants that use q8_2_x4!

* q5_0 and use a dequantizing template

* q6_0

129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.

* iq4_nl

137 t/s -> 293 t/s. iq4_nl is at 251 t/s.

* q4_1: 135 t/s -> 262 t/s

* q5_1: 125 t/s -> 253 t/s

* iq4_xs

178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.

* q2_K

202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.
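
For reference, a hypothetical layout illustrating the int16_t-sums change mentioned above; field names and the exact layout are illustrative, not the repository's definitions.

```cpp
#include <cstdint>

typedef uint16_t ggml_half;   // placeholder for the fp16 storage type

// Hypothetical x4 block layout: the per-sub-block sums of 32 int8 quants fit
// exactly in int16_t (range [-4096, 4064]), so storing them as int16_t instead
// of a 16-bit float removes the rounding error that hurt q4_0.
struct block_q8_2_x4_sketch {
    ggml_half d[4];       // one fp16 scale per sub-block
    int16_t   s[4];       // exact sums of each sub-block's quants (was a 16-bit float)
    int8_t    qs[4 * 32]; // 4 interleaved sub-blocks of 32 quants
};
```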

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 15:30:56 +03:00
Kawrakow
57d283de02 Much faster CPU prompt processing (part 2) (#533)
* iq4_ks

203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.

* iq4_k

175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.

PPL is actually lower!

* iq5_ks

180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.

PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct

* iq5_k - accuracy loss is too big

* iq5_k - there was a bug with the shifts

...and that's why PPL was so high. It is also high on main.
This fixes it.

* iq6_k

148 t/s -> 350 t/s. There is no iq6_k_r4

PPL is actually lower because we have a bug in the existing
implementation!

* iq3_k

169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.

* iq2_k

190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.

* iq2_ks

200 t/s -> 367 t/s. There is no iq2_ks_r4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 07:29:33 +03:00
Kawrakow
23ac643459 Much faster CPU prompt processing (part 1) (#531)
* q6_K dequantizing GEMM

* Much easier: just use different vec_dot types!

* WIP

* Finally q6_K x q8_2_x4 dot product works

* Very slightly better

* We don't need the changes in ggml.c

* Fix AVX2

* iq2_xs

* Fix AVX2

* iq2_s

* q3_K

* Fix q8_k_r8 on Zen4

* q3_K: repack to q8_k_r8 instead of q8_0_r8

With that we hit 360 t/s for LlaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.

* q3_K: don't scale when all quants in a block are <= 127 when repacking

* iq2_s: repack to q8_k_r8 instead of q8_0_r8

* iq2_xs: repack to q8_k_r8

* WIP

* iq2_xs: repack to q8_k_r8

* iq3_xxs: repack to q8_k_r8

* iq3_s: use q8_k_r8

* iq1_s: repack to q8_k_r8

* iq1_m: repack to q8_k_r8

* iq1_m: slightly faster

* Slightly faster

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-17 07:12:48 +03:00
Kawrakow
bb8ff4e9f0 Call iqk_convert_repack in MoE GEMM (#528)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-14 05:52:46 +03:00
Kawrakow
b7768e203f Faster CPU prompt processing for Q4_K and Q5_K (#525)
* q4_K: dequantize to q8_1_r8 for batch >= 32

We get 268 t/s, up from 186 t/s.

* q4_K: GEMM with q8_2_X4

* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8

* Remove the scales, they are not needed

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-13 07:58:15 +03:00
Kawrakow
fb30146ce8 Perhaps a slightly better version for IQ2_XXS, IQ3_XXS, IQ3_S GEMV (#524)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-13 07:55:57 +03:00
Kawrakow
dc663fe632 Better strategy for GPU offload (#520)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-12 19:25:11 +03:00
Kawrakow
baa412aa42 iq3_s: much faster GEMM via repacking to q8_0_r8 (#518)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-12 08:16:12 +03:00
Kawrakow
bbf3f20df1 Faster iq1_s GEMM via repacking to Q8_0_R8 (#517)
TG is slightly faster too - 24.4 vs 23.1 t/s on the
Ryzen-5975WX

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 15:01:34 +03:00
Kawrakow
c9fd42520e Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2) (#516)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 13:05:26 +03:00
Kawrakow
bdf1b34493 IQ2_XXS: much faster CPU prompt processing (#515)
* Much faster iq2_xxs GEMM

PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.

* iq2_xxs: q8_2_x4 GEMM

* iq2_xxs: use template for q8_2_x4 GEMM

* Fix AVX2

* Cleanup

* NEON is not working yet, so still use Q8_K GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-11 11:12:30 +03:00
Gaolingx
b2a4df3590 Fix Compile error (C2668) (#508)
* cmake: force MSVC compiler charset to utf-8

* build: apply MSVC /bigobj option to c/cpp files only

* Update CMakeLists.txt

* Fix Compile error (C2668)

* revert hsum_float_8x8
2025-06-10 08:30:17 +03:00
firecoperana
d1ae1504c6 Fix non rpc build error (#506)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
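
As a hedged sketch (not the repository's exact code) of the type-validation pattern described above: reject an out-of-range ggml_type before the tensor is created, so a malformed request turns into a nullptr instead of an assertion. The wire struct below is an illustrative stand-in.

```cpp
#include <cstdint>
#include "ggml.h"

struct rpc_tensor_view {            // illustrative stand-in for the wire format
    uint32_t type;
    uint64_t ne[4];
};

static ggml_tensor * deserialize_tensor_checked(ggml_context * ctx, const rpc_tensor_view * t) {
    if (t->type >= (uint32_t) GGML_TYPE_COUNT) {
        return nullptr;             // caller reports the request as failed
    }
    return ggml_new_tensor_4d(ctx, (ggml_type) t->type,
                              (int64_t) t->ne[0], (int64_t) t->ne[1],
                              (int64_t) t->ne[2], (int64_t) t->ne[3]);
}
```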

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tcp no delay for rpc

* add back webui

* fix rpc function not found error

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 17:27:00 +03:00
Kawrakow
7fc39ae7e8 Revert "Rpc improvement (#480)"
This reverts commit 8a5f8573ae.
2025-06-08 14:49:50 +03:00
firecoperana
ed9e8ecc9b Rpc improvement (#480)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tcp no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 14:43:21 +03:00
Kawrakow
a214d22864 Fix #499 (#501)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-07 08:19:18 +03:00
Kawrakow
1b9b7bd5ad IQ1_M_R4 CUDA implementation (#494)
* iq1_m_r4: CUDA dequantize

* iq1_m_r4: CUDA dequantize

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 19:13:51 +03:00
Kawrakow
3e70d984dd MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493)
* MMQ for iq4_ks_r4

* MMQ for iq5_ks_r4

* Add forgotten file

* Another forgotten file

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:31:20 +03:00
Kawrakow
636d97fefa Faster CPU prompt processing for Trellis quants and MoE models (#488)
* Also do the dequantize approach for mul_mat_id

* Also do the dequantize approach for iqk_moe_fused_up_gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:30:35 +03:00
Kawrakow
631ceee073 CUDA implementation for IQ1_S_R4 (#492)
* iq1_s_r4: CUDA dequantize

* iq1_s_r4: CUDA GEMV

* iq1_s_r4: MMQ on CUDA

Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).
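
A hedged sketch of the capability gate described above (not the repository's dispatch code): enable the quantized MMQ path only on Turing (compute capability 7.5) or newer, otherwise fall back to dequantize + cuBLAS.

```cpp
#include <cuda_runtime.h>

// Returns true when the MMQ path may be used on this device (sketch only).
static bool can_use_mmq_iq1_s_r4(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;                                    // be conservative on error
    }
    const int cc = 100 * prop.major + 10 * prop.minor;   // 750 == Turing
    return cc >= 750;                                    // older cards: dequantize + cuBLAS
}
```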

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 07:24:31 +03:00
Kawrakow
67b7ed49a4 Minor (~2%) iq2_ks TG performance improvement on CUDA (#468)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:33 +03:00
Kawrakow
b0bae0f0fa Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM

For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s

Also found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds (see the sketch at the end of this commit message).

* Experimenting with dequant + f16 GEMM on NEON

iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor
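
To illustrate the overflow note above, a small standalone example (assumptions: _Float16 support in a recent GCC/Clang; the values are made up, and this is not the repository kernel). Accumulating unscaled products in fp16 can exceed the fp16 maximum of ~65504, while folding the row scale in before the multiply-adds keeps the partial sums in range.

```cpp
#include <cstdio>

int main() {
    const float row_scale = 1.0f / 512.0f;            // hypothetical per-row scale
    _Float16 acc_raw    = (_Float16) 0.0f;
    _Float16 acc_scaled = (_Float16) 0.0f;
    for (int i = 0; i < 4096; ++i) {
        const float w = 100.0f;                       // dequantized trellis value
        const float x = 0.25f;                        // activation
        acc_raw    += (_Float16) w * (_Float16) x;                 // sum heads past 65504 -> inf
        acc_scaled += (_Float16) (w * row_scale) * (_Float16) x;   // scaled first, stays finite
    }
    std::printf("raw fp16 sum = %g, scaled fp16 sum (rescaled) = %g\n",
                (float) acc_raw, (float) acc_scaled / row_scale);
    return 0;
}
```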

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:24:05 +03:00
Kawrakow
8d8f32b994 Metal implementation for the trellis quants. (#475)
* iq2_kt: Metal dequantize

* iq2_kt: Metal GEMV

Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B

* iq3_kt: Metal dequantize

* iq3_kt: Metal GEMV

Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B.
Flipping signs is a costly affair.

* iq4_kt: Metal dequantize - getting NaNs

* iq4_kt: Metal GEMV - also not working

* iq4_kt: Metal still not working

* Disable iq4_kt on Metal for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 15:23:44 +03:00
Kawrakow
0dea2e8a81 NEON implementation for trellis quants (#471)
* iq2_kt: NEON implementation

* iq3_kt: NEON implementation

* iq4_kt: not working NEON implementation

* iq4_kt: NEON implementation

Have to use f32 arithmetic, else I get gibberish?
Correspondingly ridiculously slow.

* Cleanup

* iq4_kt: slightly faster TG on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-29 18:57:41 +03:00
Kawrakow
6989ca0249 CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462)
* CUDA: iq4_ks_r4 GEMV and GEMM

* CUDA: iq5_ks_r4 GEMV and GEMM

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-27 08:37:44 +03:00
Kawrakow
89728ab03c CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461)
* CUDA: iq4_k_r4 dequantize

* CUDA: iq4_k_r4 GEMV

~10% slower than iq4_k.

* CUDA: slightly faster iq4_k_r4 GEMV

* CUDA: slightly faster iq4_k_r4 GEMV

We are now within 3% of iq4_k

* CUDA: iq5_k_r4 dequantize

* CUDA: iq5_k_r4 GEMV

~3% slower than iq5_k.

* CUDA: iq3_k_r4 dequantize

* CUDA: iq3_k_r4 GEMV

* CUDA: slightly faster iq3_k_r4 GEMV

* CUDA: iq2_k_r4 GEMV

* CUDA: faster iq2_k_r4 GEMV

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-26 19:34:54 +03:00
Nexes the Elder
86170b2048 Legacy quants conversion schemes in convert_hf_to_gguf.py (#449)
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions for generating an imatrix file.

`Q4_0`,`Q4_1` here use embeddings, output, attn_k and attn_v in q5_0.
`Q5_0`,`Q5_1` here use embeddings, output, attn_k and attn_v in q8_0.

Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022
Original author: @chentyjpm

Also fixes 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
2025-05-24 11:49:10 +03:00
Kawrakow
82645c4be7 Faster IQ3_KT and IQ4_KT (#453)
* Somewhat faster iq3_kt (AVX2)

* Cleanup

* Slightly faster iq4_kt

* Slightly faster iq4_kt

PP is now almost 50% better than original, TG is ~20% better

* Cleanup

* Very slightly faster iq4_kt TG

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-24 11:48:52 +03:00
Kawrakow
2994447021 Fix bug in MMVQ kernel (#446)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 18:25:11 +03:00
Kawrakow
2440eca319 Fix MSVC compilation (#448)
* Fix MSVC compilation

* MSVC cannot capture constexpr in lambdas

* Arghhh

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 16:46:27 +03:00
Andrew Chan
25d34e3d2f Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (a sketch of this
cluster-restricted search appears at the end of this commit message).

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
gets 112 t/s vs 145 t/s via MMVQ. Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement in quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files
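
A hedged sketch of the cluster-restricted search referenced above (the data layout and distance metric are illustrative, not the repository's quantizer): pick the nearest cluster centroid first, then scan only that cluster's codebook points.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

constexpr int kGroup = 8;   // codebook points are groups of 8 weights

struct Cluster {
    std::vector<float>              centroid;   // kGroup values
    std::vector<std::vector<float>> points;     // codebook points assigned here
};

static float dist2(const std::vector<float> & a, const std::vector<float> & b) {
    float d = 0.0f;
    for (int i = 0; i < kGroup; ++i) { const float t = a[i] - b[i]; d += t * t; }
    return d;
}

// Returns (cluster index, point index) of the best match for x.
static std::pair<size_t, size_t> find_closest(const std::vector<Cluster> & clusters,
                                              const std::vector<float> & x) {
    size_t best_c = 0;
    float  best   = std::numeric_limits<float>::max();
    for (size_t c = 0; c < clusters.size(); ++c) {        // cheap pass over centroids
        const float d = dist2(clusters[c].centroid, x);
        if (d < best) { best = d; best_c = c; }
    }
    size_t best_p = 0;
    best = std::numeric_limits<float>::max();
    const auto & pts = clusters[best_c].points;           // full search only within one cluster
    for (size_t p = 0; p < pts.size(); ++p) {
        const float d = dist2(pts[p], x);
        if (d < best) { best = d; best_p = p; }
    }
    return {best_c, best_p};
}
```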

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
Kawrakow
1683e8e720 Refactor iqk_mul_mat.cpp (#435)
* Refactor iqk: WIP

* Refactor iqk: Factor out float GEMM (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

* Refactor iqk: Factor out GEMM for repacked legacy quants

* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

* Refactor iqk: Factor out GEMM for repacked i-quants

* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

* Refactor iqk: factor out 1-bit quants (NEON)

* Refactor iqk: factor out k-quants (NEON)

* Refactor iqk: factor out floats (NEON)

* Also iq4_xs belongs to k-quants

* Refactor iqk: factor out iqk quants (NEON)

* Refactor iqk: factor out legacy quants (NEON)

* Refactor iqk: factor out repacked legacy quants (NEON)

* Refactor iqk: factor out repacked k-quants (NEON)

* Refactor iqk: factor out repacked iqk quants (NEON)

* Refactor iqk: GEMM kernels are refactored on NEON

* Refactor iqk: FA compiles

Whether it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X

* Refactor iqk: FA refactored (Zen4)

Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.

* Adding forgotten file

* Most helpers don't need to be templates

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 seconds on the Ryzen-5975WX

* Fix bf16

* Refactor iqk: FA refactored (NEON)

* Forgotten MMQ ref and typo (#431)

* Adding forgotten iq5_k_r4

* Fix iq4_k_r4 on NEON

* Fix iq4_ks on NEON

It was broken before the refactoring (the shifts were not correctly
applied).

* Fix q8_0 on NEON

* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
2025-05-22 10:05:51 +03:00
Kawrakow
dcc5ab31f1 Bug fixes from mainline (#439)
* Add __syncthreads() to the new FA kernel

* Clearing padding

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-20 17:03:14 +03:00
Nexes the Elder
e19ecd296b Forgotten MMQ ref and typo (#431) 2025-05-18 17:36:41 +03:00
Kawrakow
7e0ac477b8 Option to enable/disable the IQK CPU FA kernels (#429)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 11:21:58 +03:00
Kawrakow
634da2f0c9 Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428)
* Zen4: faster PP for iq4_ks and iq5_ks

* Zen4: faster PP for iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 10:42:33 +03:00
Kawrakow
db111c91ee IQ5_KS_R4: row-interleaved IQ5_KS (#426)
* iq5_ks_r4: basics

* iq5_ks_r4: Zen4 works

* iq5_ks_r4: AVX2 works

* iq5_ks_r4: NEON

* Fix iq5_ks on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
e31ba05fcd Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427)
* Fix IQ4_K on AVX2

* Fix IQ4_KS on AVX2

* Fix IQ5_K on AVX2

* Fix IQ6_K on AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 17:25:15 +03:00
Kawrakow
06532ebd0e Adding forgotten template instance for iq5_ks (#424)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:50:15 +03:00
Kawrakow
90e53a0b8b Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
17d721820a Fix standard attention on the CPU (#421)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
5e31a7df43 CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)
* MMQ for iq2_k

* This works

* MMQ for iq3_k

* MMQ for iq2_ks

* Fix iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:15:08 +03:00
Kawrakow
51db1bf2d2 CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)
* MMQ for iq4_k: WIP (not working)

* MMQ for iq4_k: working now

* MMQ for iq5_k

* Cleanup

* MMQ for iq5_k: slightly faster

* MMQ for iq6_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 14:04:11 +03:00
Kawrakow
fba62d61c0 Fix SER (CUDA) (#416)
* Fixing SER bugs

* Cleanup

* This seems to fix it.

* This seems to work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 07:29:28 +03:00