ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-24 15:14:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	6ba96c8b33	For now have only iq4_kt use the new trellis	2025-06-18 15:34:27 +03:00
Iwan Kawrakow	b5524af7a4	New iq4_kt: CUDA MMQ	2025-06-18 15:34:27 +03:00
Iwan Kawrakow	6d6e6e39c9	New iq4_kt: CUDA MMVQ	2025-06-18 15:34:27 +03:00
Iwan Kawrakow	de0b38dcdc	Something is not working with the AVX2 dot product	2025-06-18 15:34:27 +03:00
Iwan Kawrakow	e558992f0c	New iq4_kt trellis The new trellis generates int8_t values via sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126. CUDA dequantize works. AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B. PPL is on par or even slightly lower than original QTIP trellis.	2025-06-18 15:34:25 +03:00
Kawrakow	c410cc72bb	Much faster CPU prompt processing (part 3) (#534 ) * Repack q4_0 and q8_0 to q8_0_R8 q8_0 is fine, but I observe a very significant PPL increase for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit scale conversions. * Change q8_2_x4 to store int16_t sums With that q4_0 now works. I need to check all quants that use q8_2_x4! * q5_0 and use a dequantizing template * q6_0 129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s. * iq4_nl 137 t/s -> 293 t/s. iq4_nl is at 251 t/s. * q4_1: 135 t/s -> 262 t/s * q5_1: 125 t/s -> 253 t/s * iq3_xs 178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s. * q2_K 202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-18 15:30:56 +03:00
Kawrakow	dc96820ddb	Much faster CPU prompt processing (part 2) (#533 ) * iq4_ks 203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s. * iq4_k 175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s. PPL is actually lower! * iq5_ks 180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s. PPL is actually lower - 7.4160 vs 7.4494 for LlaMA-3.1-8B-Instruct * iq5_k - accuracy loss is too big * iq5_k - there was a bug with the shifts ...and that's why PPL was so high. It is also high on main. This fixes it. * iq6_k 148 t/s -> 350 t/s. There is no iq6_k_r4 PPL is actually lower because we have a bug in the existing implementation! * iq3_k 169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s. * iq2_k 190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s. * iq2_ks 200 t/s -> 367 t/s. There is no iq2_ks_r4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-18 07:29:33 +03:00
Kawrakow	8b3002bba2	Send [DONE] for OAI compatibility (#470 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-17 10:32:53 +03:00
Kawrakow	0f8f8b32e2	Much faster CPU prompt processing (part 1) (#531 ) * q6_K dequantizing GEMM * Much easier: just use different vec_dot types! * WIP * Finally q6_K x q8_2_x4 dot product works * Very slightly better * We don't need the changes in ggml.c * Fix AVX2 * iq2_xs * Fix AVX2 * iq2_s * q3_K * Fix q8_k_r8 on Zen4 * q3_K: repack to q8_k_r8 instead of q8_0_r8 With that we hit 360 t/s for LlaMA-3.1-8B on a Ryzen-7950X. q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs ~7% of the time taken by the actual GEMM. * q3_K: don't scale when all quants in a block are <= 127 when repacking * iq2_s: repack to q8_k_r8 instead of q8_0_r8 * iq2_xs: repack to q8_k_r8 * WIP * iq2_xs: repack to q8_k_r8 * iq3_xxs: repack to q8_k_r8 * iq3_s: use q8_k_r8 * iq1_s: repack to q8_k_r8 * iq1_m: repack to q8_k_r8 * iq1_m: slightly faster * Slightly faster --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-17 07:12:48 +03:00
Kawrakow	6fc5bbb657	Call iqk_convert_repack in MoE GEMM (#528 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-14 05:52:46 +03:00
Kawrakow	066ed4fd11	Faster CPU prompt processing for Q4_K and Q5_K (#525 ) * q4_K: dequantize to q8_1_r8 for batch >= 32 We get 268 t/s, up from 186 t/s. * q4_K: GEMM with q8_2_X4 * q5_K: GEMM with q8_2_X4 and repack to q8_1_r8 * Remove the scales, they are not needed --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-13 07:58:15 +03:00
saood06	f72983f7fe	Update News section of readme (#510 ) * Convert existing News to new format * Update with new ones * Add more links and minor fix * more minor fixes * requested changes * Add old PRs * Add more old PRs * Add all IQK quants	2025-06-13 07:56:40 +03:00
Kawrakow	7a882f0b63	Perhaps a slightly better version for IQ2_XXS, IQ3_XXS, IQ3_S GEMV (#524 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-13 07:55:57 +03:00
Kawrakow	b57bd8658b	Better strategy for GPU offload (#520 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-12 19:25:11 +03:00
firecoperana	7b1a3eece7	Add top n sigma sampler and other webui fix (#512 ) Co-authored-by: firecoperana <firecoperana>	2025-06-12 08:19:26 +03:00
Kawrakow	4fc3cb4a47	iq3_s: much faster GEMM via repacking to q8_0_r8 (#518 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-12 08:16:12 +03:00
Kawrakow	3f54b49786	Faster iq1_s GEMM via repacking to Q8_0_R8 (#517 ) TG is slightly faster too - 24.4 vs 23.1 t/s on the Ryzen-5975WX Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 15:01:34 +03:00
Kawrakow	69af3f5990	Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2) (#516 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 13:05:26 +03:00
Kawrakow	e56061fa12	IQ2_XXS: much faster CPU prompt processing (#515 ) * Much faster iq2_xxs GEMM PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main. * iq2_xxs: q8_2_x4 GEMM * iq2_xxs: use template for q8_2_x4 GEMM * Fix AVX2 * Cleanup * NEON is not working yet, so still use Q8_K GEMM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-11 11:12:30 +03:00
Gaolingx	3c1f2c68fd	Fix Compile error (C2668) (#508 ) * cmake: force MSVC compiler charset to utf-8 * build: apply MSVC /bigobj option to c/cpp files only * Update CMakeLists.txt * Fix Compile error (C2668) * revert hsum_float_8x8	2025-06-10 08:30:17 +03:00
saood06	fa90a9864a	Docs update (#509 ) * use npm as deps manager and vite as bundler * update XTC docs --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-06-09 05:32:03 -05:00
firecoperana	58f08e4385	Fix non rpc build error (#506 ) * Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - Type Validation: `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - Bounds Checks: Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - Size Checks: Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. 
Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - Error Propagation: - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. 
Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. 
* fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tcp no delay for rpc * add back webui * fix rpc function not found error --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>	2025-06-08 17:27:00 +03:00
Iwan Kawrakow	1eabdb420b	Revert "Rpc improvement (#480 )" This reverts commit `8a5f8573ae`.	2025-06-08 14:49:50 +03:00
firecoperana	8a5f8573ae	Rpc improvement (#480 ) * Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - Type Validation: `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - Bounds Checks: Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - Size Checks: Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. 
Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - Error Propagation: - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. 
Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. 
* fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tcp no delay for rpc * add back webui --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>	2025-06-08 14:43:21 +03:00
Kawrakow	63ef0a392b	Update AUTHORS	2025-06-08 14:41:17 +03:00
firecoperana	df170c83a5	Webui improvement (#481 ) * update webui * add token/s in webui * add webui files * fix webui first message disappear in some browser * add missing html files --------- Co-authored-by: firecoperana <firecoperana>	2025-06-08 14:38:47 +03:00
saood06	9e567e385a	Add an endpoint that lists all the saved prompt caches to server (#502 )	2025-06-07 00:22:56 -05:00
Kawrakow	8c1d5a2033	Fix #499 (#501 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-07 08:19:18 +03:00
saood06	ffd87f282e	Make prompt cache saving and restoring MLA aware (#497 ) * Remove kv_l, kvt_l and just use k_l and v_l * Hopefully take care of missing V cache (MLA) * Fix save and restore when there is no V cache * Fix double print * Update write_kv_cache_data and read_kv_cache_data to be MLA aware --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-06 11:33:47 +03:00
Kawrakow	eded4e20d4	IQ1_M_R4 CUDA implementation (#494 ) * iq1_m_r4: CUDA dequantize * iq1_m_r4: CUDA dequantize --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-05 19:13:51 +03:00
Kawrakow	8ffad187ab	MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493 ) * MMQ for iq4_ks_r4 * MMQ for iq5_ks_r4 * Add forgotten file * Another forgotten file --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-05 08:31:20 +03:00
Kawrakow	0b10f7418f	Faster CPU prompt processing for Trellis quants and MoE models (#488 ) * Also do the dequantize approach for mul_mat_id * Also do the dequantize approach for iqk_moe_fused_up_gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-05 08:30:35 +03:00
Kawrakow	7e79665a31	CUDA implementation for IQ1_S_R4 (#492 ) * iq1_s_r4: CUDA dequantize * iq1_s_r4: CUDA GEMV * iq1_s_r4: MMQ on CUDA Requires Turing or better (will fall back to dequantize+cuBLAS on older cards). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-05 07:24:31 +03:00
Kawrakow	f6d5fbdc57	Adding top-n-sigma sampler (#489 ) * Adding top-n-sigma sampler * Fix typos in XTC PR * Update README.md for main and server * More README * More README --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-03 17:35:09 +03:00
Kawrakow	ccb265c016	Adding the XTC sampler (#486 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-03 11:32:03 +03:00
Nexes the Elder	4f8b05a0d7	convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483 ) * Direct conversion from fp16 to Q6_0 * forgotten comma * More precise infos	2025-06-03 09:30:30 +03:00
Kawrakow	7a8abe29f7	Minor (~2%) iq2_ks TG performance improvement on CUDA (#468 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-01 15:24:33 +03:00
Kawrakow	3df1a3a44d	Trellis quants: faster CPU prompt processing (#482 ) * Experimenting with dequant + f32 GEMM For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s. * Experimenting with dequant + f32 GEMM iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s * Experimenting with dequant + f16 GEMM on NEON iq2_kt: PP512 = 79 t/s from 42 t/s iq3_kt: PP512 = 81 t/s from 35 t/s Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after multiplying with the row scale before doing the multiply-adds. * Experimenting with dequant + f16 GEMM on NEON iq4_kt: PP512 = 86 t/s from 29 t/s * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-01 15:24:05 +03:00
Kawrakow	35374bc7e8	Metal implementation for the trellis quants. (#475 ) * iq2_kt: Metal dequantize * iq2_kt: Metal GEMV Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B * iq3_kt: Metal dequantize * iq3_kt: Metal GEMV Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B. Flipping signs is a costly affair. * iq4_kt: Metal dequantize - getting NaNs * iq4_kt: Metal GEMV - also not working * iq4_kt: Metal still not working * Disable iq4_kt on Metal for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-01 15:23:44 +03:00
Nexes the Elder	7239ce6b35	forgotten refs and typo (#478 )	2025-05-31 07:36:50 +03:00
Kawrakow	2cf12eb12d	Replace MLA-specific KV cache with the standard KV cache (#469 ) * Remove kv_l, kvt_l and just use k_l and v_l * Hopefully take care of missing V cache (MLA) * Replace MLA-specific KV cache with the standard KV cache V2 (#473) * Fix save and restore when there is no V cache * Fix double print --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: saood06 <saood05@gmail.com>	2025-05-30 11:08:17 +03:00
Kawrakow	1eac9e8487	NEON implementation for trellis quants (#471 ) * iq2_kt: NEON implementation * iq3_kt: NEON implementation * iq4_kt: not working NEON implementation * iq4_kt: NEON implementation Have to use f32 arithmetic else I get gibberish? Correspondingly ridiculously slow. * Cleanup * iq4_kt: slightly faster TG on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-29 18:57:41 +03:00
saood06	ccd6d9cdf6	set cache_prompt default to true (#465 )	2025-05-28 08:18:25 +03:00
Kawrakow	0976467845	CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462 ) * CUDA: iq4_ks_r4 GEMV and GEMM * CUDA: iq5_ks_r4 GEMV and GEMM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-27 08:37:44 +03:00
Kawrakow	1429291326	CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461 ) * CUDA: iq4_k_r4 dequantize * CUDA: iq4_k_r4 GEMV ~10% slower than iq4_k. * CUDA: slightly faster iq4_k_r4 GEMV * CUDA: slightly faster iq4_k_r4 GEMV We are now within 3% of iq4_k * CUDA: iq5_k_r4 dequantize * CUDA: iq5_k_r4 GEMV ~3% slower than iq5_k. * CUDA: iq3_k_r4 dequantize * CUDA: iq3_k_r4 GEMV * CUDA: slightly faster iq3_k_r4 GEMV * CUDA: iq2_k_r4 GEMV * CUDA: faster iq2_k_r4 GEMV --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-26 19:34:54 +03:00
Kawrakow	24c010b391	Add missing gguf-py constants (#458 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-25 09:55:36 +03:00
Nexes the Elder	c7ecd4e23a	Legacy quants conversion schemes in convert_hf_to_gguf.py (#449 ) * Legacy quants conversion schemes in convert_hf_to_gguf.py This, notably in order to make smaller conversions to generate an iMatrix file. `Q4_0`,`Q4_1` are here using embeddings, output, attn_k and attn_v in q5_0. `Q5_0`,`Q5_1` are here using embeddings, output, attn_k and attn_v in q8_0. Adapted from the following llama.cpp mainline PR : https://github.com/ggml-org/llama.cpp/pull/9022 Original author @chentyjpm Also, 2 forgotten mentions of FTYPE IQ3_KL in llama.cpp file. * forgotten IQ5_KS case mention	2025-05-24 11:49:10 +03:00
Kawrakow	a2c42f9985	Faster IQ3_KT and IQ4_KT (#453 ) * Somewhat faster iq3_kt (AVX2) * Cleanup * Slightly faster iq4_kt * Slightly faster iq4_kt PP is now almost 50% better than original, TG is ~20% better * Cleanup * Very slightly faster iq4_kt TG --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-24 11:48:52 +03:00
Kawrakow	9fb82af3a8	Fix bug in MMVQ kernel (#446 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-23 18:25:11 +03:00
Kawrakow	6b12c2e7e8	Fix MSVC compilation (#448 ) * Fix MSVC compilation * MSVC cannot capture constexpr in lambdas * Arghhh --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-23 16:46:27 +03:00

3758 Commits