Commit Graph

4007 Commits

Author SHA1 Message Date
Kawrakow
2fb54d232f Fuse Q and K RoPE (#980)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 09:08:42 +01:00
hksdpc255
da5de88073 common: Generalized XML-style tool-call parsing with streaming support (#958)
* port upstream https://github.com/ggml-org/llama.cpp/pull/16932

* Add fixed chat templates.

* fix grammar when tool have no argument

* Insert additional stops for Kimi-K2

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* chat: Fix streaming parser for granite models (#15682)

* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser

* common : Fix corrupted memory error on json grammar initialization (#16038)

Initalizing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads
as can be seen in the asan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix:🧵:Thread:🆕:h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING

* common : fix reasoning before forced tool call via tool_choice = required (#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)

---------

Co-authored-by: Alde Rojas <hello@alde.dev>

* Try fix Jinja template for GLM

* Improve Kimi-K2 chat template

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com>
Co-authored-by: David Ribeiro Alves <davidralves@gmail.com>
Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2025-11-18 15:29:58 +01:00
Iwan Kawrakow
eb3ce7549b Minor 2025-11-18 08:55:36 +00:00
Kawrakow
25b9ac296f Add usage for -vq, --validate-quants (#977)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-17 16:02:14 +01:00
Kawrakow
294aec2bc2 Add mqkv and rcache for Gemma3 (#972)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 19:10:41 +02:00
Kawrakow
a8b3c8ae73 Fix rtr when mqkv is enabled (#971)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 16:51:45 +02:00
Kawrakow
17d618a6dd Add ability to use RoPE cache to DeepSeek models (#970)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 16:50:02 +02:00
Kawrakow
3008fdf0b6 Allow distinct output tensor for Gemma models (#969)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 12:12:41 +02:00
Kawrakow
0388a72d5d Fix RoPE cache on multi-GPU setup (#966)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 11:50:48 +02:00
Kawrakow
085a4d0cb9 Fix ggml_cuda_fattn_is_supported (#968)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 11:50:29 +02:00
firecoperana
fcb0b472dd Fix kv cache save and load for GLM model (#965)
Co-authored-by: firecoperana <firecoperana>
2025-11-15 17:04:16 +02:00
firecoperana
f3db96e539 Fix compiler warnings (#963)
* Fix changes meaning warnings

* A couple of more warnings and formatting

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-15 07:07:15 +02:00
firecoperana
0cb6dcc8c8 server: cache prompt to host memory (#954)
* server : host-memory prompt caching

change similarity calculation and prompt save conditions

Remove unneeded token limit

rename variable

Separate prompt save and load logic

change default values

change log

remove truncate prompt logic

* add description

* bug fixes

* remove token limit in init

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-14 18:40:13 +02:00
Kawrakow
2642f48921 Add --chat-template-file to usage (#959)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 11:08:58 +02:00
Kawrakow
9e2b21fbc9 DeepSeek: enable option to merge Q and K tensors (#941)
* Merge Q and K for DeepSeek

* Formatting

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 08:23:04 +02:00
firecoperana
ba1a753106 Fix cuda init error in rpc (#957)
Co-authored-by: firecoperana <firecoperana>
2025-11-14 06:59:54 +02:00
Kawrakow
716d318fd8 Fix fused up+gate when mmq is not supported (#952)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:59:27 +02:00
Kawrakow
459bf5812d Add missing AVX512 operators for MSVC (#948)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:51 +02:00
Kawrakow
41bde27541 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Iwan Kawrakow
be1a8cb9d8 Fix flash attention long argument for mainloine compatibility 2025-11-13 19:22:16 +02:00
Kawrakow
f4202c812e Fix repacked legacy quants (#951)
* Fix q5_0_r4

The issue waqs in the tail part. As almost all models have tensor
rows that are multiple of 128, that part was never triggered in testing.
But ithe gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.

* Fix q6_0_r4

Same fix as q5_0_r4

* Fix q4_0_r8

* Fix q5_0_r4 and q6_0_r4 also on Zen4

* Fix q4_0_r8 also on Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 15:35:37 +02:00
Kawrakow
38abd0e289 Set default MLA to 3 also in llama-bench (#949)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 09:52:06 +02:00
Kawrakow
bbc127d10e Add mainline compatible FA command line option (#944)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 08:55:33 +02:00
Kawrakow
5f73351638 Fix rope_norm_fast_cuda (#945)
* Fix rope_norm_fast_cuda

* One more

* Also fix mrope and vision

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 08:54:37 +02:00
Kawrakow
8a8de91a42 Set mla=3 by default (#943)
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-12 11:00:58 +02:00
Kawrakow
9ecfee6c03 Enable fusion by default (#939)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:35:48 +02:00
Kawrakow
463c694676 Opt from #880 also for iqk cuda gemv (#938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:01:34 +02:00
Kawrakow
5e7f6711e4 Add --n-cpu-moe to llama_bench (#937)
* Add --n-cpu-moe to llama_banch

* Add usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:44:59 +02:00
Kawrakow
1e6f8ff89a Add rcache to llama-bench (#936)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:06:18 +02:00
Kawrakow
489554bc11 Minor: remove unnecesssary calls to build_inp_out_ids (#935)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 17:38:46 +02:00
Kawrakow
e4145c013f Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
Kawrakow
a313b71bf8 DeepSeek FA optimizations (#929)
* Use new-new-mma also for MLA=3, and use mask bounds

This gives us ~25% better PP at 32k tokens compared to main

* This seems better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:55:30 +02:00
Kawrakow
7747000f3b DeepSeek TG optimizations for TG (#928)
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1

Combined effect: about +2% in TG performance with full GPU offload

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:52:07 +02:00
firecoperana
9dfbc69aee Server: Add --draft-params to set draft model parameter via command line args (#932)
* Add command line argument for draft model

* Remove second context of draft model

* Format print

* print usage if parsing -draft fails

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-10 09:51:07 +02:00
Kawrakow
ad688e10f4 Use fused gemv+add only for TG (#933)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 08:34:24 +02:00
Kawrakow
db3bed2461 Make biased gemv fusion optional (#931)
* Make biased gemv fusion optional

* Fix one path through gemv fusion

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 19:09:47 +02:00
Iwan Kawrakow
0db683e478 Fix compiler warning 2025-11-09 14:35:59 +02:00
Lennart Lopin
1da9c218b0 Add ARM Grace Blackwell (NVIDIA DGX Spark) support (#922)
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.

Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions

Build requirements for ARM:
  cmake .. -DGGML_CUDA=ON \
           -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
           -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"

Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory

Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
2025-11-09 14:22:40 +02:00
firecoperana
ff4c1c6eb3 server: bug fix for preserved_tokens not preserved in process_token (#926)
Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:29 +02:00
firecoperana
03235a4b11 Fix embedding missing, CORS and crash using verbose in server (#924)
* server: fix crash when prompt has image and is too long

* server: fix CORS

* server: fix empty result for embedding

* change error message to truncate prompt

* server: fix slot id for save and load state

* bug fix

* server: update slot similarity to handle mtmd

* server: quick hack to calculate number of token processed with image

* server: fix out of range error when detokenizing prompt under verbose

* Add back Access-Control-Allow-Origin

* Server: Add prompt tokens in embedding results

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:03 +02:00
Kawrakow
9207a48ab2 CUDA MoE improvements (#923)
* Use mmq_id in mul_mat_id

* Better

* Also use it in the fused up+gate op

* Better -no-fmoe TG on CUDA

Still much slower than -fmoe, but abot 20-25% faster than what
we had before.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 11:34:33 +02:00
Kawrakow
e5fc02c71a CUDA: fuse copies to K and V cache (#921)
* Fuse copies to K- and V-cache on CUDA

* Adapt to latest main

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 18:13:58 +02:00
Kawrakow
55576c93b2 Adopt fix from mainline PR 17089 (#920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 07:44:20 +02:00
Kawrakow
d62e8c51ed Disable add + fused_rms_norm fusion (#916)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 19:38:18 +02:00
Kawrakow
12b9902395 Fix PPL increase caused by mmq_id (#913)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 18:58:09 +02:00
Nexes the Elder
6a805c73b4 More informative PPL readout line (#914)
* More informative PPL readout line

* trailing whitespace..
2025-11-07 16:41:24 +02:00
Kawrakow
9d0b834405 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
665434e5ec Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-06 19:07:46 +02:00
firecoperana
0378f38c27 model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Iwan Kawrakow
575e2c2821 Fix compiler warning 2025-11-06 07:12:07 +02:00