Commit Graph

  • 02676a999c Better -no-fmoe TG on CUDA ik/use_mmq_id_for_moe Iwan Kawrakow 2025-11-08 16:13:14 +02:00
  • 8f0fad3109 Also use it in the fused up+gate op Iwan Kawrakow 2025-11-08 12:48:01 +02:00
  • 675f36787d Better Iwan Kawrakow 2025-11-08 12:20:54 +02:00
  • b96d7df1d4 Use mmq_id in mul_mat_id Iwan Kawrakow 2025-11-08 12:02:07 +02:00
  • 54848a4c7e Adapt to latest main ik/fuse_kvcache_copy Iwan Kawrakow 2025-11-07 09:19:45 +02:00
  • 906a3bffd9 Fuse copies to K- and V-cache on CUDA Iwan Kawrakow 2025-11-05 15:21:32 +02:00
  • 3614c4f098 Adopt fix from mainline PR 17089 (#920) Kawrakow 2025-11-08 07:44:20 +02:00
  • 55576c93b2 Adopt fix from mainline PR 17089 (#920) Kawrakow 2025-11-08 07:44:20 +02:00
  • fc31862add Adopt fix from mainline PR 17089 ik/another_mmq_id_fix Iwan Kawrakow 2025-11-08 07:41:28 +02:00
  • d0850dccc8 Disable add + fused_rms_norm fusion (#916) Kawrakow 2025-11-07 19:38:18 +02:00
  • d62e8c51ed Disable add + fused_rms_norm fusion (#916) Kawrakow 2025-11-07 19:38:18 +02:00
  • 1c31b25380 Fix PPL increase caused by mmq_id (#913) Kawrakow 2025-11-07 18:58:09 +02:00
  • 12b9902395 Fix PPL increase caused by mmq_id (#913) Kawrakow 2025-11-07 18:58:09 +02:00
  • 3549305b7a Disable add + fused_rms_norm fusion ik/disable_add_fused_rms Iwan Kawrakow 2025-11-07 18:47:52 +02:00
  • f9a411e5db More informative PPL readout line (#914) Nexes the Elder 2025-11-07 15:41:24 +01:00
  • 6a805c73b4 More informative PPL readout line (#914) Nexes the Elder 2025-11-07 15:41:24 +01:00
  • e49cfff302 Fix PPL increase caused by mmq_id ik/fix_mmq_id Iwan Kawrakow 2025-11-07 13:56:24 +02:00
  • 532a05e466 CUDA: set compute parameters via command line arguments (#910) Kawrakow 2025-11-07 07:11:23 +02:00
  • 9d0b834405 CUDA: set compute parameters via command line arguments (#910) Kawrakow 2025-11-07 07:11:23 +02:00
  • 49befdd4fb Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911) Kawrakow 2025-11-06 19:07:46 +02:00
  • 665434e5ec Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911) Kawrakow 2025-11-06 19:07:46 +02:00
  • 4fe0705abe Fix iqk_mul_mat when number of rows is not multiple of repack rows ik/fix_iqk_for_strange_numrows Iwan Kawrakow 2025-11-06 19:00:25 +02:00
  • e15a215e6b model : Port Minimax M2 from mainline (#907) firecoperana 2025-11-06 16:09:24 +00:00
  • 0378f38c27 model : Port Minimax M2 from mainline (#907) firecoperana 2025-11-06 16:09:24 +00:00
  • 06e9fcd4d8 Also llama-bench ik/cuda_params Iwan Kawrakow 2025-11-06 18:08:03 +02:00
  • ccabf5713d cuda: set compute parameters via command line arguments Iwan Kawrakow 2025-11-06 15:55:44 +02:00
  • 66ef68bc14 Fix compiler warning Kawrakow 2025-11-06 07:12:07 +02:00
  • 575e2c2821 Fix compiler warning Iwan Kawrakow 2025-11-06 07:12:07 +02:00
  • 18f5a6caef Bug fixes for completions and prompt caching in server (#906) firecoperana 2025-11-06 05:10:51 +00:00
  • d0dfbf9771 Bug fixes for completions and prompt caching in server (#906) firecoperana 2025-11-06 05:10:51 +00:00
  • 50f95d7bf3 Disable CUDA fusion by default for now (#903) Kawrakow 2025-11-05 10:58:12 +02:00
  • 320fc606cd Disable CUDA fusion by default for now (#903) Kawrakow 2025-11-05 10:58:12 +02:00
  • bdfa4bbe29 Disable CUDA fusion by default for now ik/disable_fusion_by_default Iwan Kawrakow 2025-11-05 10:56:00 +02:00
  • cb30f8e057 Merge Q and K into a single tensor (#892) Kawrakow 2025-11-05 10:54:36 +02:00
  • 1a3aaa33c1 Merge Q and K into a single tensor (#892) Kawrakow 2025-11-05 10:54:36 +02:00
  • e68f50be9a Allow quantization of ffn_gate_inp (#896) Kawrakow 2025-11-05 10:44:32 +02:00
  • abb966eba1 Allow quantization of ffn_gate_inp (#896) Kawrakow 2025-11-05 10:44:32 +02:00
  • 7978f04996 Add vision support in llama-server (#901) firecoperana 2025-11-05 08:43:46 +00:00
  • 15159a87d4 Add vision support in llama-server (#901) firecoperana 2025-11-05 08:43:46 +00:00
  • 92607d44c4 Much better CPU TG performance at long context for GLM-4.5 (#899) Kawrakow 2025-11-05 10:20:26 +02:00
  • 5b38d431ac Much better CPU TG performance at long context for GLM-4.5 (#899) Kawrakow 2025-11-05 10:20:26 +02:00
  • 98357d9aa5 Adding cmake option to disable CUDA fusion (#902) Kawrakow 2025-11-05 07:09:27 +02:00
  • 85035d606c Adding cmake option to disable CUDA fusion (#902) Kawrakow 2025-11-05 07:09:27 +02:00
  • 5aa5ebcb97 Adding cmake option to disable CUDA fusion ik/option_to_disable_cuda_fusion Iwan Kawrakow 2025-11-05 06:20:36 +02:00
  • 11feb49562 Fix compilation failure after merging #883 (#900) Kawrakow 2025-11-04 19:28:52 +02:00
  • dfa9689e72 Fix compilation failure after merging #883 (#900) Kawrakow 2025-11-04 19:28:52 +02:00
  • ba593f3ba6 Fix compilation failure after merging #883 ik/fix_after_883 Iwan Kawrakow 2025-11-04 19:27:52 +02:00
  • 86597623a5 Port of Qwen3-VL support from mainline (#883) Thireus ☠ 2025-11-04 17:20:54 +00:00
  • 5536e99d42 Port of Qwen3-VL support from mainline (#883) Thireus ☠ 2025-11-04 17:20:54 +00:00
  • 202924d9fe Much better CPU TG performance at long context for GLM-4.5 ik/cpu_fa_tg_glm4.5 Iwan Kawrakow 2025-11-04 17:47:36 +02:00
  • efcb5f9d9e sweep-bench: be able to set TG tokens via -n (#897) Kawrakow 2025-11-04 14:39:30 +02:00
  • 7e956a32ce sweep-bench: be able to set TG tokens via -n (#897) Kawrakow 2025-11-04 14:39:30 +02:00
  • 42c34f5c49 sweep-bench: be able to set TG tokens via -n ik/sweep_bench_n_predict Iwan Kawrakow 2025-11-04 12:55:04 +02:00
  • 04e57f4356 Allow quantization of ffn_gate_inp ik/quantize_ffn_gate_inp Iwan Kawrakow 2025-11-04 11:34:10 +02:00
  • 24f3cb644c Make V mul mat follow QK mul mat ik/merge_only_qk Iwan Kawrakow 2025-11-04 10:39:23 +02:00
  • 1280304e46 Merge remote-tracking branch 'origin/main' into ik/merge_only_qk Iwan Kawrakow 2025-11-04 10:30:05 +02:00
  • c23fda2103 Disable some fusion, RoPE cache off by default (#894) Kawrakow 2025-11-04 07:50:14 +02:00
  • cd8d0b0832 Disable some fusion, RoPE cache off by default (#894) Kawrakow 2025-11-04 07:50:14 +02:00
  • 8735931413 Minor ik/disable_some_fusion Iwan Kawrakow 2025-11-04 07:46:40 +02:00
  • 45601bc1bf Disable some fusion and make rope cache off by default Iwan Kawrakow 2025-11-04 07:42:55 +02:00
  • fb0d5a995c RoPE cache (#887) Kawrakow 2025-11-03 18:42:20 +02:00
  • 1cfd19862f RoPE cache (#887) Kawrakow 2025-11-03 18:42:20 +02:00
  • 57af463614 Fused fused_rms+fused_rms+rope+rope (without -mqkv) ik/rope_cache Iwan Kawrakow 2025-11-03 18:31:56 +02:00
  • 9f9866b710 Fused fused_rms+fused_rms+rope+rope (with -mqkv) Iwan Kawrakow 2025-11-03 18:21:42 +02:00
  • 48b02ccaa1 Merge Q and K into a single tensor Iwan Kawrakow 2025-11-03 14:37:03 +02:00
  • a7b427ffc0 Option to enable CUDA LTO ik/cuda_lto Iwan Kawrakow 2025-11-03 10:34:48 +02:00
  • 0dc705587a Add missing break after merge with main Iwan Kawrakow 2025-11-03 09:46:27 +02:00
  • d2f79beba4 Disable RoPE cache if rope type is not neox or norm Iwan Kawrakow 2025-11-03 08:28:26 +02:00
  • 525dda2e80 Add command line arg to disable rope cache Iwan Kawrakow 2025-11-03 08:20:03 +02:00
  • aa76ff2c9d Also qwen3 Iwan Kawrakow 2025-11-02 10:14:43 +02:00
  • 60d56fa2d0 WIP Iwan Kawrakow 2025-11-02 10:00:26 +02:00
  • 332c4d6680 Fused rms+rms+rope+rope (neox) - not working Iwan Kawrakow 2025-11-02 09:48:25 +02:00
  • 623d775929 Fused rope+rope (norm) Iwan Kawrakow 2025-11-02 07:21:17 +02:00
  • f5ac78de5c Fused rope+rope Iwan Kawrakow 2025-11-02 06:57:08 +02:00
  • ea97dc3a1c rope_cache: norm works Iwan Kawrakow 2025-11-01 18:31:38 +02:00
  • 209bf1d29c WIP Iwan Kawrakow 2025-11-01 18:17:49 +02:00
  • f2c4b3a8d1 cuda: neox works Iwan Kawrakow 2025-11-01 17:54:25 +02:00
  • 9a790a8905 Introducing rope cache Iwan Kawrakow 2025-11-01 15:58:11 +02:00
  • 846e736e85 cuda: add missing backwards RoPE op (#889) Kawrakow 2025-11-03 07:45:18 +02:00
  • d890b9fee0 cuda: add missing backwards RoPE op (#889) Kawrakow 2025-11-03 07:45:18 +02:00
  • 3b9ace5c65 cuda: add missing backwards RoPE op ik/cuda_rope_back Iwan Kawrakow 2025-11-03 07:37:01 +02:00
  • 37c4d19021 Compiler warning Kawrakow 2025-10-31 14:58:00 +02:00
  • 58922c23ca Compiler warning Iwan Kawrakow 2025-10-31 14:58:00 +02:00
  • 55a704b67a Fused Q and K fused_rms_norm for TG on CUDA (#882) Kawrakow 2025-10-31 14:41:28 +02:00
  • 8c8a7fb7c8 Fused Q and K fused_rms_norm for TG on CUDA (#882) Kawrakow 2025-10-31 14:41:28 +02:00
  • b58e81d48c Remove commented out code ik/fused_rms_rms Iwan Kawrakow 2025-10-31 14:30:27 +02:00
  • 76789b2d0f Merge remote-tracking branch 'origin/main' into ik/fused_rms_rms Iwan Kawrakow 2025-10-31 14:22:54 +02:00
  • cfb840379f Biased mmvq: minor optimization (#880) Kawrakow 2025-10-31 14:21:18 +02:00
  • fd3757d4ee Biased mmvq: minor optimization (#880) Kawrakow 2025-10-31 14:21:18 +02:00
  • a3bd0158f7 Disable pipeline parallel for tensor override or allocation failed (#879) firecoperana 2025-10-31 12:20:48 +00:00
  • c7dbe3f2c1 Disable pipeline parallel for tensor override or allocation failed (#879) firecoperana 2025-10-31 12:20:48 +00:00
  • 476e425d51 Fusing Q and K rms_norm for TG on CUDA Iwan Kawrakow 2025-10-30 16:55:31 +02:00
  • bb4752d019 Biased mmvq: minor optimization ik/biased_mmvq Iwan Kawrakow 2025-10-30 10:50:17 +02:00
  • 56fc5454ff Merge Q, K, V (#878) Kawrakow 2025-10-30 10:49:48 +02:00
  • 14760aaf46 Merge Q, K, V (#878) Kawrakow 2025-10-30 10:49:48 +02:00
  • 68e7698ae8 cohere2 - simplify graph building ik/merge_qkv Iwan Kawrakow 2025-10-30 08:17:55 +02:00
  • bba84f1c07 merge_qkv: simplify build_qwen3moe Iwan Kawrakow 2025-10-29 18:39:31 +02:00
  • 92517e74ad fix v1/chat/completions assistant prefill (#874) jarrodfeaks 2025-10-30 02:21:05 +11:00
  • c029d97492 fix v1/chat/completions assistant prefill (#874) jarrodfeaks 2025-10-30 02:21:05 +11:00
  • a2f3b08fbd merge_qkv: qwen3 (dense) Iwan Kawrakow 2025-10-29 15:59:38 +02:00