Commit Graph

  • cc8c0e1b49 More cleanup Saood Karim 2025-03-25 14:29:00 -05:00
  • 109f5c0cd8 Cleanup Saood Karim 2025-03-25 14:23:11 -05:00
  • c821129fcb More fix Saood Karim 2025-03-25 13:42:02 -05:00
  • 9b6b55f441 More fix Saood Karim 2025-03-25 12:16:43 -05:00
  • 8ab6b155d5 Fixes to make previous commits compile Saood Karim 2025-02-02 14:25:19 -06:00
  • e0101cfe5a NUMA-aware KV cache buffer type (experimental) Saood Karim 2025-02-02 13:18:33 -06:00
  • b307c1c375 llama-bench: enable having different number of threads for tg and pp (#284) Kawrakow 2025-03-25 16:31:17 +01:00
  • a22250df93 llama-bench: enable having different number of threads for tg and pp (#284) Kawrakow 2025-03-25 16:31:17 +01:00
  • c12a6f8558 Update sweep bench (deprecating .jsonl support) (#289) saood06 2025-03-25 10:14:44 -05:00
  • 279b7d3395 Update sweep bench (deprecating .jsonl support) (#289) saood06 2025-03-25 10:14:44 -05:00
  • 7a6f681daf Fix README.md s6/sweep_bench_update Saood Karim 2025-03-25 09:15:19 -05:00
  • 2fd035b43f Update sweep bench (deprecating .jsonl support) Saood Karim 2025-03-25 08:58:50 -05:00
  • daa3b00ccd Minor ik/deepseek_is_this_better Iwan Kawrakow 2025-03-25 09:04:45 +02:00
  • 6ef4954612 CUDA: better MoE implementation (#283) Kawrakow 2025-03-25 07:47:10 +01:00
  • 98a264a2ea CUDA: better MoE implementation (#283) Kawrakow 2025-03-25 07:47:10 +01:00
  • be46f3ef14 Is this better for DeepSeek-R1? Iwan Kawrakow 2025-03-24 21:18:06 +02:00
  • e3ebf3cbb9 Add -tgb to usage ik/llama_bench_tgb Iwan Kawrakow 2025-03-24 18:15:48 +02:00
  • 52fd0ac16a llama-bench: enable having different number of threads for tg and pp Iwan Kawrakow 2025-03-24 17:48:08 +02:00
  • 7f6980fa51 Also do it for non-fused mul_mat_id ik/cuda_better_moe Iwan Kawrakow 2025-03-24 15:47:29 +02:00
  • f1bef9046e Slightly better Iwan Kawrakow 2025-03-24 15:22:16 +02:00
  • fb8db62e5a Make fused MoE reproducible Iwan Kawrakow 2025-03-24 11:53:19 +02:00
  • a9a941b5b8 Improve DeepSeek batched processing speed (#282) Kawrakow 2025-03-23 17:10:52 +01:00
  • f9307d7907 Improve DeepSeek batched processing speed (#282) Kawrakow 2025-03-23 17:10:52 +01:00
  • ec4bc75f90 Revert the commented out section in iqk_mul_mat.cpp ik/better_batched_processing Iwan Kawrakow 2025-03-23 13:29:14 +02:00
  • d12f4a12aa Improve DeepSeek batched processing speed Iwan Kawrakow 2025-03-23 11:55:01 +02:00
  • 23ee1ac1b8 Attempt to improve FlashMLA on the CPU (#277) Kawrakow 2025-03-23 07:28:21 +01:00
  • 5a4855e61c Attempt to improve FlashMLA on the CPU (#277) Kawrakow 2025-03-23 07:28:21 +01:00
  • 79a105d8ab Test transparent huge pages on Linux (#278) Kawrakow 2025-03-23 07:24:43 +01:00
  • dd5ebd0e3d Test transparent huge pages on Linux (#278) Kawrakow 2025-03-23 07:24:43 +01:00
  • b608eeba06 Add -thp to llama-bench ik/test_thp Iwan Kawrakow 2025-03-22 19:58:38 +02:00
  • 37c48feb3e Native build option for CUDA when GGML_NATIVE is set (#280) Kawrakow 2025-03-22 18:17:51 +01:00
  • 6028362ef6 Native build option for CUDA when GGML_NATIVE is set (#280) Kawrakow 2025-03-22 18:17:51 +01:00
  • 46b782b5c0 Native build option for CUDA when GGML_NATIVE is set ik/cuda_native Iwan Kawrakow 2025-03-22 18:17:30 +02:00
  • 5a67c8322e Fighting with cmake (#279) Kawrakow 2025-03-22 16:58:30 +01:00
  • 13ecc5332e Fighting with cmake (#279) Kawrakow 2025-03-22 16:58:30 +01:00
  • ffabdce3ce Fighting with cmake ik/fix_again_cmake Iwan Kawrakow 2025-03-22 17:49:43 +02:00
  • 68aa5b19a8 Use the actual page size used for mmap also in munmap Iwan Kawrakow 2025-03-22 14:40:43 +02:00
  • 54d9cb79ec Adding ability to use THP on Linux Iwan Kawrakow 2025-03-22 14:32:23 +02:00
  • 0964a49990 Cleanup ik/better_flash_mla Iwan Kawrakow 2025-03-22 12:00:26 +02:00
  • 988be1f8f0 Handle rk2%nth_k != 0 Iwan Kawrakow 2025-03-22 11:53:15 +02:00
  • ece257f645 Fix it for nth > rk2 Iwan Kawrakow 2025-03-22 10:55:06 +02:00
  • 42b0e3921b Add Gemma3 support (text only) (#276) Kawrakow 2025-03-22 08:05:10 +01:00
  • d8584a1bbe Add Gemma3 support (text only) (#276) Kawrakow 2025-03-22 08:05:10 +01:00
  • e1684a8d47 Revert changes to convert_hf_to_gguf.py ik/gemma3 Iwan Kawrakow 2025-03-21 19:01:33 +02:00
  • f939b55946 gemma3: build_gemma3 seems to be working now Iwan Kawrakow 2025-03-21 18:21:33 +02:00
  • 0989a9f76f WIP Gemma3: not working Iwan Kawrakow 2025-03-21 17:20:11 +02:00
  • eff34cf265 Fix bug: missing parentheses in logical expression (#275) Kawrakow 2025-03-21 13:23:01 +01:00
  • 3d6e25c82d Fix bug: missing parentheses in logical expression (#275) Kawrakow 2025-03-21 13:23:01 +01:00
  • 5e1944bdec Fix bug: missing parentheses in logical expression ik/bug_missing_parentheses Iwan Kawrakow 2025-03-21 14:17:48 +02:00
  • 4158743014 Specify tensor name regex for tensors to be repacked (#274) Kawrakow 2025-03-21 10:51:37 +01:00
  • 022660f7ab Specify tensor name regex for tensors to be repacked (#274) Kawrakow 2025-03-21 10:51:37 +01:00
  • da7d0ffba6 Specify tensor name regex for tensors to be repacked ik/offline_repack_patterns Iwan Kawrakow 2025-03-21 09:03:13 +02:00
  • 24e780ba74 FlashMLA-3: the best of both worlds (CPU only) (#273) Kawrakow 2025-03-21 07:24:22 +01:00
  • ddc8eee10e FlashMLA-3: the best of both worlds (CPU only) (#273) Kawrakow 2025-03-21 07:24:22 +01:00
  • c5e554f941 Convert models to row-interleaved quants using the quantize tool (#272) Kawrakow 2025-03-21 07:23:36 +01:00
  • b8d1fac97b Convert models to row-interleaved quants using the quantize tool (#272) Kawrakow 2025-03-21 07:23:36 +01:00
  • 4632cb94d8 FlashMLA-3: the best of both worlds - CPU only ik/FlashMLA-3 Iwan Kawrakow 2025-03-20 16:07:18 +02:00
  • 9fe6fc3782 Add missing include ik/offline_repack Iwan Kawrakow 2025-03-20 16:57:38 +02:00
  • 94576a5a6e Another one Iwan Kawrakow 2025-03-20 16:51:19 +02:00
  • d27b72268a Fix GCC 13.3 compilation error Iwan Kawrakow 2025-03-20 16:46:06 +02:00
  • 9fbe5beef7 Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved Iwan Kawrakow 2025-03-20 15:24:15 +02:00
  • fe24edab76 Fixed various issues Iwan Kawrakow 2025-03-20 11:52:59 +02:00
  • 561cc0cef8 WIP Iwan Kawrakow 2025-03-20 10:17:06 +02:00
  • 20df7b89c8 Repack a model with the quantize tool Iwan Kawrakow 2025-03-20 09:11:33 +02:00
  • 712de34b12 Honor mmap setting when using tensor overrides (#270) Kawrakow 2025-03-19 19:17:03 +01:00
  • 127c6ee649 Honor mmap setting when using tensor overrides (#270) Kawrakow 2025-03-19 19:17:03 +01:00
  • 1b62d0fae3 Honor mmap setting when using tensor overrides ik/tensor_override_honor_mmap Iwan Kawrakow 2025-03-19 17:05:04 +02:00
  • f2997472f4 Fix ggml_compute_forward_dup_q (#269) Kawrakow 2025-03-19 15:47:24 +01:00
  • 22c84a126f Fix ggml_compute_forward_dup_q (#269) Kawrakow 2025-03-19 15:47:24 +01:00
  • 60c9495c2f Fix ggml_compute_forward_dup_q ik/fix_dup_q Iwan Kawrakow 2025-03-19 16:44:34 +02:00
  • 623b5b6cca Prevent FlashMLA-1 from running on CUDA (#268) Kawrakow 2025-03-19 13:03:59 +01:00
  • c3b75c531c Prevent FlashMLA-1 from running on CUDA (#268) Kawrakow 2025-03-19 13:03:59 +01:00
  • 529f75c220 Prevent FlashMLA-1 from running on CUDA ik/avoid_cuda_mla_1 Iwan Kawrakow 2025-03-19 12:07:51 +02:00
  • 1bc07b5ccd Allow q8_0 cache on the CPU for FlashMLA-2 (#265) Kawrakow 2025-03-18 15:41:05 +01:00
  • 8e549b4234 Allow q8_0 cache on the CPU for FlashMLA-2 (#265) Kawrakow 2025-03-18 15:41:05 +01:00
  • 7d8119d0ba Make Q8_0 KV cache work with mla=2,fa on CUDA (#264) Kawrakow 2025-03-18 15:40:47 +01:00
  • 68a5b60408 Make Q8_0 KV cache work with mla=2,fa on CUDA (#264) Kawrakow 2025-03-18 15:40:47 +01:00
  • 96d1235fb0 Allow q8_0 cache on the CPU for FlashMLA-2 ik/mla2_q80_cache_cpu Iwan Kawrakow 2025-03-18 14:08:52 +02:00
  • a9440bd3e9 Make Q8_0 KV cache work with mla=2,fa on CUDA ik/mla2_q80_cache Iwan Kawrakow 2025-03-18 11:57:32 +02:00
  • 264071c351 Fix #261 (#262) Kawrakow 2025-03-18 07:44:43 +01:00
  • f4ebf13b6a Fix #261 (#262) Kawrakow 2025-03-18 07:44:43 +01:00
  • 55b2cf98d2 Fix #261 ik/fix_pr_261 Iwan Kawrakow 2025-03-18 08:43:45 +02:00
  • f8277ced45 Compile time option to use bf16 for quants without MMQ kernels (#261) Kawrakow 2025-03-18 07:37:10 +01:00
  • bdcae905c4 Compile time option to use bf16 for quants without MMQ kernels (#261) Kawrakow 2025-03-18 07:37:10 +01:00
  • 9fe2b06f79 FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260) Kawrakow 2025-03-18 07:36:42 +01:00
  • dcdfad29f7 FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260) Kawrakow 2025-03-18 07:36:42 +01:00
  • f326a5eaf7 Compile time option to use bf16 for quants without MMQ kernels ik/use_bf16_when_no_mmq Iwan Kawrakow 2025-03-17 20:38:20 +02:00
  • b147e31f5a Reduce memory usage for FlashMLA-2 ik/flash_mla2_cuda_no_f32 Iwan Kawrakow 2025-03-17 15:00:26 +02:00
  • b9daa401d7 Be able to compute for more than 65535 tokens Iwan Kawrakow 2025-03-17 12:04:52 +02:00
  • 02f1e22917 Merge remote-tracking branch 'origin/main' into ik/flash_mla2_cuda_no_f32 Iwan Kawrakow 2025-03-17 10:32:29 +02:00
  • 0f19a500a9 Prepare wk_b tensors of DeepSeek models on the fly (#259) Kawrakow 2025-03-17 09:31:56 +01:00
  • f91b2e38d0 Prepare wk_b tensors of DeepSeek models on the fly (#259) Kawrakow 2025-03-17 09:31:56 +01:00
  • 5dbfce3980 FlashMLA-2: avoid conversions to f32 also on CUDA Iwan Kawrakow 2025-03-16 18:00:32 +02:00
  • f2fb15de77 Fix CUDA ik/prepare_wk_b Iwan Kawrakow 2025-03-16 07:40:18 +02:00
  • fc03b9adbc Fix case where wkv_b is quantized with k- or i-quants. Iwan Kawrakow 2025-03-15 18:35:37 +02:00
  • 1324de97d2 Add some comments Iwan Kawrakow 2025-03-15 09:54:36 +02:00
  • 552a9cfb18 Merge remote-tracking branch 'origin/main' into ik/prepare_wk_b Iwan Kawrakow 2025-03-15 09:26:07 +02:00
  • e63117e356 Prepare wk_b when loading DeepSeek models (if wk_b is missing) Iwan Kawrakow 2025-03-15 09:24:06 +02:00
  • 676f0e71b4 FlashMLA-2 (CPU): faster and smaller compute buffer size (#253) Kawrakow 2025-03-13 12:07:43 +02:00
  • 305fabfc3b FlashMLA-2 (CPU): faster and smaller compute buffer size (#253) Kawrakow 2025-03-13 12:07:43 +02:00