Commit Graph

  • f05484d9a3 FlashMLA-2: eliminate intermediate f32 tensors Iwan Kawrakow 2025-03-12 10:45:36 +02:00
  • fc6a65dda4 MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252) Kawrakow 2025-03-12 07:21:46 +02:00
  • 50bbc3f335 FlashMLA(CUDA) - allow q8_0 for KV cache ik/cuda_flash_mla_q8_0 Iwan Kawrakow 2025-03-11 18:41:39 +02:00
  • 31c85a8949 FlashMLA(CUDA) - allow q8_0 for KV cache Iwan Kawrakow 2025-03-11 16:53:05 +02:00
  • 99d9036365 WIP Iwan Kawrakow 2025-03-11 11:36:11 +02:00
  • e0eebfd8ad Try using fp32 for FlashMLA ik/flash_precision Iwan Kawrakow 2025-03-10 19:07:53 +02:00
  • a845e2bfd6 FlashMLA(CUDA): WIP to allow q8_0 quantized cache Iwan Kawrakow 2025-03-10 17:18:23 +02:00
  • 1266d12461 DeepSeek imatrix stuff (#250) Kawrakow 2025-03-10 16:19:09 +02:00
  • fcd1e124e0 Faster MoE token generation on CUDA (#248) Kawrakow 2025-03-10 16:16:51 +02:00
  • 56921ccd49 imatrix: wv_b <-> wkv_b ik/mla_imatrix Iwan Kawrakow 2025-03-10 15:01:56 +02:00
  • cfec33848f Guard against numerical precision issues for MLA on CUDA ik/cuda_faster_moe_tg Iwan Kawrakow 2025-03-09 18:24:15 +02:00
  • 90ab066075 Also do it for plain (not fused) mul_mat_id Iwan Kawrakow 2025-03-09 16:05:21 +02:00
  • 461c3199fe Slightly better Iwan Kawrakow 2025-03-09 15:25:41 +02:00
  • adf8e2af57 This gives us ~20% TG speedup for DeepSeek on CUDA Iwan Kawrakow 2025-03-09 15:04:25 +02:00
  • 46b526c2c4 This works on CUDA, but (#247) Kawrakow 2025-03-09 16:53:55 +02:00
  • 1a6712c0ca This works on CUDA, but ik/flash_mla_4 Iwan Kawrakow 2025-03-08 19:41:35 +02:00
  • afa32bdd07 Faster FlashMLA prompt processing (#246) Kawrakow 2025-03-08 19:33:41 +02:00
  • 8fe22695ee FlashMLA-2: on the CPU it now works also with q8_KV ik/flash_mla_2 Iwan Kawrakow 2025-03-08 13:42:41 +02:00
  • b89e4a37ae FlashMLA-2: on the CPU it now works for quantized cache Iwan Kawrakow 2025-03-08 13:21:59 +02:00
  • bf383c1272 FlashMLA-2: faster prompt processing Iwan Kawrakow 2025-03-08 12:27:07 +02:00
  • 77396a74b5 Better FlashMLA (#243) Kawrakow 2025-03-07 09:46:58 +02:00
  • f8fb8ec9aa Custom quantization rules with regular expressions (#244) Kawrakow 2025-03-07 08:54:09 +02:00
  • d29f8d3d40 Add the --custom-q option to the help ik/custom_q_rules Iwan Kawrakow 2025-03-06 18:38:32 +02:00
  • 480c2265cd Custom quantization rules with regular expressions Iwan Kawrakow 2025-03-06 17:35:14 +02:00
  • 862b84bb28 Cleanup ik/better_tg_fattn Iwan Kawrakow 2025-03-06 14:59:14 +02:00
  • ba9466138c WIP Iwan Kawrakow 2025-03-06 12:46:12 +02:00
  • 050d5c5eec This is a better FA for TG Iwan Kawrakow 2025-03-06 11:53:51 +02:00
  • a3f6ee27cc DeepSeek CUDA Flash Attention (#241) Kawrakow 2025-03-05 07:27:49 +02:00
  • c5a9bd4bf9 CUDA FA with Dk != Dv: it works now for DeepSeek ik/cuda_fattn_Dk_Dv Iwan Kawrakow 2025-03-04 10:07:09 +02:00
  • f064db93b2 CUDA FA WIP - TG, not working yet. Iwan Kawrakow 2025-03-03 22:23:59 +02:00
  • 47474c1c7e CUDA FA WIP - it now works for Q8_0 + Q8_0 for KV cache Iwan Kawrakow 2025-03-03 19:02:13 +02:00
  • 0a6542b503 CUDA FA WIP - It actually works! Iwan Kawrakow 2025-03-03 18:52:34 +02:00
  • 3c72a7b47c WIP Iwan Kawrakow 2025-03-03 17:59:42 +02:00
  • 4c673e7ace WIP CUDA FA with Dk != Dv Iwan Kawrakow 2025-03-03 16:49:20 +02:00
  • 6719288bf0 Flash MLA (CPU only) (#240) Kawrakow 2025-03-03 15:17:51 +02:00
  • 560c6ec7db FlashMLA: that should be it for now ik/flash_mla Iwan Kawrakow 2025-03-03 09:20:30 +02:00
  • 9a31063f31 WIP Iwan Kawrakow 2025-03-03 08:22:19 +02:00
  • f8a7dadbb7 FlashMLA: it now works with iqk Iwan Kawrakow 2025-03-03 07:47:08 +02:00
  • 8f5d81490a WIP Iwan Kawrakow 2025-03-02 17:04:54 +02:00
  • 712484486d It works with ggml FA, not with iqk FA Iwan Kawrakow 2025-03-02 16:21:26 +02:00
  • af91231f93 FlashMLA: allow for f16 and bf16 cache in addition to q8_0 Iwan Kawrakow 2025-03-02 14:22:35 +02:00
  • 16569c670c FlashMLA - it finally works (on the CPU) Iwan Kawrakow 2025-03-02 14:02:40 +02:00
  • 9424c80ab1 SER - Smart Expert Reduction (#239) Kawrakow 2025-03-02 13:47:38 +02:00
  • 101c888724 A better way to measure the cost of ggml_barrier (#238) Kawrakow 2025-03-01 17:12:58 +02:00
  • 8e612d50c1 Add ser option to llama-bench ik/smart_expert_selection Iwan Kawrakow 2025-03-01 13:25:04 +02:00
  • f5ddc6faa8 Smart expert selection Iwan Kawrakow 2025-03-01 12:15:52 +02:00
  • 23e080f576 A better way to measure the cost of ggml_barrier ik/measure_barriers Iwan Kawrakow 2025-03-01 08:28:49 +02:00
  • e787c00141 Reduce size of compute buffers (#237) Kawrakow 2025-03-01 08:25:27 +02:00
  • 84853b9a9b Better concat for contiguous tensors ik/reduce_compute_buffers Iwan Kawrakow 2025-02-28 19:32:36 +02:00
  • 285b97b6bb Much better Iwan Kawrakow 2025-02-28 16:53:02 +02:00
  • efc757c95c This should accomplish it for standard attention Iwan Kawrakow 2025-02-28 14:48:34 +02:00
  • addd8994cd This reduces compute buffer size for MLA Iwan Kawrakow 2025-02-28 14:26:47 +02:00
  • 472b4c37c1 Option to use MLA without a transposed cache (#235) Kawrakow 2025-02-27 16:40:49 +02:00
  • 407ca33b2a Option to use MLA without a transposed cache ik/mla_no_transposed_cache Iwan Kawrakow 2025-02-27 10:28:54 +02:00
  • ed2599d8a3 Faster MLA on CUDA (#234) Kawrakow 2025-02-27 08:42:18 +02:00
  • a107d9664c Cleanup ik/cuda_mla2 Iwan Kawrakow 2025-02-27 08:31:29 +02:00
  • 762e5f94fa Much better MLA Iwan Kawrakow 2025-02-26 18:24:08 +02:00
  • 3468438da8 CUDA: Quantize non-contiguous tensors Iwan Kawrakow 2025-02-26 16:19:56 +02:00
  • 78b407122f Slightly better ik/cuda_mla Iwan Kawrakow 2025-02-26 12:10:37 +02:00
  • f1e1820f8d Slight MLA TG performance improvement on CUDA Iwan Kawrakow 2025-02-26 08:56:31 +02:00
  • 85c6152e85 Give the user the option to override where model weights are stored (#232) Kawrakow 2025-02-25 17:55:58 +02:00
  • 655981cced Add more timing info ik/buffer_type_overrides Iwan Kawrakow 2025-02-25 11:54:59 +02:00
  • b01012e1bd Change long to long long s6/rpc Saood Karim 2025-02-25 00:56:44 -06:00
  • c47ef20fd6 Merge remote-tracking branch 'origin/main' into s6/rpc Saood Karim 2025-02-25 00:38:23 -06:00
  • f16ef779c2 Merge remote-tracking branch 'origin/main' into s6/rpc Saood Karim 2025-02-25 00:25:56 -06:00
  • c2a02dfd09 Add timing info to CUDA graph evaluation Iwan Kawrakow 2025-02-25 08:01:25 +02:00
  • d7ef3a53a7 Fix ggml_nbytes() problem and cleanup Iwan Kawrakow 2025-02-25 07:35:49 +02:00
  • 2572a6de3c Give the user the option to override where model weights are stored Iwan Kawrakow 2025-02-24 16:02:31 +02:00
  • 6ae06d2c5c Fix #230 (#231) Kawrakow 2025-02-24 09:29:58 +02:00
  • 4f2cfd6e8b Fix #230 ik/issue_230 Iwan Kawrakow 2025-02-24 08:05:33 +02:00
  • b50efcc9d2 Fused MoE ffn_up and ffn_gate (#229) Kawrakow 2025-02-23 14:31:11 +02:00
  • cf7b98db88 Adding forgotten gelu, relu, silu on ARM ik/fused_up_gate_unary Iwan Kawrakow 2025-02-23 13:08:32 +02:00
  • a72cd964b0 Add fmoe option to llama-bench Iwan Kawrakow 2025-02-23 11:57:04 +02:00
  • 5bf5467c21 Command line option to enable fused MoE up*unary(gate) Iwan Kawrakow 2025-02-23 11:36:46 +02:00
  • c229183737 On CUDA also fuse MoE down * (up * unary(gate)) Iwan Kawrakow 2025-02-23 09:47:01 +02:00
  • 001abccf73 Fusing MoE up * unary(gate): CUDA Iwan Kawrakow 2025-02-22 19:01:18 +02:00
  • da64673670 Fusing MoE up * unary(gate) Iwan Kawrakow 2025-02-22 17:39:33 +02:00
  • ce1b59f08c Add new sweep-bench benchmark (#225) saood06 2025-02-23 00:16:27 -06:00
  • 2212c1c636 Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226) Kawrakow 2025-02-23 08:02:16 +02:00
  • c2d708ae10 Fix JSONL output s6/sweep_bench Saood Karim 2025-02-22 22:30:12 -06:00
  • 55d33a5a91 Fix compilation error with IQK_FA_ALL_QUANTS enabled ik/issue_224 Iwan Kawrakow 2025-02-23 06:12:15 +02:00