Commit Graph

  • 7e5af2073c Faster MoE inference (#112) Kawrakow 2024-10-31 12:05:27 +01:00
  • 52874c5d21 Faster MoE inference (#112) Kawrakow 2024-10-31 12:05:27 +01:00
  • 23b7da78d1 Metal: speed up mul_mat_id ik/multi_add Iwan Kawrakow 2024-10-31 09:47:24 +01:00
  • 31c13f100f multi_add: Metal Iwan Kawrakow 2024-10-30 12:30:39 +01:00
  • c08af00548 multi_add: simplify Iwan Kawrakow 2024-10-30 11:02:15 +02:00
  • 8af4111f97 multi_add: CUDA Iwan Kawrakow 2024-10-29 15:08:32 +02:00
  • cf28cff1ff multi_add: CPU works Iwan Kawrakow 2024-10-29 12:26:34 +02:00
  • 70dd470cbc multi_add: WIP Iwan Kawrakow 2024-10-29 12:11:37 +02:00
  • ba3f7a2e94 Use fused mul - unary op also for MoE models (#111) Kawrakow 2024-10-26 18:23:54 +02:00
  • 5ad6439486 Use fused mul - unary op also for MoE models (#111) Kawrakow 2024-10-26 18:23:54 +02:00
  • fe767a45ac Use fused mul - unary op also for MoE models ik/moe_fused_unary Iwan Kawrakow 2024-10-26 19:22:02 +03:00
  • cd96f6c4e5 Bitnet: use the fused mul-silu in the FFN network (#110) Kawrakow 2024-10-26 17:40:32 +02:00
  • 2e5f6db5de Bitnet: use the fused mul-silu in the FFN network (#110) Kawrakow 2024-10-26 17:40:32 +02:00
  • ee9b052414 Bitnet: use the fused mul-silu in the FFN network ik/bitnet_fused_unary Iwan Kawrakow 2024-10-26 18:33:42 +03:00
  • 8ccd9bc7e5 Bitnet CUDA improvements (#109) Kawrakow 2024-10-26 16:26:04 +02:00
  • bd309cb782 Bitnet CUDA improvements (#109) Kawrakow 2024-10-26 16:26:04 +02:00
  • 5e969613e4 iq2_bn(CUDA): quants are not 4-byte aligned ik/bitnet_cuda Iwan Kawrakow 2024-10-26 17:07:54 +03:00
  • fa710abffb iq1_bn: improve CUDA TG Iwan Kawrakow 2024-10-26 16:53:17 +03:00
  • 3b5fa426f1 Improve Bitnet PP on Metal (#108) Kawrakow 2024-10-26 15:13:45 +02:00
  • 3805c84686 Improve Bitnet PP on Metal (#108) Kawrakow 2024-10-26 15:13:45 +02:00
  • 6ffd89eca8 Improve Bitnet PP on Metal ik/bitnet_improve_metal Iwan Kawrakow 2024-10-26 14:47:33 +02:00
  • fdfbd98022 Faster IQ1_BN Metal implementation (#107) Kawrakow 2024-10-26 10:59:59 +02:00
  • f7b05a09dd Faster IQ1_BN Metal implementation (#107) Kawrakow 2024-10-26 10:59:59 +02:00
  • 2d0f9b3663 iq2_bn(Metal): 710 -> 714 t/s for PP-512 ik/iq1bn_metal Iwan Kawrakow 2024-10-26 10:54:43 +02:00
  • ca8f9d7e7e iq1_bn(Metal): 686 -> 702 t/s for PP-512 Iwan Kawrakow 2024-10-26 10:41:14 +02:00
  • 885a48b788 iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128 Iwan Kawrakow 2024-10-26 10:15:43 +02:00
  • a5c3e8839c iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128 Iwan Kawrakow 2024-10-26 10:05:28 +02:00
  • ac0fda624e iq1_bn: faster Metal dot product Iwan Kawrakow 2024-10-26 09:53:24 +02:00
  • 856376a9af Remove forgotten IQ1_TN, IQ2_TN enum values Kawrakow 2024-10-25 14:14:56 +03:00
  • 19cc3329bf Remove forgotten IQ1_TN, IQ2_TN enum values Iwan Kawrakow 2024-10-25 14:14:56 +03:00
  • 4b35340f45 Bitnet changes (#106) Kawrakow 2024-10-25 13:08:43 +02:00
  • 6b968f3894 Bitnet changes (#106) Kawrakow 2024-10-25 13:08:43 +02:00
  • af4255de9c Revert "Avoid rebuild of GGML graph for each token (#98)" ik/adapt_iq1_iq2_bn Iwan Kawrakow 2024-10-25 12:52:59 +02:00
  • 5ccd33ea04 Bitnet: use the standard llm_build_kv to build self attention Iwan Kawrakow 2024-10-24 16:29:26 +03:00
  • d696d64fde Remove iq1_tn and iq2_tn - Part 2 Iwan Kawrakow 2024-10-24 15:34:19 +03:00
  • 5c42877a38 Remove iq1_tn and iq2_tn - Part 1 Iwan Kawrakow 2024-10-24 14:15:03 +02:00
  • 6952e676e2 WIP Iwan Kawrakow 2024-10-24 13:49:20 +02:00
  • 3ba962a68d Adapting iq1_bn, iq2_bn: Metal Iwan Kawrakow 2024-10-24 08:59:33 +02:00
  • 6ef979b7bf Adapting iq1_bn, iq2_bn: NEON Iwan Kawrakow 2024-10-23 20:16:09 +02:00
  • 6191518aac Adapting iq1_bn: CUDA works Iwan Kawrakow 2024-10-23 20:39:24 +03:00
  • fa5bbe53f1 Adapting iq2_bn: CUDA works Iwan Kawrakow 2024-10-23 20:07:13 +03:00
  • 0d17e8c3c7 Adapting iq2_bn: CUDA dequantize Iwan Kawrakow 2024-10-23 19:33:05 +03:00
  • 2db9f1e314 Adapting iq1_bn to work without separate scale tensors Iwan Kawrakow 2024-10-23 18:21:29 +03:00
  • 2e9b3ba92b Adapting iq2_bn to work without separate scale tensors Iwan Kawrakow 2024-10-23 17:57:40 +03:00
  • b535dcd416 Fix quantized k-cache without FA (#105) Kawrakow 2024-10-24 12:20:30 +02:00
  • 9114078959 Fix quantized k-cache without FA (#105) Kawrakow 2024-10-24 12:20:30 +02:00
  • 8957ff4963 This fixes it ik/fix_quantized_k_cache Iwan Kawrakow 2024-10-24 13:16:06 +03:00
  • d64d602faa Added Johannes' changes, still getting NaNs with quantized k-cache. Iwan Kawrakow 2024-10-24 10:44:43 +03:00
  • 33a582466d Add support for Granite and GraniteMoE models (#102) Kawrakow 2024-10-22 17:28:14 +02:00
  • b61cf7d0d7 Add support for Granite and GraniteMoE models (#102) Kawrakow 2024-10-22 17:28:14 +02:00
  • 8834177686 Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication ik/add_granite Iwan Kawrakow 2024-10-22 12:24:15 +03:00
  • 4a952eb16d Add Granite and GraniteMoE models Iwan Kawrakow 2024-10-22 11:56:45 +03:00
  • 0f3a424166 Enable q6_0 for flash attention (#101) Kawrakow 2024-10-22 11:34:49 +02:00
  • 462c6cd7b1 Enable q6_0 for flash attention (#101) Kawrakow 2024-10-22 11:34:49 +02:00
  • 5f3e6faac8 Enable q6_0 for flash attention ik/fattn_enable_q6_0 Iwan Kawrakow 2024-10-21 16:30:10 +03:00
  • 7c5a91daf1 Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99) Kawrakow 2024-10-21 12:16:54 +02:00
  • dbf951df15 Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99) Kawrakow 2024-10-21 12:16:54 +02:00
  • 599a2b7806 Fix typo, which is not really a bug ik/fattn_enable_iq4_nl Iwan Kawrakow 2024-10-21 13:12:14 +03:00
  • 663a889757 Remove file added by mistake Iwan Kawrakow 2024-10-21 11:07:41 +03:00
  • 1322c3f3e5 Add IQ4_NL + IQ4_NL to FA Iwan Kawrakow 2024-10-21 11:06:24 +03:00
  • ca7e403946 Update printout of allowed quantized KV-cache combinations Iwan Kawrakow 2024-10-21 08:32:52 +03:00
  • 67acecce76 We don't need these Iwan Kawrakow 2024-10-20 11:57:25 +03:00
  • 4ee931abf9 Enable IQ4_NL for V-cache in token generation Iwan Kawrakow 2024-10-20 11:49:30 +03:00
  • d336410509 Avoid rebuild of GGML graph for each token (#98) agray3 2024-10-20 07:36:16 +01:00
  • f2d315b46f Avoid rebuild of GGML graph for each token (#98) agray3 2024-10-20 07:36:16 +01:00
  • b091a3513e Bitnet: make the scale tensors optional (#97) Kawrakow 2024-10-19 18:52:58 +02:00
  • afbf2ef3e2 Bitnet: make the scale tensors optional (#97) Kawrakow 2024-10-19 18:52:58 +02:00
  • a3fe796f6c Bitnet: make the scale tensors optional ik/bitnet_optional_scales Iwan Kawrakow 2024-10-19 19:36:44 +03:00
  • b94179b741 Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96) Nexes the Elder 2024-10-19 17:24:43 +02:00
  • a077f09bcb Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96) Nexes the Elder 2024-10-19 17:24:43 +02:00
  • a049537904 Attempt to blindly fix Windows build failure (#93) Kawrakow 2024-10-19 11:43:04 +02:00
  • 7b886ae3d8 Attempt to blindly fix Windows build failure (#93) Kawrakow 2024-10-19 11:43:04 +02:00
  • 0e76d21b96 Adding agray3's graph caching approach ik/cached_graph Iwan Kawrakow 2024-10-18 18:01:08 +03:00
  • e732da1f57 Attempt to blindly fix Windows build failure ik/fix_reduce_windows Iwan Kawrakow 2024-10-18 12:35:47 +03:00
  • c4292bf2d9 iq4_knn: Metal - predictably bad ik/iq4_knn Iwan Kawrakow 2024-10-18 09:16:49 +02:00
  • 780929a6d0 iq4_knn: ARM_NEON Iwan Kawrakow 2024-10-18 08:02:22 +02:00
  • cc912c3f7c iq4_knn: Zen4 Iwan Kawrakow 2024-10-18 11:43:13 +03:00
  • ebb5eb0fc8 iq4_knn: Basics + CUDA Iwan Kawrakow 2024-10-16 18:37:47 +03:00
  • 2b1af6bade CLI - Specify GGML_TYPE to quantize for the main tensors. (#91) Nexes the Elder 2024-10-18 09:48:15 +02:00
  • 03cabe1540 CLI - Specify GGML_TYPE to quantize for the main tensors. (#91) Nexes the Elder 2024-10-18 09:48:15 +02:00
  • f369c6f921 Adding IQ4_KSS: 4.0 bpw quants (#89) Kawrakow 2024-10-16 15:18:26 +03:00
  • 76b97c8064 Adding IQ4_KSS: 4.0 bpw quants (#89) Kawrakow 2024-10-16 15:18:26 +03:00
  • 9612cd79d6 iq4_kss: very slightly faster Metal dot product ik/iq4_kss Iwan Kawrakow 2024-10-16 15:08:15 +03:00
  • 1469b22035 iq4_kss: AVX2 Iwan Kawrakow 2024-10-16 10:18:28 +03:00
  • 7cbe979ee0 iq4_kss: somewhat faster Metal dot product Iwan Kawrakow 2024-10-16 08:47:14 +03:00
  • e01045b02e iq4_kss: Metal Iwan Kawrakow 2024-10-16 08:10:16 +03:00
  • df09b884e0 iq4_kss: ARM_NEON. Predictably very slow Iwan Kawrakow 2024-10-16 07:36:15 +03:00
  • 50fbe44766 iq4_kss: new bit arrangement - CUDA and Zen4 work Iwan Kawrakow 2024-10-15 19:17:45 +03:00
  • 026adac30d iq4_kss: CUDA works Iwan Kawrakow 2024-10-15 15:07:30 +03:00
  • bb0e3f957e iq4_kss: another small quantization improvement Iwan Kawrakow 2024-10-15 14:34:11 +03:00
  • b68c2cb0e0 iq4_kss: slightly better quantization Iwan Kawrakow 2024-10-15 13:49:51 +03:00
  • b159b2b113 iq4_kss: CUDA dequantize works Iwan Kawrakow 2024-10-15 11:48:48 +03:00
  • fd89bf186e iq4_kss: WIP Iwan Kawrakow 2024-10-15 09:40:11 +03:00
  • a09de6eaef iq4_ks: faster dot product on Metal (#90) Kawrakow 2024-10-16 14:13:03 +03:00
  • 993ca95e9e iq4_ks: faster dot product on Metal (#90) Kawrakow 2024-10-16 14:13:03 +03:00
  • 3e0c2519d3 iq4_ks: faster dot product on Metal ik/metal_faster_iq4ks Iwan Kawrakow 2024-10-16 14:04:59 +03:00
  • 1882040c70 Minor iq3_k tweak Kawrakow 2024-10-14 18:13:11 +03:00
  • ff23008ed4 Minor iq3_k tweak Iwan Kawrakow 2024-10-14 18:13:11 +03:00
  • 250c325e7e iq3_k: fix and optimize Metal dot product (#87) Kawrakow 2024-10-14 10:46:41 +03:00
  • 302a6225a1 iq3_k: fix and optimize Metal dot product (#87) Kawrakow 2024-10-14 10:46:41 +03:00