Commit Graph

  • 0a17ff156f Tidy up FlashMS Iwan Kawrakow 2024-09-12 14:50:50 +03:00
  • 7c4bc981dc AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0 Iwan Kawrakow 2024-09-12 13:14:37 +03:00
  • 3539e4caa2 Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0 Iwan Kawrakow 2024-09-12 12:55:52 +03:00
  • 4ff2c6d188 NEON Flash Attention: quantized K*Q for q8_0 Iwan Kawrakow 2024-09-12 10:38:44 +02:00
  • c3dc5a27bb NEON Flash Attention: quantized K*Q for q4_1 Iwan Kawrakow 2024-09-12 09:57:04 +02:00
  • 0b6d6541d7 NEON Flash Attention: quantized K*Q for q4_0 Iwan Kawrakow 2024-09-12 09:12:35 +02:00
  • 2dee479c44 NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1 Iwan Kawrakow 2024-09-12 06:26:37 +02:00
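The commits above compute K*Q directly on quantized data: when the KV cache is stored as q8_0/q4_0/q4_1, the activations Q are also quantized to q8_0 so the dot products run on int8 blocks instead of dequantized floats. A minimal sketch of the idea (not the repo's NEON/AVX2 SIMD kernels; `quantize_q8_0` and `dot_q8_0` are illustrative names):

```python
import numpy as np

QK8_0 = 32  # q8_0 block size: 32 weights sharing one scale

def quantize_q8_0(x):
    """Quantize to q8_0: per-block scale d = amax/127, int8 quants."""
    x = x.reshape(-1, QK8_0)
    d = np.abs(x).max(axis=1) / 127.0
    d[d == 0] = 1.0
    q = np.round(x / d[:, None]).astype(np.int8)
    return d, q

def dot_q8_0(d_a, q_a, d_b, q_b):
    """Dot product of two q8_0 vectors: int8 MACs accumulated in int32,
    scaled once per block - the core of quantized K*Q."""
    acc = (q_a.astype(np.int32) * q_b.astype(np.int32)).sum(axis=1)
    return float((d_a * d_b * acc).sum())

rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
approx = dot_q8_0(*quantize_q8_0(k), *quantize_q8_0(q))
exact = float(k @ q)
```

The integer accumulation is what maps onto NEON `vdotq_s32` / AVX2 `_mm256_maddubs_epi16`-style instructions in the actual kernels.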
  • 7874e4425f AVX2 Flash Attention 2 (#50) Kawrakow 2024-09-11 19:55:42 +03:00
  • c920195edd AVX2 Flash Attention 2 (#50) Kawrakow 2024-09-11 19:55:42 +03:00
  • 95fe6923ad Fix Zen4 ik/avx2_flash_attn_2 Iwan Kawrakow 2024-09-11 19:44:34 +03:00
  • 70f3a8882c AVX2 Flash Attention: add ability to use Q4_1 for kv-cache Iwan Kawrakow 2024-09-11 19:37:16 +03:00
  • 62809e7097 AVX2 Flash Attention: add ability to use Q4_0 for kv-cache Iwan Kawrakow 2024-09-11 18:39:05 +03:00
  • 89d23a3176 AVX2 Flash Attention: add ability to use Q8_0 for kv-cache Iwan Kawrakow 2024-09-11 17:34:13 +03:00
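The point of a quantized kv-cache (the Q8_0/Q4_0/Q4_1 options added above) is memory: ggml's block formats store 32 weights per block, so the bytes-per-block figures give the savings directly. A small sketch of the arithmetic, with a hypothetical model config as the example:

```python
# bytes per block of 32 weights in ggml's formats:
# q4_0 = fp16 scale + 16 nibble bytes; q4_1 adds an fp16 min; q8_0 = scale + 32 int8
BLOCK = 32
fmt_bytes = {"f16": 64, "q8_0": 2 + 32, "q4_1": 2 + 2 + 16, "q4_0": 2 + 16}
bpw = {f: 8 * b / BLOCK for f, b in fmt_bytes.items()}
# q4_0: 4.5 bpw, q4_1: 5.0 bpw, q8_0: 8.5 bpw, vs 16 bpw for f16

def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim, fmt):
    """Total K + V cache size for a given (hypothetical) model config."""
    elems = 2 * n_ctx * n_layer * n_head_kv * head_dim  # K and V
    return elems * fmt_bytes[fmt] // BLOCK
```

For example, with assumed 7B-class numbers (n_ctx=4096, n_layer=32, n_head_kv=32, head_dim=128) the f16 cache is 2 GiB, while q4_0 brings it to 576 MiB.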
  • dd52475507 ARM_NEON Flash Attention (#49) Kawrakow 2024-09-11 10:26:49 +03:00
  • d98a6753a6 ARM_NEON Flash Attention (#49) Kawrakow 2024-09-11 10:26:49 +03:00
  • d063007d24 Delete commented out stuff ik/neon_flash_attention_2 Iwan Kawrakow 2024-09-11 08:50:45 +02:00
  • 01b7a3a981 NEON Flash Attention - use fp32 for K*Q operations Iwan Kawrakow 2024-09-11 08:37:26 +02:00
  • 2eb9e212be NEON Flash Attention - convert Q to f16 before computing Q*K Iwan Kawrakow 2024-09-11 07:05:52 +02:00
  • 67bf083f9d NEON Flash Attention - first working version Iwan Kawrakow 2024-09-10 19:01:03 +02:00
  • 808f08787a AVX2 Flash Attention (#48) Kawrakow 2024-09-10 19:17:04 +03:00
  • 72f5dfe12a AVX2 Flash Attention (#48) Kawrakow 2024-09-10 19:17:04 +03:00
  • e3919f5f80 Fix ARM_NEON ik/avx2_flash_attn Iwan Kawrakow 2024-09-10 18:14:59 +02:00
  • 1b12b2658a Try smaller q_step - no improvement Iwan Kawrakow 2024-09-10 18:54:05 +03:00
  • 33b77fc98e Fix Zen4 parts broken via the AVX2 change Iwan Kawrakow 2024-09-10 18:19:29 +03:00
  • 145df42c53 First version of AVX2 Flash attention Iwan Kawrakow 2024-09-10 17:54:04 +03:00
  • 49cbbc9fee iq2_tn: slightly better performance on AVX2 (#47) Kawrakow 2024-09-10 16:21:57 +03:00
  • d17d0c4426 iq2_tn: slightly better performance on AVX2 (#47) Kawrakow 2024-09-10 16:21:57 +03:00
  • 65555e504c iq2_tn: slightly better performance on AVX2 ik/iq2_tn_avx2 Iwan Kawrakow 2024-09-10 13:54:20 +03:00
  • 8cb5e74e26 iq2_tn: reuse iq2_bn implementation (Zen4) ik/iq2_tn_as_iq2_bn Iwan Kawrakow 2024-09-10 10:39:09 +03:00
  • e486643d59 IQ1_TN Metal implementation (#46) Kawrakow 2024-09-10 09:43:05 +03:00
  • a1f7a03f50 IQ1_TN Metal implementation (#46) Kawrakow 2024-09-10 09:43:05 +03:00
  • 7d8e49ef1b Some cleanup ik/iq1_tn_metal Iwan Kawrakow 2024-09-10 08:08:19 +02:00
  • d344a7b8ff iq1_tn: Metal implementation Iwan Kawrakow 2024-09-10 07:57:05 +02:00
  • cc164dc85d Add CUDA support for IQ1_TN (#45) Kawrakow 2024-09-09 21:17:17 +03:00
  • 918ada20fa Add CUDA support for IQ1_TN (#45) Kawrakow 2024-09-09 21:17:17 +03:00
  • a9b15ed82e Delete forgotten TODO ik/iq1_tn_cuda Iwan Kawrakow 2024-09-09 20:10:11 +03:00
  • db41d67243 Delete commented out stuff Iwan Kawrakow 2024-09-09 20:08:07 +03:00
  • f808466309 iq1_tn: adding CUDA dot product Iwan Kawrakow 2024-09-09 20:00:14 +03:00
  • 5b40848742 iq1_tn: adding CUDA dequantize Iwan Kawrakow 2024-09-09 18:42:06 +03:00
  • 4a5d5e207d Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44) Kawrakow 2024-09-09 14:56:34 +03:00
  • 8c86231f93 Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44) Kawrakow 2024-09-09 14:56:34 +03:00
  • 237a2380ee Remove unnecessary barrier in ggml_compute_forward_mul_mat ik/iq1_tn Iwan Kawrakow 2024-09-09 12:53:23 +03:00
  • e9bb1a54ee iq2_bn: improve AVX2 implementation Iwan Kawrakow 2024-09-09 12:38:36 +03:00
  • 61738aa7a2 iq2_bn: improve on Zen4 Iwan Kawrakow 2024-09-09 11:03:12 +03:00
  • 41c8200d08 iq1_tn: improve Zen4 Iwan Kawrakow 2024-09-09 09:02:33 +03:00
  • 45db1385ef iq1_tn: improve AVX2 Iwan Kawrakow 2024-09-09 08:22:24 +03:00
  • c85d7bef55 iq2_bn: improve performance on NEON Iwan Kawrakow 2024-09-09 06:46:25 +02:00
  • 3487e68cc0 iq1_tn: faster NEON Iwan Kawrakow 2024-09-08 21:30:06 +02:00
  • 8d509a7d71 iq1_tn: NEON Iwan Kawrakow 2024-09-08 17:40:08 +02:00
  • c82bf200ce Adding iq1_tn - 1.6875 bpw for TriLM ternary models Iwan Kawrakow 2024-09-08 17:56:15 +03:00
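The 1.6875 bpw figure for IQ1_TN is consistent with a simple packing; the sketch below is an assumption for illustration (the PR defines the actual block layout): ceil(256/5) = 52 bytes of trits packed 5 per byte, plus one fp16 scale, per block of 256 ternary weights.

```python
# hypothetical packing consistent with 1.6875 bpw - not the PR's exact layout
BLOCK = 256                    # assume blocks of 256 ternary weights
assert 3 ** 5 <= 2 ** 8        # 5 trits fit in one byte (243 <= 256)
trit_bytes = -(-BLOCK // 5)    # ceil(256 / 5) = 52 bytes of packed trits
total = trit_bytes + 2         # + one fp16 scale per block = 54 bytes
bpw = 8 * total / BLOCK
assert bpw == 1.6875
```

For reference, the information-theoretic floor for a balanced ternary weight is log2(3) ≈ 1.585 bits, so 1.6875 bpw is within ~6.5% of optimal.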
  • f2ef628e4e iq2_tn: slightly faster PP (#43) Kawrakow 2024-09-08 12:41:44 +03:00
  • bf4b19b474 iq2_tn: slightly faster PP (#43) Kawrakow 2024-09-08 12:41:44 +03:00
  • b7f7eede8a iq2_tn: slightly faster PP ik/iq2_tn_faster_pp Iwan Kawrakow 2024-09-08 12:26:43 +03:00
  • d5aa49b93b Adding fused rms_norm (#42) Kawrakow 2024-09-08 10:19:21 +03:00
  • 6136a4b803 Adding fused rms_norm (#42) Kawrakow 2024-09-08 10:19:21 +03:00
  • d2225010b9 Fused rms_norm WIP ik/fused_rms_norm Iwan Kawrakow 2024-09-08 07:06:38 +02:00
  • 3c46f6f817 Fused rms_norm WIP Iwan Kawrakow 2024-09-08 06:48:13 +02:00
  • 03fa830c5f Fused rms_norm WIP Iwan Kawrakow 2024-09-08 06:38:58 +03:00
  • 5bbbfc62da Fused rms_norm WIP Iwan Kawrakow 2024-09-07 23:04:31 +03:00
  • 889cda0bba Fused rms_norm WIP Iwan Kawrakow 2024-09-07 21:38:49 +03:00
  • 4d5c76b977 Fused rms_norm: works on the CPU Iwan Kawrakow 2024-09-07 19:21:51 +03:00
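The fused rms_norm commits above combine the normalization with the elementwise multiply by the norm weights that always follows it, so the tensor is read and written once instead of twice. A minimal numpy sketch of the fused op (illustrative, not the repo's implementation):

```python
import numpy as np

def rms_norm_fused(x, w, eps=1e-6):
    """rms_norm and the multiply by the norm weights in a single pass,
    instead of two separate ops materializing two tensors."""
    inv_rms = 1.0 / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x * inv_rms * w

x = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
w = np.full(8, 0.5, dtype=np.float32)
y = rms_norm_fused(x, w)
# unfused reference: rms_norm first, then the weight multiply
ref = (x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-6)) * w
```

On the CPU the win is mostly memory traffic: one fewer full-tensor intermediate per transformer layer.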
  • 18f5bb47d8 Add support for bf16 to iqk_mul_mat (#39) Kawrakow 2024-09-05 07:48:27 +03:00
  • 0087008d29 Add support for bf16 to iqk_mul_mat (#39) Kawrakow 2024-09-05 07:48:27 +03:00
  • 8d47523e7e Improve TG speed (when not memory bound) ik/mul_mat_bf16 Iwan Kawrakow 2024-09-04 19:37:05 +03:00
  • a4c55558d3 Minor Iwan Kawrakow 2024-09-04 19:13:33 +03:00
  • 357c95ee49 WIP: adding BF16 support to iqk_mul_mat Iwan Kawrakow 2024-09-04 18:47:38 +03:00
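bf16 support is cheap to sketch because bf16 keeps fp32's 8-bit exponent and truncates the mantissa to 7 bits, so conversion is a 16-bit shift plus rounding. A sketch of round-to-nearest-even conversion via bit manipulation (assumes finite inputs; illustrative names):

```python
import numpy as np

def f32_to_bf16_bits(x):
    """fp32 -> bf16 with round-to-nearest-even, as uint16 bit patterns."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    # add 0x7FFF plus the lsb of the kept part, then drop the low 16 bits
    rounded = u + 0x7FFF + ((u >> 16) & 1)
    return (rounded >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """bf16 -> fp32 is exact: restore 16 zero low bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5], dtype=np.float32)  # both exactly representable in bf16
roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(x))
```

Values exactly representable in bf16 round-trip unchanged; everything else lands on the nearest 7-bit mantissa.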
  • 02e4cc0c18 Zen4 Flash Attention - bf16 support (#38) Kawrakow 2024-09-05 07:46:47 +03:00
  • 7b1b2b2c06 Zen4 Flash Attention - bf16 support (#38) Kawrakow 2024-09-05 07:46:47 +03:00
  • c624232525 Zen4 Flash Attention: improving bf16 ik/zen4_flash_attn_bf16 Iwan Kawrakow 2024-09-04 17:44:29 +03:00
  • fd6de3eb9c Zen4 Flash Attention: improving bf16 Iwan Kawrakow 2024-09-04 15:22:12 +03:00
  • ec0b5f5062 Zen4 Flash Attention: bf16 seems to be working Iwan Kawrakow 2024-09-04 14:57:42 +03:00
  • 8218e77dec Zen4 Flash Attention: WIP bf16 Iwan Kawrakow 2024-09-04 13:09:24 +03:00
  • d47e1c63b3 Performance improvements for legacy quants on ARM_NEON (#37) Kawrakow 2024-09-04 07:24:04 +03:00
  • f17d0d72f5 Performance improvements for legacy quants on ARM_NEON (#37) Kawrakow 2024-09-04 07:24:04 +03:00
  • 9d3460446d WIP: trying to improve legacy quants ik/neon_improve_legacy_quants Iwan Kawrakow 2024-09-03 16:18:18 +02:00
  • 355210a1bb WIP: trying to improve legacy quants Iwan Kawrakow 2024-09-03 15:35:30 +02:00
  • 0449090ae8 Zen4 Flash Attention 2 (#36) Kawrakow 2024-09-04 07:20:55 +03:00
  • 8c94dcd433 Zen4 Flash Attention 2 (#36) Kawrakow 2024-09-04 07:20:55 +03:00
  • fffd040281 Delete unused stuff ik/zen4_flash_attn_2 Iwan Kawrakow 2024-09-03 13:12:33 +03:00
  • 9b6f9bb76c Zen4 Flash Attention: add q4_1 Iwan Kawrakow 2024-09-03 11:13:28 +03:00
  • e73835de95 Zen4 Flash Attention: small q8_0 performance improvement Iwan Kawrakow 2024-09-03 10:56:20 +03:00
  • a4256004a8 Zen4 Flash Attention: it works for q4_0 and q8_0 Iwan Kawrakow 2024-09-03 10:17:49 +03:00
  • 80d74b32e4 Zen4 Flash Attention: WIP generalize to other types Iwan Kawrakow 2024-09-02 19:40:06 +03:00
  • 724854e7db Fix Zen4 Flash Attention (#35) Kawrakow 2024-09-02 15:54:24 +03:00
  • 9b53c2533f Fix Zen4 Flash Attention (#35) Kawrakow 2024-09-02 15:54:24 +03:00
  • de91911d7a Fix Zen4 Flash Attention ik/fix_flash_attn Iwan Kawrakow 2024-09-02 15:51:10 +03:00
  • 2db35edf71 Do not process prompts containing binary data for escapes (#33) Kawrakow 2024-09-02 09:18:48 +03:00
  • 5518e24be8 Do not process prompts containing binary data for escapes (#33) Kawrakow 2024-09-02 09:18:48 +03:00
  • 6bc273c1d6 Do not process prompts containing binary data for escapes ik/fix_multiple_choice Iwan Kawrakow 2024-09-02 09:12:08 +03:00
  • 59c2e77869 Zen4 Flash Attention (#32) Kawrakow 2024-09-01 16:08:21 +03:00
  • dc023bc3be Zen4 Flash Attention (#32) Kawrakow 2024-09-01 16:08:21 +03:00
  • a66d1fc562 Update FlashAttn comment ik/zen4_flash_attn Iwan Kawrakow 2024-09-01 12:47:50 +03:00
  • 94439ea73c Flash attention refinements Iwan Kawrakow 2024-09-01 11:16:07 +03:00
  • 8694617908 Add flash attention with soft-cap and fix D = 256 case Iwan Kawrakow 2024-08-31 19:03:40 +03:00
  • 412c6a98af Zen4 flash attention: moving useful parts from the kq_fused_softmax branch Iwan Kawrakow 2024-08-31 17:05:50 +03:00
  • 1b834ac6e4 Flash attention: templated implementation ik/kq_fused_softmax Iwan Kawrakow 2024-08-31 13:10:36 +03:00
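The Zen4 flash-attention commits above (including the soft-cap commit) follow the standard one-pass online-softmax formulation: process K/V in blocks, keep a running max and denominator, and rescale partial results as the max grows; soft-capping squashes each logit through `cap * tanh(logit / cap)` before the softmax. A minimal single-query sketch under those assumptions (not the templated Zen4 kernel):

```python
import numpy as np

def flash_attn_row(q, K, V, scale, softcap=0.0, block=32):
    """One query row of flash attention: online softmax over K/V blocks,
    with optional logit soft-capping."""
    m = -np.inf                          # running max logit
    s = 0.0                              # running softmax denominator
    acc = np.zeros(V.shape[1])           # running weighted sum of V rows
    for i in range(0, K.shape[0], block):
        logits = (K[i:i + block] @ q) * scale
        if softcap > 0.0:
            logits = softcap * np.tanh(logits / softcap)
        m_new = max(m, logits.max())
        corr = np.exp(m - m_new)         # rescale earlier partial results
        p = np.exp(logits - m_new)
        s = s * corr + p.sum()
        acc = acc * corr + p @ V[i:i + block]
        m = m_new
    return acc / s

rng = np.random.default_rng(2)
q = rng.standard_normal(16)
K = rng.standard_normal((96, 16))
V = rng.standard_normal((96, 16))
out = flash_attn_row(q, K, V, scale=16 ** -0.5)
# reference: ordinary full-materialization softmax attention
logits = (K @ q) * 16 ** -0.5
p = np.exp(logits - logits.max())
ref = (p / p.sum()) @ V
```

The rescaling by `corr` is what lets the kernel never materialize the full attention matrix, which is also why fixing the D = 256 head-dimension case only touches the block loop, not the algorithm.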
  • 5ff997021e Fix build when iqk_mul_mat is disabled (#31) Kawrakow 2024-08-31 09:11:42 +03:00
  • dbb1db9899 Fix build when iqk_mul_mat is disabled (#31) Kawrakow 2024-08-31 09:11:42 +03:00
  • eb046ae5f3 Fix build when iqk_mul_mat is disabled ik/fix_no_iqk_build Iwan Kawrakow 2024-08-31 08:52:08 +03:00
  • 6d9510c680 WIP Iwan Kawrakow 2024-08-31 07:15:38 +03:00