Commit Graph

  • 4d3ecb5852 Be able to use IQ4_NL for KV cache on AVX2/Zen4 Iwan Kawrakow 2024-10-01 13:18:35 +03:00
  • 7d9d275fdd CUDA: faster float -> iq4_nl conversion (#73) Kawrakow 2024-10-01 12:28:29 +03:00
  • 8457a26f83 CUDA: faster float -> iq4_nl conversion (#73) Kawrakow 2024-10-01 12:28:29 +03:00
  • f265260f23 Merge remote-tracking branch 'origin/main' into ik/cuda_faster_iq4nl_kvcache ik/cuda_faster_iq4nl_kvcache Iwan Kawrakow 2024-10-01 12:26:53 +03:00
  • cac3f4f5df Speed up float -> iq4_nl conversion on CUDA Iwan Kawrakow 2024-10-01 11:54:35 +03:00
  • ee274a4148 iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72) Kawrakow 2024-10-01 10:56:50 +03:00
  • c2ff4f936a iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72) Kawrakow 2024-10-01 10:56:50 +03:00
  • a6b097c1b1 Fix AVX2 ik/better_iq4_nl Iwan Kawrakow 2024-10-01 10:54:58 +03:00
  • c92f73fc57 iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 Iwan Kawrakow 2024-10-01 10:37:40 +03:00
  • cd1002670c POC SVD: try involving the quantized weights. ik/try_svd Iwan Kawrakow 2024-09-30 21:55:40 +03:00
  • c44347e137 Be able to do SVD before and after quantization Iwan Kawrakow 2024-09-30 19:38:15 +03:00
  • bfd7e090d0 SVD POC: experimenting Iwan Kawrakow 2024-08-19 10:31:42 +03:00
  • 931e4615df SVD POC: multi-threading Iwan Kawrakow 2024-08-14 20:26:40 +03:00
  • 903d389e0f SVD POC: sprinkle some AVX512 Iwan Kawrakow 2024-08-14 18:29:31 +03:00
  • 301bcd4d21 SVD POC: simdify (AVX2) Iwan Kawrakow 2024-08-14 18:12:09 +03:00
  • 4dfc15f92f POC: add ability to try SVD on the difference between model and quantized model Iwan Kawrakow 2024-08-14 17:08:04 +03:00
  • 480a405a9c iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71) Kawrakow 2024-10-01 08:57:34 +03:00
  • 8cba4789da iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71) Kawrakow 2024-10-01 08:57:34 +03:00
  • 5f3f3bb09e iqk_mul_mat: better strategy when nrc_y not divisible by ny ik/better_iqk_strategy Iwan Kawrakow 2024-10-01 08:12:29 +03:00
  • cd7e7b6bbc Allow bf16 kv-cache (#69) Kawrakow 2024-09-29 09:03:52 +03:00
  • fd20638bbc Allow bf16 kv-cache (#69) Kawrakow 2024-09-29 09:03:52 +03:00
  • d12d0e9b04 Allow bf16 kv-cache ik/bf16_kv_cache Iwan Kawrakow 2024-09-29 08:42:33 +03:00
  • f55789e50a Time to fix replace_all (#68) Kawrakow 2024-09-28 17:59:47 +03:00
  • 1b789c983a Time to fix replace_all (#68) Kawrakow 2024-09-28 17:59:47 +03:00
  • c294485f45 Time to fix replace_all ik/fix_replace_all Iwan Kawrakow 2024-09-28 17:43:54 +03:00
  • 54b1c97878 CUDA non-contiguous RoPE (#66) Kawrakow 2024-09-28 17:41:21 +03:00
  • 7abcc6cc0b CUDA non-contiguous RoPE (#66) Kawrakow 2024-09-28 17:41:21 +03:00
  • 147f9606d0 CUDA non-contiguous RoPE ik/non_contiguous_rope Iwan Kawrakow 2024-09-28 14:37:28 +03:00
  • 947a348990 Adding SWIGLU unary op (#65) Kawrakow 2024-09-28 13:37:25 +03:00
  • 737514fd81 Adding SWIGLU unary op (#65) Kawrakow 2024-09-28 13:37:25 +03:00
  • 05cb629007 GGML_UNARY_OP_SWIGLU: cleanup ik/swiglu Iwan Kawrakow 2024-09-28 13:36:27 +03:00
  • c2d0b1ce86 GGML_UNARY_OP_SWIGLU: minor improvement on Metal Iwan Kawrakow 2024-09-28 11:18:54 +03:00
  • a3d1111f65 GGML_UNARY_OP_SWIGLU: Metal implementation Iwan Kawrakow 2024-09-28 11:05:38 +03:00
  • 79a57b1554 GGML_UNARY_OP_SWIGLU: CUDA implementation Iwan Kawrakow 2024-09-28 10:31:59 +03:00
  • c4886b219c Adding GGML_UNARY_OP_SWIGLU Iwan Kawrakow 2024-09-28 10:11:15 +03:00
  • 843de005d6 Better sub-3-bit quantization mixes with a qkv tensor (#64) Kawrakow 2024-09-28 08:17:19 +03:00
  • 1f61e91862 Better sub-3-bit quantization mixes with a qkv tensor (#64) Kawrakow 2024-09-28 08:17:19 +03:00
  • a8f37b61ee Better sub-3-bit quantization mixes with a qkv tensor ik/phi3.5_tweaks Iwan Kawrakow 2024-09-28 08:09:42 +03:00
  • 733660accd Adding ability to have meta data per tensor row (#61) Kawrakow 2024-09-27 08:16:06 +03:00
  • 6dec4af4b6 Adding ability to have meta data per tensor row (#61) Kawrakow 2024-09-27 08:16:06 +03:00
  • d913611605 Play with barriers ik/play_with_barrier Iwan Kawrakow 2024-09-25 19:04:11 +03:00
  • eb197276dd Play with barriers Iwan Kawrakow 2024-09-25 18:27:39 +03:00
  • 70775dac29 Play with barriers Iwan Kawrakow 2024-09-25 17:28:59 +03:00
  • 0bade93228 Update IQ1_TN and IQ2_TN bpw shown to user ik/per_row_scale Iwan Kawrakow 2024-09-25 13:27:39 +03:00
  • 7784c8928f Per row scales - CUDA Iwan Kawrakow 2024-09-25 11:50:01 +03:00
  • ead4c1e180 POC per row scale: add CUDA TODOs Iwan Kawrakow 2024-09-20 19:33:59 +03:00
  • eb2403f057 POC per row scale: CUDA Iwan Kawrakow 2024-09-20 18:58:34 +03:00
  • 3809087a2f iq1_tn: shrink to 1.625 bpw (NEON and Metal) Iwan Kawrakow 2024-09-20 18:14:04 +03:00
  • 7c61747168 Per row scale Metal templates Iwan Kawrakow 2024-09-20 17:58:23 +03:00
  • 9083a50eae POC per row scale: iq2_tn on Metal Iwan Kawrakow 2024-09-19 18:39:34 +03:00
  • d92910b8f7 POC per row scale: iq2_tn on NEON Iwan Kawrakow 2024-09-19 17:51:02 +03:00
  • 86237d0555 POC: per row scale Iwan Kawrakow 2024-09-17 16:04:59 +03:00
  • 1cdb6993ee Use fp32 for K*Q in Metal FA implementation (#62) Kawrakow 2024-09-25 13:08:55 +03:00
  • 546f3ef349 Use fp32 for K*Q in Metal FA implementation (#62) Kawrakow 2024-09-25 13:08:55 +03:00
  • 95d9f3c103 Use fp32 for K*Q in Metal FA implementation ik/fix_metal_fa Iwan Kawrakow 2024-09-25 13:04:10 +03:00
  • 15b231cb74 Minor Kawrakow 2024-09-19 11:14:53 +03:00
  • be57912955 Minor Iwan Kawrakow 2024-09-19 11:14:53 +03:00
  • d373900e99 Fix compiler warnings (#58) Kawrakow 2024-09-17 14:31:29 +03:00
  • 12bbdb8ce7 Fix compiler warnings (#58) Kawrakow 2024-09-17 14:31:29 +03:00
  • 75ac624a7a Fix warnings in iqk_quantize.cpp ik/fix_ggml_common Iwan Kawrakow 2024-09-17 14:22:37 +03:00
  • f811633485 Disable c99-extensions warning only for APPLE Iwan Kawrakow 2024-09-17 13:18:42 +02:00
  • 11346e7f78 Disable c99-extensions warning Iwan Kawrakow 2024-09-17 13:04:22 +02:00
  • 3c56d2a717 Fix C++ compilation warnings caused by ggml-common.h Iwan Kawrakow 2024-09-17 13:48:12 +03:00
  • 5065dcd4a0 Playing with hsums ik/hsums Iwan Kawrakow 2024-09-17 10:52:23 +03:00
  • 07b5d73837 Also apply to iq2_tn Iwan Kawrakow 2024-09-17 09:46:21 +03:00
  • 94cdadd559 Playing with horizontal sums - matrix times vector Iwan Kawrakow 2024-09-17 09:14:19 +03:00
  • 9790b502e6 Playing with horizontal sums Iwan Kawrakow 2024-09-17 08:35:28 +03:00
  • bd4243bfbf BF16 support on Metal (#56) Kawrakow 2024-09-17 10:54:42 +03:00
  • 4ee889f158 BF16 support on Metal (#56) Kawrakow 2024-09-17 10:54:42 +03:00
  • 8e80d15930 Faster BF16 Metal dot product ik/metal_bf16 Iwan Kawrakow 2024-09-16 17:32:48 +02:00
  • c1d0af0a38 BF16 support on Metal Iwan Kawrakow 2024-09-16 17:01:39 +02:00
  • fc8920282f iqk_mul_mat(ARM_NEON): adding bf16 support (#41) Kawrakow 2024-09-16 16:47:36 +03:00
  • 2874b98400 iqk_mul_mat(ARM_NEON): adding bf16 support (#41) Kawrakow 2024-09-16 16:47:36 +03:00
  • e6d3b6b277 iqk_mul_mat(ARM_NEON): adding bf16 support ik/neon_bf16 Iwan Kawrakow 2024-09-05 13:04:03 +02:00
  • 2d532f85d6 Minor Kawrakow 2024-09-15 12:59:14 +03:00
  • 20f3e6fd2d Minor Iwan Kawrakow 2024-09-15 12:59:14 +03:00
  • ba291cbaed Adding bf16 support to CUDA (#40) Kawrakow 2024-09-14 20:02:32 +03:00
  • 6f11c95994 Adding bf16 support to CUDA (#40) Kawrakow 2024-09-14 20:02:32 +03:00
  • 6bfd4511f9 Adapt to latest master ik/cuda_bf16 Iwan Kawrakow 2024-09-14 19:58:39 +03:00
  • b2b16d8d77 Adding bf16 support to CUDA - cleanup Iwan Kawrakow 2024-09-05 11:32:27 +03:00
  • 38ae720676 Adding bf16 support to CUDA - matrix multiplications Iwan Kawrakow 2024-09-05 11:21:12 +03:00
  • 2a7623ffc6 Improve Q5_0 performance (#55) Kawrakow 2024-09-14 19:47:26 +03:00
  • 76be98fdec Improve Q5_0 performance (#55) Kawrakow 2024-09-14 19:47:26 +03:00
  • 698c2094bb Improve Q5_0 performance ik/avx2_q5_0 Iwan Kawrakow 2024-09-14 17:19:27 +03:00
  • e833fa76a1 Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54) Kawrakow 2024-09-14 13:53:50 +03:00
  • 064b99365c Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54) Kawrakow 2024-09-14 13:53:50 +03:00
  • 349455d1f8 Improve Q4_0 and Q8_0 performance on AVX2/Zen4 ik/avx2_q4_0_q8_0 Iwan Kawrakow 2024-09-14 13:19:53 +03:00
  • a972f41adb Quantization mixes tweaks (#53) Kawrakow 2024-09-14 10:29:44 +03:00
  • 43b934b19f Quantization mixes tweaks (#53) Kawrakow 2024-09-14 10:29:44 +03:00
  • cb369c22dd Some tweaks for iq2_k and iq3_k ik/qmix_tweaks Iwan Kawrakow 2024-09-13 19:50:46 +03:00
  • f77de99df9 Some tweaks for i-quants Iwan Kawrakow 2024-09-13 19:30:03 +03:00
  • e23dce7a51 Minor Kawrakow 2024-09-13 15:46:36 +03:00
  • ec1cbc8884 Minor Iwan Kawrakow 2024-09-13 15:46:36 +03:00
  • 2bafb03aac Fix bug and D < 128 case for Q8_0 k-cache (#52) Kawrakow 2024-09-13 07:19:47 +03:00
  • f853f6c6a5 Fix bug and D < 128 case for Q8_0 k-cache (#52) Kawrakow 2024-09-13 07:19:47 +03:00
  • ebc88e5d9a Fix bug and D < 128 case for Q8_0 k-cache ik/fix_kq Iwan Kawrakow 2024-09-12 22:04:28 +03:00
  • e25c2e7ec2 Quantized Flash Attention for all supported CPU platforms (#51) Kawrakow 2024-09-12 19:03:20 +03:00
  • 5017f8b3f0 Quantized Flash Attention for all supported CPU platforms (#51) Kawrakow 2024-09-12 19:03:20 +03:00
  • 27fa27daf9 Disallow mixing bf16 with other types for kv caches ik/neon_flash_attention_3 Iwan Kawrakow 2024-09-12 18:55:13 +03:00
  • cdd51579e0 Delete no longer used stuff Iwan Kawrakow 2024-09-12 16:26:18 +03:00