ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-25 15:44:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	4dc97b187b	Fix broken matrix x vector product on Zen4	2024-12-08 16:23:41 +02:00
Iwan Kawrakow	5de1cf4885	Faster iq4_xs_r4 on Zen4 The trick is to simply prepare the Q8 block sums for blocks of 32 as floats. This brings PP-512 up to 254.6 t/s from 224 t/s.	2024-12-08 15:44:49 +02:00
Kawrakow	fc701cedd1	Rename iq4_nl_x4 to iq4_nl_r4 (#126 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-08 09:34:42 +01:00
Kawrakow	ef95b81733	R4 improvements on ARM_NEON (#125 ) * q4_0_r4: 6% faster PP on NEON * qx_0_r4_q8_0 template Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP ~7% faster. * Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4 * Simplify * Minor iq4_xs_r4 improvement on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-08 09:13:10 +01:00
Kawrakow	3682e4700d	iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124 ) * Adding iq2_bn_r4 This Zen4-only implementation achieves PP-512 = 826 t/s (!!!) for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn. * Make sure rows per thread are a multiple of the number of interleaved rows With this I can run iq2_bn_r4 with 32 threads and this increases PP-512 to 872 t/s. * iq2_bn_r4: 1st shot at NEON PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s for Bitnet-1.58b-3B). TG-128 is ~5% slower. * iq2_bn_r4: NEON PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn for 1 thread, but saturates to about the same 93 t/s at 8 threads. * iq2_bn_r4: Experimenting on NEON The matrix x vector multiplication is erratic. iq2_bn_r4 is faster at 1, 2, and 4 threads, but saturates to a lower t/s at 8 threads compared to iq2_bn. iq2_bn actually manages 99 t/s at 8 threads and not 93 as I wrote in the last commit. iq2_bn_r4 performance has huge fluctuations at 4 and 8 threads. * Some cleanup * iq2_bn_r4: AVX2 As expected, PP is slightly slower as we just don't have enough vector registers (690 vs 710 t/s). TG is slightly faster (18.2 vs 16.7 t/s at 1 thread). * iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn. * iq2_bn_r4: simdify q8_K16 quantization (AVX2) PP-512 becomes 834 t/s and TG-128 now saturates to the same performance as iq2_bn for 4 threads. * iq2_bn_r4: simdify q8_K16 quantization (NEON) PP-512 is now 304.7 t/s, and TG-128 @ 8 threads very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s) * iq2_bn_r4: fix AVX2 after breaking it two commits ago * iq2_bn_r4: better AVX2 As we don't have enough vector registers on AVX2, it is better to do two passes per row needing only half of the accumulator registers that way. With this, we now beat iq2_bn PP also on AVX2 by a small margin. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-06 12:15:39 +01:00
Kawrakow	f64de08203	IQ4_XS_R4 (#123 ) * Adding iq4_xs_r4 This is a 1st working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4. * iq4_xs_r4: WIP * iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4 * iq4_xs_r4: NEON We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs! * DRY --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-04 15:20:07 +01:00
Kawrakow	f1f4eb988f	Q6_0_R4 (#122 ) * Adding q6_0_r4 We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X. * q6_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max. In terms of ops, q6_0_r4 is identical to q5_0_r4 except for loading the high bits being vld1q_u8_x2 instead of vld1q_u8. It is strange that this can make a 5% difference in performance, especially considering that this is amortized (re-used) over 8 columns in the right matrix. Or am I running out of vector registers? * Fix AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-03 14:48:26 +01:00
Kawrakow	c5bf589367	Q5_0_R4 (#121 ) * Adding q5_0_r4 We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X. We even get TG-128 improvement to 11.7 t/s from 11.1 t/s. * q5_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max, up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp is no longer funny: they get 26.5 t/s for Q5_0. For TG, we are now able to fully saturate memory bandwidth and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets 20.6 t/s for Q5_0. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-03 12:59:22 +01:00
Kawrakow	ccec00939a	Q8_0_R4 (#120 ) * Adding q8_0_r4 We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared to 175.6 t/s for Q8_0. * q8_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max. * q8_0_r4: Zen4 matrix-vector specialization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-03 06:15:29 +01:00
Kawrakow	239a344f99	Q4_0_R4 (#119 ) * Adding iq4_0_r4 - q4_0 repacked We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU, so ~5-6% faster than iq4_nl_x4. * q4_0_r4: NEON Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-02 17:01:48 +01:00
Kawrakow	6d0462d4a3	IQ4_NL_X4 (#118 ) * Adding iq4_nl_x4 Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s on the Ryzen-7950X! This is faster than any other quant and ~40% faster than iq4_nl. * iq4_nl_x4: getting amazing This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s! * iq4_nl_x4: AVX2 Here we gain only 25% compared to iq4_nl * iq4_nl_x4: NEON On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from 82.4 t/s for iq4_nl. * iq4_nl_x4: minor NEON improvement and cleanup This gets us to 110.3 t/s. In comparison, IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s. * iq4_nl_x4: NEON specialization for matrix x vector --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-12-02 07:25:39 +01:00
Nexes the Elder	8ad84b9fab	Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K (#116 )	2024-11-21 12:01:23 +02:00
Kawrakow	4d2fbde0cb	MMQ for Q6_0 (#115 ) * MMQ for Q6_0 * Add Q6_0 MMQ to template generator --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-11-21 07:12:11 +01:00
Kawrakow	52874c5d21	Faster MoE inference (#112 ) * multi_add: WIP * multi_add: CPU works * multi_add: CUDA * multi_add: simplify * multi_add: Metal * Metal: speed up mul_mat_id For the Granite-1B MoE model PP-512 goes from 156 t/s to 890 t/s, so nearly a 6X speedup! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-31 12:05:27 +01:00
Kawrakow	5ad6439486	Use fused mul - unary op also for MoE models (#111 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 18:23:54 +02:00
Kawrakow	2e5f6db5de	Bitnet: use the fused mul-silu in the FFN network (#110 ) I had forgotten that build_bitnet() does not use the standard llm_build_ffn function, so the fused mul-silu didn't get used for Bitnet when I added it to llm_build_ffn. This gives us another ~1% speedup for TG-128. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 17:40:32 +02:00
Kawrakow	bd309cb782	Bitnet CUDA improvements (#109 ) * iq1_bn: improve CUDA TG On RTX-3080 TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s. I see I have on the front page 301 t/s, so pretty nice improvement since then. * iq2_bn(CUDA): quants are not 4-byte aligned --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 16:26:04 +02:00
Kawrakow	3805c84686	Improve Bitnet PP on Metal (#108 ) iq1_bn goes from 702 t/s to 716 t/s iq2_bn goes from 714 t/s to 743 t/s Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 15:13:45 +02:00
Kawrakow	f7b05a09dd	Faster IQ1_BN Metal implementation (#107 ) * iq1_bn: faster Metal dot product 82 t/s -> 87.9 t/s * iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128 * iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128 So, total improvement is ~15%. Not bad. * iq1_bn(Metal): 686 -> 702 t/s for PP-512 * iq2_bn(Metal): 710 -> 714 t/s for PP-512 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 10:59:59 +02:00
Iwan Kawrakow	19cc3329bf	Remove forgotten IQ1_TN, IQ2_TN enum values	2024-10-25 14:14:56 +03:00
Kawrakow	6b968f3894	Bitnet changes (#106 ) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	9114078959	Fix quantized k-cache without FA (#105 ) * Added Johannes' changes, still getting NaNs with quantized k-cache. Also getting NaN's on Johannes's mainline branch. * This fixes it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-24 12:20:30 +02:00
Kawrakow	b61cf7d0d7	Add support for Granite and GraniteMoE models (#102 ) * Add Granite and GraniteMoE models * Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 17:28:14 +02:00
Kawrakow	462c6cd7b1	Enable q6_0 for flash attention (#101 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 11:34:49 +02:00
Kawrakow	dbf951df15	Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99 ) * Enable IQ4_NL for V-cache in token generation * We don't need these * Update printout of allowed quantized KV-cache combinations * Add IQ4_NL + IQ4_NL to FA This is a better alternative than Q4_0 + Q4_0 for the VRAM poor. * Remove file added by mistake * Fix typo, which is not really a bug --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-21 12:16:54 +02:00
agray3	f2d315b46f	Avoid rebuild of GGML graph for each token (#98 ) Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.	2024-10-20 08:36:16 +02:00
Kawrakow	afbf2ef3e2	Bitnet: make the scale tensors optional (#97 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-19 18:52:58 +02:00
Nexes the Elder	a077f09bcb	Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96 ) * attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S Pattern worth to be tested on more quants and on L3 8B. PPL 512 = -0.024 for 70b ; - 0.005 for 8b Size = - 640MiB for 70b ; - 64MiB for 8b 70b Q5_K_S now beats Q5_K_M by -0.012 ppl I suspect that it goes for L3 as well, which was quite insensitive to attn_q quantization. * indent	2024-10-19 17:24:43 +02:00
Kawrakow	7b886ae3d8	Attempt to blindly fix Windows build failure (#93 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-19 11:43:04 +02:00
Nexes the Elder	03cabe1540	CLI - Specify GGML_TYPE to quantize for the main tensors. (#91 ) To complement the token_embd.weight and output.weight : attn_v.weight attn_k.weight. attn_q_weight attn_output.weight attn_qkv.weight ffn_gate ffn_down ffn_up	2024-10-18 09:48:15 +02:00
Kawrakow	76b97c8064	Adding IQ4_KSS: 4.0 bpw quants (#89 ) * iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 15:18:26 +03:00
Kawrakow	993ca95e9e	iq4_ks: faster dot product on Metal (#90 ) TG-128(LLaMA-3.1-8B) goes to 52.5 t/s up from 48.4 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 14:13:03 +03:00
Iwan Kawrakow	ff23008ed4	Minor iq3_k tweak	2024-10-14 18:13:11 +03:00
Kawrakow	302a6225a1	iq3_k: fix and optimize Metal dot product (#87 ) * iq3_k: fix Metal dot product I was accessing the scales as 4-byte aligned, but iq3_k is not 4-byte aligned. Instead of throwing an error (as it happens on CUDA when one makes this mistake), Metal silently accepts and we get garbage. * iq3_k: slightly faster Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-14 10:46:41 +03:00
Kawrakow	baab1d9a1e	Fix and optimize iq2k Metal implementation (#86 ) * I somehow broke iq2_k on Metal? - fix dequantize * I somehow broke iq2_k on Metal? - fix dot product * iq2_k: optimize Metal dot product 42.6 t/s -> 46.2 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 14:30:30 +03:00
Kawrakow	910a134094	IQ2_KS: 2.1875 bpw non-linear quantization (#85 ) * Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 13:34:30 +03:00
Iwan Kawrakow	c15de3654e	Minor: printf -> LLAMA_LOG_INFO	2024-10-11 12:49:47 +03:00
Kawrakow	70aca0b75c	Better model info (#84 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-10 18:21:24 +03:00
Kawrakow	b30c9e10d8	New SOTA quantization: 4.25 bpw IQ4_KS (#83 ) * iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, it is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-09 12:54:40 +03:00
Iwan Kawrakow	c0ddc644bb	Fix compiler warnings	2024-10-04 16:17:36 +03:00
Kawrakow	fe36930c8b	Move scale fudge factors to quantization (#81 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-04 16:16:01 +03:00
Kawrakow	bc79091b0e	Move to c++17 projectwide (#80 ) * Slightly better * Make the entire project c++17 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-04 14:43:26 +03:00
Kawrakow	0bf4d99774	Do not quantize activations if not necessary (#79 ) * Do not quantize activations if not necessary * Do not quantize activations if not necessary also for MoE models --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-04 11:22:57 +03:00
Kawrakow	ba392802ef	q6_0: Slightly faster Zen4/AVX2 (#78 ) * Faster q6_0 on AVX2 PP-512 goes up by 3.4%. * q6_0: this is slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-02 18:09:47 +03:00
Kawrakow	50b5e90112	Fused unary(x)*y (#70 ) * Adding fused y*unary(x) op * Fused y*unary(x) op: CUDA * Fused y*unary(x) op: dedicated CPU implementation for silu and gelu * Fused y*unary(x) op: Metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-02 17:05:56 +03:00
Kawrakow	cce49832c1	Adding Q6_0 (#77 ) * Adding q6_0 - basics + AVX2/Zen4 working * Adding q6_0: CUDA dequantize works, but not mmvq * Adding q6_0: CUDA mmvq works * Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache * Add q6_0 to CPU flash attention Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache gives about the same PPL as q8_0 K-cache and q4_0 V-cache, while needing the exact same RAM. I.e., what was the point? * q6_0: slightly better kv-cache result Better than q8_0+q4_0, but not as good as q8_0+iq4_nl * q6_0: works on ARM_NEON * q6_0: dequantize works on Metal, but not vector dot product * q6_0: it now works on Metal Outperforms q5_0 by a significant margin. E.g. \| model \| size \| params \| backend \| ngl \| threads \| test \| t/s \| \| ------------------------------ \| ---------: \| ---------: \| ---------- \| --: \| ------: \| ------------: \| ---------------: \| \| llama 8B Q6_0 \| 6.08 GiB \| 8.03 B \| Metal \| 100 \| 4 \| tg128 \| 44.02 ± 0.08 \| \| llama 8B Q5_0 \| 5.21 GiB \| 8.03 B \| Metal \| 100 \| 4 \| tg128 \| 40.13 ± 0.12 \| \| llama 8B Q6_0 \| 6.08 GiB \| 8.03 B \| Metal \| 100 \| 4 \| pp512 \| 500.55 ± 0.32 \| \| llama 8B Q5_0 \| 5.21 GiB \| 8.03 B \| Metal \| 100 \| 4 \| pp512 \| 448.02 ± 0.27 \| * q6_0: can now be used for kv-cache on Metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-02 15:22:13 +03:00
Kawrakow	d6909ed6f0	iq4_nl: faster quantization (#76 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-02 08:17:00 +03:00
Kawrakow	0999f77e5b	Fix Q5_0 flash attention (#75 ) When I changed iqk_mul_mat to use type-1 dot products for type-0 legacy quants, I forgot to also change the vec_dot_type when the dot product is done via ggml as in flash attention. This commit fixes it. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-01 15:52:35 +03:00
Iwan Kawrakow	970df4b467	Fix last commit Did not re-check on AVX2/Zen4 after NEON related changes and, sure enough, I broke AVX2/Zen4.	2024-10-01 14:48:44 +03:00
Kawrakow	e7f5a86a41	IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74 ) * Be able to use IQ4_NL for KV cache on AVX2/Zen4 * Be able to use IQ4_NL for KV cache on ARM_NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-01 14:46:40 +03:00


3493 Commits