ikawrakow / ik_llama.cpp
mirror of https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-02-23 22:54:10 +00:00
Files
At commit f5ac78de5cf6d0b4786d2496c2ead70e41b4cd9e
ik_llama.cpp / github-data / pull_requests
Last commit: 0451f10a42 by Thomas, "Add GitHub data: filename sanitization (#640)", 2025-07-23 13:31:53 +02:00
1 - Offload Bitnet token embeddings to the GPU.md
2 - Offload Bitnet token embeddings to the GPU - the right way.md
3 - Merge mainline llama.cpp.md
4 - Simdify and multi-thread tanh.md
5 - Fusing a mat mul op followed by a scale op on the CPU.md
6 - IQ4_K_ SOTA 4-bit quantization.md
7 - Adding IQ2_K_ IQ3_K and IQ5_K.md
9 - Fused soft cap and SIMD-ified GeLU.md
10 - iq4_k_ speedup quantization by a factor of _2.md
11 - Faster iq3_k and iq5_k quantization.md
12 - q2_K_ allow it to detect ternary nets and quantize accordingly.md
13 - Adding IQ2_TN for use with ternary models.md
14 - Adding IQ6_K.md
16 - Fix Makefile.md
17 - Merge mainline - Aug 12 2024.md
19 - Skip barriers of noops.md
20 - iq2_k_ slightly better bpw - accuracy compromise.md
21 - quantize_stats_ print rmse and max error as fraction of _x_.md
22 - AVX2 quantization for Q8_K.md
23 - iq4_k tweak.md
24 - softcap_ minor improvement.md
27 - Faster Gemma2.md
28 - Binary KQ mask.md
31 - Fix build when iqk_mul_mat is disabled.md
32 - Zen4 Flash Attention.md
33 - Do not process prompts containing binary data for escapes.md
35 - Fix Zen4 Flash Attention.md
36 - Zen4 Flash Attnetion 2.md
37 - Performance improvements for legacy quants on ARM_NEON.md
38 - Zen4 Flash Attention - bf16 support.md
39 - Add support for bf16 to iqk_mul_mat.md
40 - Adding bf16 support to CUDA.md
41 - iqk_mul_mat_ARM_NEON_ adding bf16 support.md
42 - Adding fused rms_norm.md
43 - iq2_tn_ slightly faster PP on Zen4.md
44 - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models.md
45 - Add CUDA support for IQ1_TN.md
46 - IQ1_TN Metal implementation.md
47 - iq2_tn_ slightly better performance on AVX2.md
48 - AVX2 Flash Attention.md
49 - ARM_NEON Flash Attention.md
50 - AVX2 Flash Attention 2.md
51 - Quantized Flash Attention for all supported CPU platforms.md
52 - Fix bug and D _ 128 case for Q8_0 k-cache.md
53 - Quantization mixes tweaks.md
54 - Improve Q4_0 and Q8_0 performance on AVX2_Zen4.md
55 - Improve Q5_0 performance on AVX2.md
56 - BF16 support on Metal.md
57 - AVX2_Zen4 horizontal sums.md
58 - Fix compiler warnings.md
61 - Adding ability to have meta data per tensor row.md
62 - Use fp32 for K_Q in Metal FA implementation.md
64 - Better sub-3-bit quantization mixes with a qkv tensor.md
65 - Adding SWIGLU unary op.md
66 - CUDA non-contiguous RoPE.md
68 - It is time to fix replace_all.md
69 - Allow bf16 kv-cache.md
70 - Fused unary_x_y.md
71 - iqk_mul_mat_ better srategy when nrc_y not divisible by ny.md
72 - iqk_mul_mat_ better iq4_nl implementation on Zen4_AVX2.md
73 - CUDA_ faster float -_ iq4_nl conversion.md
74 - IQ4_NL kv-cache on the CPU _Zen4_AVX2_ARM_NEON_.md
75 - Fix Q5_0 flash attention.md
76 - iq4_nl_ faster quantization.md
77 - Adding Q6_0.md
78 - q6_0_ Slightly faster Zen4_AVX2.md
79 - Do not quantize activations if not necessary.md
80 - Move to c_17 projectwide.md
81 - Cleanup scale fudge factors.md
83 - New SOTA quantization_ 4.25 bpw IQ4_KS.md
84 - Better model info.md
85 - IQ2_KS_ 2.1875 bpw non-linear quantization.md
86 - Fix and optimize iq2k Metal implementation.md
87 - iq3_k_ fix and optimize Metal dot product.md
89 - Adding IQ4_KSS_ 4.0 bpw quants.md
90 - iq4_ks_ faster dot product on Metal.md
91 - CLI - Specify GGML_TYPE to quantize for the main tensors..md
93 - Attempt to blindly fix Windows build failure.md
94 - Adding _agray3_s graph caching approach.md
96 - Quant strategies_ attn_q Q4 _ attn_v Q6 for Llama 3.1 Q5_K_S.md
97 - Bitnet_ make the scale tensors optional.md
98 - Avoid rebuild of GGML graph for each token.md
99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md
101 - Enable q6_0 in flash attention.md
102 - Add support for Granite and GraniteMoE models.md
105 - Fix quantized k-cache without FA.md
106 - Bitnet changes.md
107 - Faster IQ1_BN Metal implementation.md
108 - Another Bitnet performance improvement on Metal.md
109 - Bitnet CUDA improvements.md
110 - Bitnet_ use the fused mul-silu in the FFN network.md
111 - Use fused mul - unary op also for MoE models.md
112 - Faster MoE inference.md
113 - Trellis quantization.md
114 - MMQ Kernel for Q6_0 _pretty please_.md
115 - MMQ for Q6_0.md
116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K_Q5_K.md
117 - Some minor quant strategies tweaks.md
118 - IQ4_NL_X4.md
119 - Q4_0_R4.md
120 - Q8_0_R4.md
121 - Q5_0_R4.md
122 - Q6_0_R4.md
123 - IQ4_XS_R4.md
124 - iq2_bn_r4_ fastest Bitnet CPU implementation on the planet.md
125 - R4 improvements on ARM_NEON.md
126 - Rename iq4_nl_x4 to iq4_nl_r4.md
127 - Q4_0_R4 on CUDA.md
128 - Faster IQ4_XS_R4 on Zen4.md
129 - Q4_K_R4.md
130 - Q6_K_R4.md
131 - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4.md
132 - Q5_K_R4.md
134 - Q3_K_R4.md
135 - Better ARM_NEON implementation for R4 quants.md
136 - Q2_K_R4.md
137 - Fix AVX2 implementation of iq4_nl_r4.md
138 - IQ4_K_R4.md
139 - Faster R4 quants on Zen4.md
141 - Q8_K_R8_ Fastest quantized matrix multiplications.md
142 - BF16_R16 - 16 interleaved bf16 rows.md
143 - Slightly faster IQ4_XS_R4 on AVX2.md
144 - Slightly faster IQ4_K_R4 on AVX2_Zen4.md
145 - IQ3_K_R4.md
146 - IQ2_K_R4.md
147 - Be able to repack tensors at run time.md
148 - Slightly better matrix x vector on Zen4_AVX2 for iq2_k_r4_ iq3_k_r4_ iq.md
149 - IQ5_K_R4.md
150 - IQ4_KS_R4.md
151 - fix typo.md
152 - IQ3_XXS_R4.md
153 - IQ3_XXS_R4.md
154 - IQ2_XXS_R4.md
155 - IQ2_XS_R4.md
156 - IQ2_S_R4.md
157 - R4 i-quants improvements.md
158 - Faster R4 legacy quants.md
161 - MSVC fixes.md
162 - IQ3_S_R4.md
163 - q4_0_r4_ Use AVX2 version for matrix x vector.md
168 - Falcon3 changes.md
169 - Be able to re-quantize MS BitNet I2_S models.md
170 - MoE fix for R4 quants.md
171 - Fix lower FA performance for even batch sizes.md
172 - CPU Flash Attention improvements.md
173 - More Flash Attention improvements.md
174 - On Zen4 repack fp16 models to bf16_r16.md
175 - Better BF16 support on AVX2.md
176 - Deepseek V3 support added.md
177 - Update chat templates.md
178 - Interleave 8 rows _Q8_0_ IQ4_XS_.md
179 - Minor performance improvements.md
180 - Deepseek MLA Optimizations.md
181 - Various.md
182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2_Zen4.md
184 - Deepseek-Lite.md
185 - IQ1_S_R4_ better 1.5 bpw quants.md
186 - iq1_s_r4_ slightly faster NEON gemm_gemv.md
187 - IQ1_M_R4_ better 1.75 bpw quants.md
188 - Add optional MLA.md
189 - Rename q4_0_r4_ q8_0_r4 and iq4_xs_r4 to _r8.md
190 - cuda_ non-contiguous rms norm.md
191 - Add additional checks for iq1_s_r4 quantization.md
192 - Revert _79.md
193 - RPC sync.md
194 - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications.md
195 - Deepseek MLA Optimizations V2.md
197 - FA_ Add option to build all FA kernels.md
198 - Load all MoE experts during warmup and make warmup 1 token.md
200 - DeepSeek FA support _CPU only_.md
202 - Fix imatrix overprotectiveness.md
204 - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support.md
205 - Faster MLA prompt processing.md
206 - MLA_ allow Q8_0 K-cache for MLA.md
207 - Faster CPU TG for GQA models.md
208 - Q8_KV_ 8-bit quantization type targeting the KV cache.md
210 - Repack also experts.md
212 - Optimized GEMM_GEMV for IQ1_S.md
213 - Fix NEON gemm_gemv for legacy quants when row size is not divisible by .md
215 - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512.md
216 - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD.md
218 - Better strategy for attention matrix multiplications when generating to.md
219 - Fuse MoE up and gate matrix multiplications.md
220 - Fix _217.md
225 - Examples _ Add new sweep-bench benchmark.md
226 - Fix compilation error with IQK_FA_ALL_QUANTS enabled.md
229 - Fused MoE ffn_up and ffn_gate.md
231 - Fix _230.md
232 - Give the user the option to override where model weights are stored.md
233 - Slightly faster CUDA MLA.md
234 - Faster MLA on CUDA.md
235 - Option to use MLA without a transposed cache.md
236 - Feat_lock free server.md
237 - Reduce size of compute buffers.md
238 - A better way to measure the cost of ggml_barrier.md
239 - SER - Smart Expert Reduction.md
240 - Flash MLA _CPU only_.md
241 - DeepSeek CUDA Flash Attention.md
243 - Better FlashMLA.md
244 - Custom quantization rules with regular expressions.md
246 - Faster FlashMLA prompt processing.md
247 - FlashMLA on CUDA.md
248 - Faster MoE token generation on CUDA.md
250 - DeepSeek imatrix stuff.md
251 - Try using fp32 for FlashMLA.md
252 - MLA-2_ Allow usage of q8_0 for KV cache on CUDA.md
253 - FlashMLA-2 _CPU_ faster and smaller compute buffer size.md
259 - Prepare wk_b tensors of DeepSeek models on the fly.md
260 - FlashMLA-2_ reduce compute buffer size _CUDA and CPU_.md
261 - Compile time option to use bf16 for quants without MMQ kernels.md
262 - Fix _261.md
264 - Make Q8_0 KV cache work with FlasMLA-2 on CUDA.md
265 - Allow q8_0 cache on the CPU for FlashMLA-2.md
268 - Prevent FlashMLA-1 from running on CUDA.md
269 - Fix ggml_compute_forward_dup_q.md
270 - Honor mmap setting when using tensor overrides.md
272 - Convert models to row-interleaved quants using the quantize tool.md
273 - FlashMLA-3_ the best of both worlds _CPU only_.md
274 - Specify tensor name regex for tensors to be repacked.md
275 - Fix bug_ missing parentheses in logical expression.md
276 - Add Gemma3 support _text only_.md
277 - Attempt to improve FlashMLA on the CPU.md
278 - Test transparent huge pages on Linux.md
279 - Fighting with cmake.md
280 - Native build ooption for CUDA when GGML_NATIVE is set.md
282 - Improve DeepSeek batched processing speed.md
283 - CUDA_ better MoE implementation.md
284 - llama-bench_ enable having different number of threads for tg and pp.md
287 - Is this better for DeepSeek-R1_.md
289 - Update sweep bench _depracating .jsonl support_.md
290 - mmap backed KV cache.md
291 - Disable Zen4 optimizations for Q8_0_Q8_0_R8.md
292 - Use bf16 instead of fp16 block scales for q8_1.md
294 - Make sure tensor row size is multiple of block size also when quantizin.md
295 - Quantization improvements.md
298 - Update gguf-py constants.md
299 - Additional guards for interleaved quants.md
301 - Fix _300.md
302 - Quantization improvements _2_.md
303 - Fix ARM_NEON build failure due to q8_2.md
307 - Metal_ much faster MoE prompt processing.md
309 - Fix GCC compilation errors on ARM.md
310 - Metal_ FA and FlashMLA.md
311 - Add -flax-vector-conversions for GCC on ARM.md
312 - Improved IQ2_XS quantization.md
313 - We need to synchronize before using device to host async memcpy.md
315 - Try not repacking q8_0 for FA computations.md
317 - Add copyright notices.md
318 - Use links for ggml_llama.cpp authors.md
320 - Guard against attempts to use MLA for non-MLA models.md
321 - LlaMA-4 support _text only_.md
324 - Correct L4 rms_norm.md
325 - Fix KLD precision.md
326 - WIP Compute per layer LIM Scores during imatrix.md
327 - Improved IQ1_M quantization.md
328 - imatrix_ collect layer influence statistics.md
329 - Add ability to hide imatrix details in llama-quantize.md
330 - Allow q8_0 KV cache for head size 256.md
331 - Better gemm_gemv on AVX2 fr q4_0_r8.md
332 - Better TG performance for GQA models _CPU_.md
333 - Support GLM-4-0414 models based on piDack_s mainline PR.md
336 - Fix termux_android build.md
337 - Add support for bitnet2b_2501 model.md
338 - BitNet adjustments.md
341 - Add support for Cohere2.md
342 - Fix LLaMA-4 attention.md
343 - cuda_ use switch in constexpr funcs.md
344 - Add GLM-4-0414 Model Support.md
346 - Fix FA on ARM CPUs.md
347 - Add ability to manually set arch flags.md
348 - Fix q4_1 and q5_1 on Arm.md
349 - Fix division by zero bug.md
351 - CPU FA improvements.md
352 - Update README.md.md
355 - Apply Qwen3 PR from llama.cpp.md
356 - Add missing enum values for qwen3 and qwen3moe.md
360 - Fix IQK_FA_ALL_QUANTS on AVX2.md
364 - Fix FA bug on AVX2.md
366 - Add support for new Bitnet model architecture name.md
368 - Trying to fix iq1_s_r4_iq1_m_r4 quantization failure.md
369 - cmake_ force MSVC compiler charset to utf-8.md
370 - CUDA_ faster FA TG for GQA models.md
371 - Another attempt to fix _367.md
374 - CUDA_ MMQ for IQ4_KS.md
375 - Add batch warmup to sweep-bench.md
377 - Support for Llama-3-Nemotron models.md
382 - Fix DeepSeek FA.md
386 - FlashMLA-3 for DeepSeek models on CUDA.md
390 - Fix build for Xeon Gold 6226R.md
391 - Fix DeepSeek q8_0 cache.md
392 - fix some MSVC build problem..md
394 - Handle incompatible DeepSeek GGUFs.md
400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache.md
402 - Fix missing rope_freqs with convert_hf_to_gguf.md
404 - TG improvements for MoE models.md
405 - GPU offload policy.md
406 - Fix race in the CUDA DeepSeek FA kernel.md
408 - Faster DeepSeek FA on CUDA.md
409 - Enable faster prompt processing with mainline llama.cpp GGUFs.md
410 - Better CPU FA performance for DeepSeek-Lite.md
411 - Fix imatrix calculation for MLA models.md
413 - Fix new CUDA FA on Touring.md
415 - Fix SER _CPU_.md
416 - Fix SER _CUDA_.md
417 - CUDA_ quantized GEMM for for IQ4_K_ IQ5_K_ IQ6_K.md
418 - CUDA_ quantized GEMM for for IQ2_KS_ IQ2_K_ IQ3_K.md
421 - Fix standard attention on the CPU.md
422 - Adding IQ5_KS - 5.25 bpw quants.md
424 - Adding forgotten template instance for iq5_ks.md
426 - IQ5_KS_R4_ row-interleaved IQ5_KS.md
427 - Fix AVX2 implementation of IQ4_K_ IQ4_KS_ IQ5_K_ IQ6_K.md
428 - Zen4_ Faster PP for IQ2_KS_ IQ4_KS_ IQ5_KS.md
429 - Option to enable or disable the CPU FA kernels.md
430 - Disable multi-add for now.md
431 - Forgotten MMQ ref and typo.md
435 - Refactor iqk_mul_mat.cpp.md
438 - Another attempt to fix the illegal memory access bug.md
439 - Bug fixes from mainline.md
441 - Trellis quants with CPU inference.md
442 - CUDA call tracer.md
443 - Streamline a bit the quant strategies.md
444 - gguf-split _ update.md
445 - Fix typo in non-AVX2 code branch.md
446 - Fix bug in MMVQ kernel.md
448 - Fix MSVC compilation.md
449 - Legacy quants conversion schemes in convert_hf_to_gguf.py.md
453 - Faster IQ3_KT and IQ4_KT.md
454 - Add support for FP8 GGUF creation and re-quantization _WIP_.md
457 - Remove GGML_IQK_MUL_MAT option.md
458 - Add missing gguf-py constants.md
460 - aarch64 kernels for KT quants.md
461 - CUDA implementation for IQ2_K_R4_ IQ3_K_R4_ IQ4_K_R4_ IQ5_K_R4.md
462 - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4.md
465 - Set cache_prompt default to true.md
468 - Minor _2_ iq2_ks TG performance improvement on CUDA.md
469 - Replace MLA-specific KV cache with the standard KV cache.md
470 - Send _DONE_ for OAI compatibility.md
471 - NEON implementation for trellis quants.md
473 - Replace MLA-specific KV cache with the standard KV cache V2.md
475 - Metal implementatio for the trellis quants..md
478 - forgotten refs and typo.md
480 - Rpc improvement.md
481 - Webui improvement.md
482 - Trellis quants_ faster CPU prompt processing.md
483 - convert_hf_to_gguf.py _ conversion from hf weights to Q6_0.md
484 - BF16 Trellis implementation.md
486 - Adding the XTC sampler.md
487 - Make sure MMVQ is supported before using it.md
488 - Faster CPU prompt processing for Trellis quants and MoE models.md
489 - Adding top-n-sigma sampler.md
492 - CUDA implementation for IQ1_S_R4.md
493 - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4.md
494 - IQ1_M_R4 CUDA implementation.md
495 - Check if ffn_up and ffn_gate are of the same type before using fmoe.md
496 - Quick hack_ add the MLA flag to llama_hparams.md
497 - Make prompt cache saving and restoring MLA aware.md
501 - Fix _499.md
502 - Add an endpoint that lists all the saved prompt caches to server.md
504 - Add DRY and fix the server to use other new samplers..md
505 - New IQ4_KT trellis implementation.md
506 - Fix non rpc build error.md
508 - Fix Compile error _C2668_.md
509 - Docs update.md
510 - Update News section of readme.md
511 - New IQ2_KT.md
512 - Add top n sigma sampler in webui and other webui fix.md
513 - add dry sampler.md
515 - IQ2_XXS_ much faster CPU prompt processing.md
516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 _AVX2_.md
517 - IQ1_S_ much faster CPU prompt processing.md
518 - IQ3_S_ much faster CPU prompt processing.md
520 - Better strategy for GPU offload.md
524 - Perhaps a slightly better GEMV version for IQ2_XXS_ IQ3_XXS_ IQ3_S.md
525 - Faster CPU prompt processing for Q4_K and Q5_K.md
528 - Fix bug introduced in _524_525.md
529 - New IQ2_KT_ IQ3_KT and IQ4_KT_ V2.md
531 - Much faster CPU prompt processing _part 1_.md
533 - Much faster CPU prompt processing _part 2_.md
534 - Much faster CPU prompt processing _part 3_.md
535 - Minor readme update.md
536 - Fix KT Neon _ ARM typo.md
537 - Update CMakeLists.txt to fix NDEBUG handling.md
540 - Fix missed block_q8_x2 bf16 -_ i16 change.md
541 - Perhaps slightly faster trellis quants.md
542 - Fix NEON build.md
544 - New integer trellis on ARM_NEON.md
546 - Faster ARM_NEON GEMM implementation for legacy quants.md
547 - build_ add script to simplify build_test workflow for Android.md
549 - Much faster prompt processing for IQK quants _ARM_NEON_.md
550 - Much faster prompt processing for I-quants _ARM_NEON_.md
552 - Much faster prompt processing for k-quants _ARM_NEON_.md
553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md
554 - Update README.md to add quickstart section.md
555 - Add Falcon-Edge support.md
557 - CUDA_ MMQ for iqX_r4 quants.md
558 - Add mikupad to ik_llama as an alternative WebUI.md
559 - Use cuBLAS for large batches and quants with block size 16.md
560 - Remove what appears to be unnecessary asserts in ggml_cuda_cpy.md
563 - Merge vulkan code from mainline up to commit of 6_28_2025.md
565 - add hunyuan moe support for 561.md
566 - Adding IQ3_KS quants.md
567 - Minor CUDA PP speed improvement.md
569 - Conditionally disable fused ops when building with Vulkan enabled.md
570 - Remove duplicate_misplaced cmake find_package for Vulkan.md
571 - Fix CMakeLists.md
573 - Support for dots.llm1 models.md
574 - Change KQ mask padding to 64.md
577 - Vulkan_ fused rms norm.md
578 - Do not crash when there is no DRY sampler.md
579 - Fix debug build failure with RPC off.md
580 - Vulkan_ add GGML_OP_FUSED_MUL_UNARY.md
581 - Vulkan_ Disable multi-add for now.md
582 - Vulkan_ adding GGML_OP_MULTI_ADD implementation.md
583 - Adding forgotten file.md
584 - Vulkan_ flash attention for DeepSeek models.md
585 - Special handling of Seed Coder FIM tokens.md
587 - Fix crash when there is no DRY sampler.md
588 - Fix server crash when there is no DRY sampler.md
589 - CUDA_ small PP performance improvement for MoE models.md
592 - Another minor readme update.md
593 - Faster prompt processing for IQ2_KS_ IQ2_K_ IQ2_K_R4.md
595 - CUDA_ Faster prompt processing for several quantization types.md
598 - Vulkan_ iquants and flash attention split_k_reduce improvement.md
602 - Adding IQ2_KL.md
603 - Check if MMQ should be used before using it.md
604 - Fix attn_v conditionality when quantizing..md
606 - Add iq3_ks to constants.py.md
607 - vulkan_ support softmax_FA batch and broadcast.md
608 - Vulkan_ a fresh start.md
609 - Added kimi-k2 support _ported from llama.cpp_.md
610 - q8_k_r8_ experimental AVX512 version.md
611 - Bump GGML_MAX_CONTEXTS to allow loading more shards.md
612 - kimi-k2 convert script and chat template.md
616 - Adding IQ1_KT - 1.75 bpw SOTA quants.md
617 - Fixup kimi-k2 convert indentation.md
618 - Webui_ New Features for Conversations_ Settings_ and Chat Messages.md
620 - Bump Windows max open files from 512 to 2048.md
622 - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt.md
624 - Quantization tweaks.md
628 - _Draft_ Function calling support for Kimi-K2.md
630 - GEMM for IQ1_M.md