ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-07 04:20:03 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	ca5cff8677	merge_qkv: glm4.5moe	2025-10-29 13:57:55 +02:00
Iwan Kawrakow	18765a4907	merge_qkv: bias can be required, optional, or mandatory	2025-10-29 13:57:55 +02:00
Iwan Kawrakow	2b3af4addc	WIP	2025-10-29 13:57:55 +02:00
Iwan Kawrakow	c699846aa6	merge_qkv: it works for gpt-oss ...but we see a smaller TG gain (~1.5%)	2025-10-29 13:57:55 +02:00
Iwan Kawrakow	446b4a4da3	WIP	2025-10-29 13:57:55 +02:00
Iwan Kawrakow	d73914c70b	POC: merge Q, K, V into a single, contiguous tensor Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems it makes sense to do it in general given the TG performance gain.	2025-10-29 13:57:55 +02:00
Kawrakow	9d364b88ba	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
Kawrakow	8d0d01a593	gpt-oss: duplicate experts biases when necessary (#829) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-14 14:38:40 +03:00
Kawrakow	0030bc89c9	Fix performance regression introduced in #823 (#826) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-13 08:09:55 +03:00
Kawrakow	0ad1d34090	Enable and clean up compiler warnings in src (#824) * WIP: enable and clean up warnings in src * All warnings handled --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 16:01:13 +03:00
Kawrakow	335a1f9b71	Refactor file llama.cpp (#823) * llama_model and llama_hparams * llama_build_context Surprisingly small reduction in llama.cpp compile time given the reduction in LOCs (22k -> 14k) * LLM_TN llama.cpp compilation: 50 s -> 33 s * llama_quantize * arch names * All graph building is now in llm-build-context.cpp * hparams loading llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile. * We are now at 6 seconds to build the src folder * load -> create We are not actually loading the tensors, but just creating them. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 11:35:20 +03:00

11 Commits