* POC: merge Q, K, V into a single, contiguous tensor
Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems worth doing in general given the TG
performance gain.
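A minimal sketch of the idea in ggml-style graph code; the
weight name (wqkv) and the split sizes are illustrative, not
the actual ik_llama.cpp fields:

    // one mat-mul against the merged [n_embd, n_q + n_k + n_v] weight,
    // then Q, K, V become views into the contiguous result
    struct ggml_tensor * qkv  = ggml_mul_mat(ctx, layer.wqkv, cur);
    struct ggml_tensor * Qcur = ggml_view_2d(ctx, qkv, n_q, n_tokens,
            qkv->nb[1], 0);
    struct ggml_tensor * Kcur = ggml_view_2d(ctx, qkv, n_k, n_tokens,
            qkv->nb[1], n_q*ggml_element_size(qkv));
    struct ggml_tensor * Vcur = ggml_view_2d(ctx, qkv, n_v, n_tokens,
            qkv->nb[1], (n_q + n_k)*ggml_element_size(qkv));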
* WIP
* merge_qkv: it works for gpt-oss
...but we see a smaller TG gain (~1.5%)
* WIP
* Don't ignore the return value of create_tensors()
Otherwise, when Q, K, V get merged and we are running on the
CPU, we get a crash because the backend tries to use mmap,
which no longer works.
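Roughly what the fix amounts to; the use_mmap field and the
exact signature of create_tensors() are assumptions made for
this sketch:

    // hypothetical: propagate the result instead of dropping it, so the
    // loader stops using mmap when merged tensors must be materialized
    const bool can_mmap = create_tensors(ml, model);
    if (!can_mmap) {
        ml.use_mmap = false;   // fall back to reading/copying tensor data
    }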
* merge_qkv: bias can be absent, optional, or mandatory
* merge_qkv: glm4.5moe
* merge_qkv: add command line argument to enable it
* merge_qkv: fix tensor dimensions
* merge_qkv: llama-4
* merge_qkv: qwen3 (dense)
* merge_qkv: simplify build_qwen3moe
* cohere2 - simplify graph building
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
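The rough shape of the split, with approximate names, just to
illustrate where graph building now lives:

    // per-architecture graph construction sits in a llm_build_context-style
    // class in its own translation unit, so llama.cpp itself no longer has
    // to compile every graph builder
    struct llm_build_context {
        const llama_model   & model;
        const llama_hparams & hparams;
        ggml_context        * ctx0;
        ggml_cgraph * build_qwen3moe();
        ggml_cgraph * build_gpt_oss();
        ggml_cgraph * build_cohere2();
        // ...
    };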
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
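For reference, the gpt-oss GLU variant computes roughly the
following per element; the alpha and limit values come from
the public gpt-oss reference code, so treat them as
assumptions here:

    #include <math.h>
    // scalar reference sketch of swiglu_oai; alpha ~ 1.702, limit ~ 7.0
    static inline float swiglu_oai(float gate, float up, float alpha, float limit) {
        gate = fminf(gate, limit);               // clamp gate from above only
        up   = fmaxf(fminf(up, limit), -limit);  // clamp up to [-limit, limit]
        const float sig = 1.0f/(1.0f + expf(-alpha*gate));
        return gate*sig*(up + 1.0f);             // gate*sigmoid(alpha*gate)*(up + 1)
    }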
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so that handling still needs to be added.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
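What a sink does to the attention math, in a scalar sketch
(illustrative only, not the actual kernel): the per-head sink
logit adds one extra term to the softmax denominator without a
corresponding value row, so it only scales the output down.

    #include <math.h>
    // illustrative: attention for one query over n keys, one head, with a sink
    void attn_with_sink(int n, int head_dim, float sink,
                        const float * s,    // s[j] = scale * dot(q, k_j)
                        const float * v,    // v[j*head_dim + d]
                        float * out) {
        float mmax = sink;
        for (int j = 0; j < n; ++j) mmax = fmaxf(mmax, s[j]);
        float denom = expf(sink - mmax);     // sink term, no value row
        for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
        for (int j = 0; j < n; ++j) {
            const float p = expf(s[j] - mmax);
            denom += p;
            for (int d = 0; d < head_dim; ++d) out[d] += p*v[j*head_dim + d];
        }
        for (int d = 0; d < head_dim; ++d) out[d] /= denom;
    }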
* CUDA: add head size of 64 to new mma
I haven't turned it on yet, but I observe slightly better PP
and slightly worse TG performance with it.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable;
perhaps I'm running into thermal throttling more often because
I'm making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if it is really needed, it is now
missing. Also missing is q6_0.
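For context, the capture/replay pattern behind CUDA graphs
looks like this (standalone illustration, not the actual
backend code; launch_decode_step is a hypothetical stand-in
for enqueueing all kernels of one eval):

    // capture the kernel launches for one decode step once, then replay the
    // instantiated graph on later steps to avoid per-kernel launch overhead
    cudaGraph_t     graph    = nullptr;
    cudaGraphExec_t instance = nullptr;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
    launch_decode_step(stream);                 // hypothetical helper
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);
    for (int step = 0; step < n_steps; ++step) {
        cudaGraphLaunch(instance, stream);
    }
    cudaStreamSynchronize(stream);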
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>