* Add mainline compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix q5_0_r4
The issue waqs in the tail part. As almost all models have tensor
rows that are multiple of 128, that part was never triggered in testing.
But ithe gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.
* Fix q6_0_r4
Same fix as q5_0_r4
* Fix q4_0_r8
* Fix q5_0_r4 and q6_0_r4 also on Zen4
* Fix q4_0_r8 also on Zen4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use new-new-mma also for MLA=3, and use mask bounds
This gives us ~25% better PP at 32k tokens compared to main
* This seems better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1
Combined effect: about +2% in TG performance with full GPU offload
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add command line argument for draft model
* Remove second context of draft model
* Format print
* print usage if parsing -draft fails
---------
Co-authored-by: firecoperana <firecoperana>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
Build requirements for ARM:
cmake .. -DGGML_CUDA=ON \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
* server: fix crash when prompt has image and is too long
* server: fix CORS
* server: fix empty result for embedding
* change error message to truncate prompt
* server: fix slot id for save and load state
* bug fix
* server: update slot similarity to handle mtmd
* server: quick hack to calculate number of token processed with image
* server: fix out of range error when detokenizing prompt under verbose
* Add back Access-Control-Allow-Origin
* Server: Add prompt tokens in embedding results
---------
Co-authored-by: firecoperana <firecoperana>
* Use mmq_id in mul_mat_id
* Better
* Also use it in the fused up+gate op
* Better -no-fmoe TG on CUDA
Still much slower than -fmoe, but abot 20-25% faster than what
we had before.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bug fixes for completions and prompt caching in server
* Fix compiler warning about redefinition
---------
Co-authored-by: firecoperana <firecoperana>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives a slightly bbetter TG performance.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: add support for vision model
webui: add support for vision model
* server : remove hack for extra parallel slot#10187
* llama : fix KV shift for qwen2vl #13870
* add no-context-shift parameter
---------
Co-authored-by: firecoperana <firecoperana>
* Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?
This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
* cuda: neox works
* WIP
* rope_cache: norm works
* Fused rope+rope
* Fused rope+rope (norm)
* Fused rms+rms+rope+rope (neox) - not working
* WIP
* Also qwen3
* Add command line arg to disable rope cache
* Disable RoPE cache if rope type is not neox or norm
* Add missing break after merge with main
* Fused fused_rms+fused_rms+rope+rope (with -mqkv)
* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>