* Fix "changes meaning" compiler warnings
* A couple more warnings and formatting fixes
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mainline-compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
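A minimal sketch of the parse-time effect, assuming a common_params-style struct; the params_sketch type and the graph_reuse field are illustrative names, not necessarily what the patch uses:

    #include <string>

    struct params_sketch { bool graph_reuse = false; }; // assumed field name

    // hypothetical parsing sketch, not the actual patch
    static void parse_args(int argc, char ** argv, params_sketch & params) {
        for (int i = 1; i < argc; ++i) {
            const std::string arg = argv[i];
            if (arg == "-gr") {
                params.graph_reuse = true; // reuse the compute graph across decode calls
            }
        }
    }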
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix q5_0_r4
The issue was in the tail part. Since almost all models have tensor
row sizes that are a multiple of 128, that path was never triggered in
testing. But the gpt-oss models have an embedding size of 2880, so we end
up there and hit the bug (a sketch of the loop split follows this list of fixes).
* Fix q6_0_r4
Same fix as q5_0_r4
* Fix q4_0_r8
* Fix q5_0_r4 and q6_0_r4 also on Zen4
* Fix q4_0_r8 also on Zen4
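For context, a schematic of the loop structure these fixes touched; process_block_128 and process_tail are hypothetical names used only to show where the rarely-exercised path sits:

    // hypothetical helpers, for illustration only
    static void process_block_128(const float * x);
    static void process_tail(const float * x, int n);

    // schematic of the main-loop/tail split: rows are consumed 128 values at
    // a time, and whatever is left over goes through the tail path. With the
    // gpt-oss embedding size of 2880 = 22*128 + 64, the tail path runs.
    static void process_row(const float * x, int n) {
        int i = 0;
        for (; i + 128 <= n; i += 128) {
            process_block_128(x + i); // fast path, hit by almost every model
        }
        if (i < n) {
            process_tail(x + i, n - i); // tail path, where the bug was
        }
    }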
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This way, more recent users who haven't followed the history of FlashMLA
evolution, and hence don't know about the MLA options, get the best setting
without having to add -mla 3 on the command line.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use new-new-mma also for MLA=3, and use mask bounds
This gives us ~25% better PP at 32k tokens compared to main
* This seems better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_tokens == 1
Combined effect: about +2% in TG performance with full GPU offload
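A minimal sketch of the ggml_cont() part, assuming the standard ggml API; for n_tokens == 1 the view in question is typically already contiguous, so the copy can be skipped:

    #include "ggml.h"

    // sketch, not the actual patch: only schedule a copy when the tensor
    // really is non-contiguous
    static struct ggml_tensor * maybe_cont(struct ggml_context * ctx, struct ggml_tensor * t) {
        return ggml_is_contiguous(t) ? t : ggml_cont(ctx, t);
    }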
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add command line argument for draft model
* Remove second context of draft model
* Format print
* Print usage if parsing -draft fails
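A sketch of the failure handling; print_usage and model_draft are assumed names, and only the -draft flag itself is taken from the change:

    #include <cstdlib>
    #include <string>

    static void print_usage(int argc, char ** argv); // assumed helper

    // hypothetical sketch of the -draft handling
    static bool parse_draft_arg(int & i, int argc, char ** argv, std::string & model_draft) {
        if (std::string(argv[i]) != "-draft") {
            return false;
        }
        if (++i >= argc) {           // -draft given without a model path
            print_usage(argc, argv); // show usage, then bail
            std::exit(1);
        }
        model_draft = argv[i];       // path to the draft model
        return true;
    }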
---------
Co-authored-by: firecoperana <firecoperana>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
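Roughly, the change gates the code along these lines (a sketch; the exact condition used in the tree may differ):

    // enable the IQK path and pull in NEON intrinsics on ARM builds
    #if defined(__aarch64__) && defined(__ARM_NEON)
    #define IQK_IMPLEMENT
    #include <arm_neon.h> // float32x4_t, vld1q_f32, FP16 intrinsics
    #endif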
Build requirements for ARM:
    cmake .. -DGGML_CUDA=ON \
        -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
        -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
* server: fix crash when prompt has image and is too long
* server: fix CORS
* server: fix empty result for embedding
* change error message to truncate prompt
* server: fix slot id for save and load state
* bug fix
* server: update slot similarity to handle mtmd
* server: quick hack to calculate the number of tokens processed with an image
* server: fix out-of-range error when detokenizing the prompt in verbose mode
* Add back Access-Control-Allow-Origin (see the CORS sketch after this list)
* server: add prompt tokens to embedding results
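For the CORS items, a minimal sketch assuming the cpp-httplib server the examples build on; the exact header values are illustrative, not the exact patch:

    #include "httplib.h" // cpp-httplib

    // sketch, not the exact patch: echo the Origin back so browser
    // requests from the webui are allowed
    static void set_cors_headers(const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin",  req.get_header_value("Origin"));
        res.set_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        res.set_header("Access-Control-Allow-Headers", "Content-Type, Authorization");
    }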
---------
Co-authored-by: firecoperana <firecoperana>
* Use mmq_id in mul_mat_id
* Better
* Also use it in the fused up+gate op
* Better -no-fmoe TG on CUDA
Still much slower than -fmoe, but about 20-25% faster than what
we had before.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bug fixes for completions and prompt caching in server
* Fix compiler warning about redefinition
---------
Co-authored-by: firecoperana <firecoperana>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives slightly better TG performance.
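A sketch of the idea, assuming standard ggml view semantics; names and shapes are illustrative, not the actual graph code:

    #include "ggml.h"

    // with the Q and K weights merged into one tensor, a single mul mat
    // produces both, and Q/K are then recovered as views of the result
    static void qk_single_mul_mat(struct ggml_context * ctx,
            struct ggml_tensor * wqk,  // merged Q+K weight (assumed layout)
            struct ggml_tensor * cur,  // activations [n_embd, n_tokens]
            int64_t n_embd_q,          // number of rows belonging to Q
            struct ggml_tensor ** q, struct ggml_tensor ** k) {
        struct ggml_tensor * qk = ggml_mul_mat(ctx, wqk, cur); // [n_embd_q + n_embd_k, n_tokens]
        const int64_t n_tokens = qk->ne[1];
        *q = ggml_view_2d(ctx, qk, n_embd_q, n_tokens, qk->nb[1], 0);
        *k = ggml_view_2d(ctx, qk, qk->ne[0] - n_embd_q, n_tokens, qk->nb[1],
                          n_embd_q*ggml_element_size(qk));
    }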
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: add support for vision model
webui: add support for vision model
* server : remove hack for extra parallel slot (#10187)
* llama : fix KV shift for qwen2vl #13870
* add no-context-shift parameter
---------
Co-authored-by: firecoperana <firecoperana>