* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
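The GGML_OP_GLU op added above gates one half of a tensor with an activation of the other half. A minimal CPU reference for the swiglu variant, assuming a split-in-half row layout (the real ggml op also covers other activation variants; this is an illustration of the semantics, not the kernel):

```cpp
#include <cassert>
#include <cmath>

// Sketch of a swiglu-style GLU: the input row holds the gate half followed
// by the value half; the output is silu(gate) * value.
static void glu_swiglu_row(const float * src, float * dst, int n_half) {
    for (int i = 0; i < n_half; ++i) {
        const float g = src[i];          // gate half
        const float v = src[n_half + i]; // value half
        dst[i] = (g / (1.0f + std::exp(-g))) * v; // silu(g) * v
    }
}
```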
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
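The idea behind the skip, sketched naively (the actual kernels do this per-tile in SIMD/CUDA): if every mask value in a KV chunk is -infinity, those positions contribute exp(-inf) = 0 to the attention softmax, so the whole chunk can be skipped.

```cpp
#include <cassert>
#include <cmath>

// Returns true when the entire KV chunk is masked out and its
// flash-attention work can be skipped. Illustrative helper only,
// assuming the mask for the chunk is contiguous in memory.
static bool fa_chunk_all_masked(const float * mask, int chunk_size) {
    for (int i = 0; i < chunk_size; ++i) {
        if (mask[i] != -INFINITY) {
            return false; // at least one visible position
        }
    }
    return true;
}
```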
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus
optimization to do partial sort when sorting the logits.
That mainline PR and friends were partially cherry-picked by #723, but
the result wasn't really in a working state yet.
A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
still a TODO in mainline [1].
Note that there is a TG performance penalty for returning the logprobs, as
we need to sort the logits. With a partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:
This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)
If cherry-picking the full sort from the mainline PR instead:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)
[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
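The partial-sort optimization boils down to this: only the top n_probs logits need to be in sorted order to report logprobs, so an O(n log k) std::partial_sort over the vocabulary can replace the O(n log n) full std::sort. A sketch with illustrative names (not the actual server code):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Return the ids of the k largest logits, in descending order, without
// fully sorting the whole vocabulary.
static std::vector<int> top_k_ids(const std::vector<float> & logits, int k) {
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
            [&logits](int a, int b) { return logits[a] > logits[b]; });
    ids.resize(k); // everything past the k-th entry stays unsorted
    return ids;
}
```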
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing change that was committed by mistake
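For reference, the fusion idea is simply that x·Wq, x·Wk, x·Wv can be computed as one GEMM against the row-wise stack [Wq; Wk; Wv], with the output then viewed as three slices. A naive mat-vec illustrating the equivalence (not llm_build_mul_mat_qkv itself):

```cpp
#include <cassert>
#include <vector>

// y = W * x for a row-major (rows x cols) matrix. Stacking Wq, Wk, Wv into
// one (3*rows x cols) matrix and calling this once yields Q, K, V as
// consecutive slices of y -- the same numbers as three separate GEMMs.
static std::vector<float> matvec(const std::vector<float> & w,
                                 const std::vector<float> & x,
                                 int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            y[r] += w[r*cols + c] * x[c];
        }
    }
    return y;
}
```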
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* handle reasoning content in webui
server : include usage statistics only when user request them (#16052)
server : only attempt to enable thinking if using jinja (#15967)
* config reasoning_content in webui and change default to auto
---------
Co-authored-by: firecoperana <firecoperana>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
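The core of the idea, sketched naively: the router picks top-k experts per token, so for a given batch only the union of selected expert ids needs its weights offloaded, not all n_expert. A hypothetical helper (names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Collect the set of experts actually activated by the router for this
// batch; only these need their weights moved to the accelerator.
// Out-of-bounds ids are skipped here; the real code logs their details.
static std::vector<int> activated_experts(const std::vector<int> & selected,
                                          int n_expert) {
    std::vector<bool> used(n_expert, false);
    for (int id : selected) {
        if (id >= 0 && id < n_expert) {
            used[id] = true;
        }
    }
    std::vector<int> out;
    for (int e = 0; e < n_expert; ++e) {
        if (used[e]) out.push_back(e);
    }
    return out;
}
```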
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Need also this
* f32 vec kernel
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
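Schematically, the fused path computes one FFN row's up and gate dot products together and applies the unary immediately, instead of materializing separate up(x) and gate(x) tensors. A naive sketch assuming the SiLU unary (illustration only, not the CPU/CUDA kernels):

```cpp
#include <cassert>
#include <cmath>

// One output element of the fused FFN: the up and gate projections share
// the loop over the input, and silu(gate) * up is produced directly,
// avoiding two intermediate tensors.
static float ffn_row_fused(const float * w_up, const float * w_gate,
                           const float * x, int n_embd) {
    float up = 0.0f, gate = 0.0f;
    for (int c = 0; c < n_embd; ++c) {
        up   += w_up[c]   * x[c];
        gate += w_gate[c] * x[c];
    }
    return (gate / (1.0f + std::exp(-gate))) * up; // silu(gate) * up
}
```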
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Skip the row id computation for the ffn_down op
Sadly, almost negligible performance gain.
* Also this doesn't do much
* Also this barely moves the needle
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
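The validation boils down to scanning dequantized blocks for NaNs and reporting where they occur (which tensor, and for MoE tensors which expert). A trivial sketch of the scan itself, not the per-quantization-type code paths:

```cpp
#include <cassert>
#include <cmath>

// Scan a dequantized buffer for NaNs; returns the index of the first NaN,
// or -1 if the data is clean. The real check runs while the model loads
// and reports the offending tensor/expert.
static long first_nan(const float * data, long n) {
    for (long i = 0; i < n; ++i) {
        if (std::isnan(data[i])) {
            return i;
        }
    }
    return -1;
}
```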
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Fixes a compile error on Mac with clang 17.
Simple fix: add the string header in src/llama-impl.h
Co-authored-by: Mohan Krishnan <mohan.krishnan@grab.com>
* mikupad.html in ik_llama.cpp (functional but WIP)
* Remove hardcoded extension and add error handling to extension loading
* Update version number and add features array to version
* Make version endpoint always accessible
* Fix case with empty sql
* Add useful error message when launched without sql file
* Add sigma sampler
* Update sigma step and max based on docs
* Remove selectedSessionId and handle it with URL fragment
* Export All (code only, no UI)
* Add compression to server.cpp
* Major UI work (and also add and update backend endpoints to accommodate it)
* Finalize UI
* Fix visual bug
* fix merge conflict issue
* Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files
* Make compression not show in sidebar if extension is not loaded
* Finalize build: put support behind the LLAMA_SERVER_SQLITE3 build option, and update the error message to also cover the case where the build option is not passed
* Fix compile without the flag on systems where sqlite3 is not installed