11 Commits

Author SHA1 Message Date
Nexes the Elder
094f76ee86 Cleaner log for adjusted splits (#1494)
* sweep-bench: add more skipped patterns to --minilog

* cleaner log for adjusted splits

* Add totalization for adjusted splits

* Clean up semicolons

* Addition for totalizer ^^

* Change accordingly to review

* Forgotten leftover removed

* 'total' instead of 'totalized'
2026-03-24 07:49:40 +01:00
Nexes the Elder
6c665f38fd sweep-bench: add -minilog argument to reduce verbose logging (#1468)
Purpose:
Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages.

Changes:
- Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp)

Blacklisted patterns (hidden):
- Per-layer device assignments ('Setting default device in layer')
- KV metadata dump header and entries
- Tensor type counts
- Model validation messages
- EOG/special token cache info
- Metadata printout (llm_load_print_meta, print_info)
- Layer sizes table
- Tensor loading info (llm_load_tensors)
- Separator lines
- Most common cases of incomplete/continuation lines are also hidden

All other log output is shown, including:
- GPU VRAM info
- Split/buffer distribution per device
- Graph split estimates
- Final benchmark table and timings
2026-03-20 09:40:56 +01:00
Nexes the Elder
61fad8b094 Print timings in sweep-bench (#1454) 2026-03-18 06:57:00 +01:00
firecoperana
1cb7e1bf39 spec : add self speculative decoding, ngram and refactor (#1261)
* spec : add self speculative decoding and ngram-mod and refactor

common : use common_ prefix for common library function

llama : use LLAMA_TOKEN_NULL

spec : add self speculative decoding (no draft model required) + refactor

spec : add ngram-mod

spec : various improvements ton ngram-map + docs

spec : fix the check-rate logic of ngram-simple

common : add common_speculative_is_compat()

spec : simplify time measurement using common_time_meas

refactor common_sampler_init

refactor common_token_to_piece

refactor and fix cur_p bug

clean up

* spec : remove check rate

* spec: show warnings instead of abort

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
2026-02-13 19:04:55 +01:00
Kawrakow
573e23679d sweep_bench: set number of repetions (#1176) 2026-01-22 12:28:30 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
efcb5f9d9e sweep-bench: be able to set TG tokens via -n (#897)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 14:39:30 +02:00
Kawrakow
633e0617b0 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gmp-oss: common

* gpt-oss: attnetion sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much to high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA thta was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the write place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal trottling imore often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00
Kawrakow
ceb8f513e4 Add batch warmup to sweep-bench (#375)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:50:26 +03:00
saood06
c12a6f8558 Update sweep bench (depracating .jsonl support) (#289)
* Update sweep bench (depracating .jsonl support)

* Fix README.md
2025-03-25 10:14:44 -05:00
saood06
ce1b59f08c Add new sweep-bench benchmark (#225)
* examples : add new sweep-bench benchmark

* Change documentation to reference ik_llama.cpp

* Made it compile with ik_llama

* Fix JSONL output

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-23 00:16:27 -06:00