ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-20 14:39:45 +00:00

Author	SHA1	Message	Date
Nexes the Elder	094f76ee86	Cleaner log for adjusted splits (#1494) * sweep-bench: add more skipped patterns to --minilog * cleaner log for adjusted splits * Add totalization for adjusted splits * Clean up semicolons * Addition for totalizer ^^ * Change accordingly to review * Forgotten leftover removed * 'total' instead of 'totalized'	2026-03-24 07:49:40 +01:00
Nexes the Elder	6c665f38fd	sweep-bench: add -minilog argument to reduce verbose logging (#1468) Purpose: Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages. Changes: - Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp) Blacklisted patterns (hidden): - Per-layer device assignments ('Setting default device in layer') - KV metadata dump header and entries - Tensor type counts - Model validation messages - EOG/special token cache info - Metadata printout (llm_load_print_meta, print_info) - Layer sizes table - Tensor loading info (llm_load_tensors) - Separator lines - Most common cases of incomplete/continuation lines are also hidden All other log output is shown, including: - GPU VRAM info - Split/buffer distribution per device - Graph split estimates - Final benchmark table and timings	2026-03-20 09:40:56 +01:00
Nexes the Elder	61fad8b094	Print timings in sweep-bench (#1454)	2026-03-18 06:57:00 +01:00
firecoperana	1cb7e1bf39	spec : add self speculative decoding, ngram and refactor (#1261) * spec : add self speculative decoding and ngram-mod and refactor common : use common_ prefix for common library function llama : use LLAMA_TOKEN_NULL spec : add self speculative decoding (no draft model required) + refactor spec : add ngram-mod spec : various improvements to ngram-map + docs spec : fix the check-rate logic of ngram-simple common : add common_speculative_is_compat() spec : simplify time measurement using common_time_meas refactor common_sampler_init refactor common_token_to_piece refactor and fix cur_p bug clean up * spec : remove check rate * spec: show warnings instead of abort --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>	2026-02-13 19:04:55 +01:00
Kawrakow	573e23679d	sweep_bench: set number of repetitions (#1176)	2026-01-22 12:28:30 +02:00
firecoperana	d71a3ec315	Server: refactor and rename functions (#1151) * Server: rename functions and refactor code rename functions refactor update slots rename params_base rename timings * change * Revert kv cache name changes * Revert 2 * fix test build error --------- Co-authored-by: firecoperana <firecoperana>	2026-01-18 08:16:57 +02:00
Kawrakow	efcb5f9d9e	sweep-bench: be able to set TG tokens via -n (#897) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 14:39:30 +02:00
Kawrakow	633e0617b0	Enable CUDA graphs for MoE models + GPT-OSS support (#689) * gpt-oss: common * gpt-oss: attention sinks, swiglu_oai * gpt-oss: WIP llama Model loads and runs (CPU only), but PPL is much too high (~1500 for 1st batch vs ~200 in mainline). Is it because of SWA, because of vocab, or did I introduce a bug somewhere? * gpt-oss: CPU seems to be working It was the SWA that was missing in the previous commit. There are issues with EOG tokens, so this still needs to be added. * CUDA: ADD_ID Just a copy from mainline * gpt-oss: Seems to be working on CUDA * gpt-oss: add sinks to the attn-vec kernels * CUDA: add head size of 64 to new mma Haven't turned it on yet, but observe slightly better PP and slightly worse TG performance with that. * gpt-oss: add ability to use -fmoe (only CUDA for now) * Move row sums to the right place * Add sinks to iqk flash attention * gpt_oss: Implement -fmoe on the CPU * Simdify swiglu_oai Turning it off for now as performance becomes more variable, so perhaps I'm running into thermal throttling more often because of making the CPU work too hard. * llama: factor out model loader * Builds successfully * It runs, but mmap does not work * Fix llama_mmap so mmap works * Minor * Fix CUDA after latest changes * Attempt to use CUDA graphs with MoE models - not working * CUDA graphs WIP - still not working * CUDA graphs - seems to be working Likely not all MLA variants are working. I no longer remember why I added the q8_0 cpy that transposes the tensor, but if really needed, this is now missing. Also missing is q6_0. * Make q8_0 cache work for DeepSeek models with CUDA graphs * cuda: cpy for q6_0 * Fix llama_mmap on non-Linux platforms * Adding forgotten file * Iterating on Windows build failures * cuda: re-add q8_0 -> q8_0 transpose so mla = 2 can be used with CUDA graphs and q8_0 cache. * Disable graphs without -fmoe * Minor * Turn graphs on by default --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-15 09:18:07 +03:00
Kawrakow	ceb8f513e4	Add batch warmup to sweep-bench (#375) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 07:50:26 +03:00
saood06	c12a6f8558	Update sweep bench (deprecating .jsonl support) (#289) * Update sweep bench (deprecating .jsonl support) * Fix README.md	2025-03-25 10:14:44 -05:00
saood06	ce1b59f08c	Add new sweep-bench benchmark (#225) * examples : add new sweep-bench benchmark * Change documentation to reference ik_llama.cpp * Made it compile with ik_llama * Fix JSONL output --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-02-23 00:16:27 -06:00

11 Commits