49 KiB
🐛 #400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-05-09 |
| Updated | 2025-05-09 |
Description
The implementation was assuming that the K and V cache are contiguous, and was using this assumption to dequantize to fp16. This is certainly wrong for the V cache, which is just a view of the K cache with rows of 512 instead of 576 elements.
@JohannesGaessler You may want to take a look at this PR. I don't think your PR in mainline llama.cpp can work for DeepSeek models with quantized KV cache.
A test session with this model:
./bin/llama-cli -m ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf -t 32 -ngl 100 -mla 3 -fa -c 32768 -s 1234 -ot exps=CPU -cnv -ctk q8_0 -ctv q8_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
Log start
main: build = 3673 (4084ca73)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1234
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V2.5 1210
llama_model_loader: - kv 3: general.version str = V2.5-1210
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 160x14B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = deepseek
llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe...
llama_model_loader: - kv 9: deepseek2.block_count u32 = 60
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 5120
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000,000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6
llama_model_loader: - kv 18: general.file_type u32 = 23
llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 102400
llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 26: deepseek2.expert_count u32 = 160
llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 2
llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 16,000000
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40,000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0,100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: quantize.imatrix.file str = /models_out/DeepSeek-V2.5-1210-GGUF/D...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 716
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 139
llama_model_loader: - kv 50: split.no u16 = 0
llama_model_loader: - kv 51: split.count u16 = 3
llama_model_loader: - kv 52: split.tensors.count i32 = 959
llama_model_loader: - type f32: 300 tensors
llama_model_loader: - type q5_K: 1 tensors
llama_model_loader: - type iq3_xxs: 597 tensors
llama_model_loader: - type iq3_s: 61 tensors
llm_load_vocab: special tokens cache size = 18
llm_load_vocab: token to piece cache size = 0,6411 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 102400
llm_load_print_meta: n_merges = 99757
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-06
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale = 0,0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 160
llm_load_print_meta: n_expert_used = 6
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000,0
llm_load_print_meta: freq_scale_train = 0,025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 236B
llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw
llm_load_print_meta: model params = 235,741 B
llm_load_print_meta: model size = 84,604 GiB (3,083 BPW)
llm_load_print_meta: repeating layers = 84,058 GiB (3,077 BPW, 234,693 B parameters)
llm_load_print_meta: general.name = DeepSeek V2.5 1210
llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 126 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 1
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 1536
llm_load_print_meta: n_expert_shared = 2
llm_load_print_meta: expert_weights_scale = 16,0
llm_load_print_meta: expert_weights_norm = 0
llm_load_print_meta: expert_gating_func = softmax
llm_load_print_meta: rope_yarn_log_mul = 0,1000
llm_load_tensors: ggml ctx size = 0,80 MiB
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 37343,30 MiB
llm_load_tensors: CPU buffer size = 37866,68 MiB
llm_load_tensors: CPU buffer size = 10656,64 MiB
llm_load_tensors: CPU buffer size = 214,84 MiB
llm_load_tensors: CUDA0 buffer size = 5109,97 MiB
....................................................................................................
============ llm_load_tensors: need to compute 60 wk_b tensors
Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 0,025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CUDA0 KV buffer size = 1147,53 MiB
llama_new_context_with_model: KV self size = 1147,50 MiB, c^KV (q8_0): 1147,50 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0,39 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 6346,00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 74,01 MiB
llama_new_context_with_model: graph nodes = 3350
llama_new_context_with_model: graph splits = 179
main: chat template example: You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with ''.
> Count from 1 to 10 in French
Sure! Here's how you count from 1 to 10 in French:
- Un
- Deux
- Trois
- Quatre
- Cinq
- Six
- Sept
- Huit
- Neuf
- Dix
Tell me everything you know about radiation therapy of cancer
Radiation therapy, also known as radiotherapy, is a medical treatment that uses high doses of radiation to kill cancer cells and shrink tumors. Here’s an overview of everything you need to know about radiation therapy for cancer:
Types of Radiation Therapy
-
External Beam Radiation Therapy (EBRT):
- Description: Radiation is delivered from a machine outside the body, targeting the cancerous tumor.
- Common Techniques:
- 3D Conformal Radiation Therapy: Uses multiple radiation beams to shape the treatment to the tumor’s 3D shape.
- Intensity-Modulated Radiation Therapy (IMRT): Delivers varying doses of radiation to different parts of the tumor, reducing damage to nearby healthy tissues.
- Proton Therapy: Uses protons instead of X-rays, allowing for precise targeting of the tumor with minimal radiation to surrounding tissues.
- Stereotactic Radiosurgery (SRS) and Stereotactic Body Radiation Therapy (SBRT): High-precision techniques used for small tumors or lesions, often in the brain or lung.
-
Internal Radiation Therapy (Brachytherapy):
- Description: Radioactive sources are placed inside the body, either temporarily or permanently, directly into or near the tumor.
- Types:
- High Dose Rate (HDR) Brachytherapy: Temporary placement of radioactive material for a short period.
- Low Dose Rate (LDR) Brachytherapy: Permanent placement of radioactive seeds, commonly used for prostate cancer.
-
Systemic Radiation Therapy:
- Description: Radioactive substances are administered through the bloodstream, targeting cancer cells throughout the body.
- Examples:
- Radioactive iodine (I-131) for thyroid cancer.
- Lutetium-177 (Lu-177) or Yttrium-90 (Y-90) for neuroendocrine tumors.
Purpose of Radiation Therapy
-
Cancer Treatment:
- Curative Intent: To eliminate the cancer completely, often used in early-stage cancers.
- Palliative Treatment: To relieve symptoms and improve quality of life for advanced-stage cancers.
- Adjuvant Therapy: Used after surgery to eliminate any remaining cancer cells.
- Neoadjuvant Therapy: Used before surgery to shrink the tumor, making surgery easier and potentially reducing the extent of surgery needed.
-
Prevention of Recurrence:
- Radiation may be used to reduce the risk of cancer recurrence in high-risk patients.
Side Effects of Radiation Therapy
-
Acute Side Effects:
- Skin Reactions: Redness, irritation, and peeling.
- Fatigue: Common and often temporary.
- Swelling or Edema: Due to fluid accumulation in treated areas.
- Mucositis: Inflammation of the mucous membranes, common in head and neck cancers.
-
Late Side Effects:
- Tissue Fibrosis: Scarring of the irradiated tissues.
- Lymphedema: Swelling due to damage to the lymphatic system.
- Cognitive Changes: Rare, but possible in brain radiation.
- Secondary Cancers: Extremely rare but possible years after treatment.
Factors Influencing Radiation Therapy
-
Tumor Characteristics:
- Type, size, and location of the tumor.
- Whether the tumor is localized or has spread (metastatic).
-
Patient Characteristics:
- Age and overall health.
- Previous cancer treatments and their outcomes.
- Genetic factors and cancer predisposition.
-
Radiation Techniques and Doses:
- The type of radiation used and the dose administered.
- The fractionation schedule (how the dose is divided over time).
Advances in Radiation Therapy
-
Imaging and Targeting:
- Advanced imaging techniques like PET scans help in precisely targeting the tumor.
- Image-guided radiation therapy (IGRT) uses real-time imaging to adjust the radiation beam during treatment.
-
Personalized Medicine:
- Customized treatment plans based on the individual patient’s tumor characteristics and genetics.
-
Integration with Other Treatments:
- Combination with chemotherapy, immunotherapy, or surgery for better outcomes.
Conclusion
Radiation therapy is a powerful tool in the fight against cancer, offering curative potential, symptom relief, and preventive measures. Advances in technology continue to improve its precision and reduce side effects, making it an essential component of modern cancer care.
How many r's are there in strawberry
The word "strawberry" contains 3 r's.
Can you solve the following system of equations for me? x + y = 2 x - y = 3
Certainly! Let's solve the system of equations step by step:
We are given two equations:
- ( x + y = 2 )
- ( x - y = 3 )
Step 1: Add the two equations
Adding the left-hand sides: [ (x + y) + (x - y) = 2 + 3 ] [ 2x = 5 ]
Step 2: Solve for ( x )
[ 2x = 5 ] [ x = \frac{5}{2} ] [ x = 2.5 ]
Step 3: Substitute ( x = 2.5 ) into one of the original equations to find ( y )
Let's use the first equation: [ x + y = 2 ] [ 2.5 + y = 2 ] [ y = 2 - 2.5 ] [ y = -0.5 ]
Final Solution
The solution to the system of equations is: [ x = 2.5 ] [ y = -0.5 ]
Can you explain the German electoral system?
Certainly! The German electoral system is characterized by a mixed-member proportional representation (MMP) system, which combines elements of both constituency-based voting and proportional representation. Here’s an overview of how it works:
Electoral System Overview
-
Electoral Districts:
- Germany is divided into 299 single-member constituencies (Wahlkreise), each represented by one directly elected Member of Parliament (MP).
- Voters cast two votes: one for a candidate in their local constituency (Erststimme or first vote) and one for a political party (Zweitstimme or second vote).
-
Directly Elected MPs (Erststimme):
- The candidate with the most votes in each constituency wins a seat in the Bundestag (Federal Parliament).
-
Party List MPs (Zweitstimme):
- The second vote determines the proportional representation of parties in the Bundestag.
- Each party’s share of the total second votes determines how many additional seats they receive beyond the directly elected MPs.
Seat Allocation Process
-
Initial Allocation:
- Directly elected MPs from the constituencies are first seated.
- The remaining seats are filled based on the proportion of second votes each party receives.
-
Overhang Seats:
- If a party wins significantly more constituency seats than its share of the second votes would allow, additional "overhang" seats are added to maintain proportionality.
-
Compensatory Seats:
- To ensure overall proportionality, compensatory seats are allocated to parties that did not win enough constituency seats to match their share of the second votes.
Formation of the Bundestag
-
Minimum Threshold:
- Parties must receive at least 5% of the second votes nationwide or win at least three constituency seats to enter the Bundestag.
-
Government Formation:
- The party or coalition with the majority of seats forms the government.
- If no single party has a majority, a coalition government is formed between two or more parties.
Key Features
-
Stability:
- The system promotes stable government formation by allowing for coalition-building among parties with similar political platforms.
-
Representation:
- Proportional representation ensures that smaller parties have a voice in the Bundestag, reflecting the diversity of political views in the electorate.
-
Direct Representation:
- Voters have a personal connection to their directly elected MP, enhancing local representation.
Conclusion
The German electoral system is designed to balance local representation with proportional representation of political parties, ensuring a stable and representative government. This hybrid system has been successful in maintaining political stability while reflecting the diverse political landscape of Germany.
Can you find the encoding rule for the following exmple: Encoded text: oyfjdnisdr rtqwainr acxz mynzbhhx Decoded text: Think step by step
To decode the encoded text "oyfjdnisdr rtqwainr acxz mynzbhhx" to "Think step by step," we need to find the encoding rule. Given the nature of the encoded text and the decoded text, it seems likely that the encoding involves a simple substitution cipher, where each letter in the plaintext is replaced by a different letter.
Let's break down the encoded text and compare it with the decoded text:
Encoded Text:
oyfjdnisdr rtqwainr acxz mynzbhhx
Decoded Text:
Think step by step
Step 1: Analyze the Lengths
- "Think step by step" has 16 characters.
- "oyfjdnisdr rtqwainr acxz mynzbhhx" has 32 characters.
This suggests that each letter in the decoded text corresponds to two letters in the encoded text.
Step 2: Identify Letter Pairs
Let's align the letters:
Decoded Text:
T h i n k s t e p b y s t e p
Encoded Text:
o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x
Step 3: Match Letters
Now, let's match each pair of letters from the encoded text to a single letter in the decoded text:
T→o yh→f ji→d nn→i sk→d rs→r tt→q we→a ip→n rb→a cy→x zs→m yt→n ze→b hp→h x
Step 4: Formulate the Encoding Rule
From the above matching, we can infer that the encoding rule is:
- Each letter in the decoded text corresponds to two letters in the encoded text.
- The mapping is not a simple substitution but rather a pairing of letters.
Conclusion
The encoding rule for this example is that each letter in the decoded text is represented by a pair of letters in the encoded text. The specific mapping can be derived from the given example:
T→o yh→f ji→d nn→i sk→d rs→r tt→q we→a ip→n rb→a cy→x zs→m yt→n ze→b hp→h x
This rule can be used to decode other similarly encoded texts.
llama_print_timings: load time = 11641,32 ms llama_print_timings: sample time = 135,78 ms / 2822 runs ( 0,05 ms per token, 20782,86 tokens per second) llama_print_timings: prompt eval time = 24467,67 ms / 126 tokens ( 194,19 ms per token, 5,15 tokens per second) llama_print_timings: eval time = 224627,46 ms / 2816 runs ( 79,77 ms per token, 12,54 tokens per second) llama_print_timings: total time = 517022,50 ms / 2942 tokens
Here a quick sweep-bench performance test
fp16 KV cache
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 14.243 | 143.79 | 39.607 | 12.93 |
| 2048 | 512 | 2048 | 14.741 | 138.93 | 40.155 | 12.75 |
| 2048 | 512 | 4096 | 15.250 | 134.29 | 40.546 | 12.63 |
| 2048 | 512 | 6144 | 15.778 | 129.80 | 41.711 | 12.27 |
| 2048 | 512 | 8192 | 16.303 | 125.62 | 41.891 | 12.22 |
| 2048 | 512 | 10240 | 16.847 | 121.57 | 42.925 | 11.93 |
| 2048 | 512 | 12288 | 17.497 | 117.05 | 43.123 | 11.87 |
| 2048 | 512 | 14336 | 17.874 | 114.58 | 43.521 | 11.76 |
Q8_0 KV cache
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 14.284 | 143.38 | 39.549 | 12.95 |
| 2048 | 512 | 2048 | 14.795 | 138.42 | 40.182 | 12.74 |
| 2048 | 512 | 4096 | 15.379 | 133.17 | 40.770 | 12.56 |
| 2048 | 512 | 6144 | 18.119 | 113.03 | 42.032 | 12.18 |
| 2048 | 512 | 8192 | 16.466 | 124.38 | 42.423 | 12.07 |
| 2048 | 512 | 10240 | 16.945 | 120.86 | 43.506 | 11.77 |
| 2048 | 512 | 12288 | 17.601 | 116.35 | 43.925 | 11.66 |
| 2048 | 512 | 14336 | 17.987 | 113.86 | 44.597 | 11.48 |
I.e., only very slightly slower than fp16 KV cache. The KV cache is quite small with FlashMLA-3, but if one wants to go to 160k tokens with DeepSeek-V3/R1, using Q8_0 KV cache instead of fp16 may make the difference between being able or not being able to run with a single 24 GB GPU.
💬 Conversation
👤 JohannesGaessler commented the 2025-05-09 at 07:23:38:
Thank you for notifying me. I am aware of the defect, on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory. I can't comment on the specific code in this PR since I won't look at it unless you explicitly tell me I'm allowed to do so even without the conflict between you and Georgi first being resolved. The way I would have gone about it would have been not to use the V tensor at all, to dequantize K, and to then calculate the pointer, dimension, and strides for a pseudo V tensor from the K tensor.
👤 ikawrakow commented the 2025-05-09 at 07:25:52:
Forgot to add -rtr in the above performance test. Here it is with -rtr and q8_0 KV cache
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 13.348 | 153.43 | 36.662 | 13.97 |
| 2048 | 512 | 2048 | 14.637 | 139.92 | 37.208 | 13.76 |
| 2048 | 512 | 4096 | 14.478 | 141.46 | 37.720 | 13.57 |
| 2048 | 512 | 6144 | 14.880 | 137.64 | 39.034 | 13.12 |
| 2048 | 512 | 8192 | 16.081 | 127.36 | 39.282 | 13.03 |
| 2048 | 512 | 10240 | 16.240 | 126.11 | 40.409 | 12.67 |
| 2048 | 512 | 12288 | 17.001 | 120.47 | 40.805 | 12.55 |
| 2048 | 512 | 14336 | 18.056 | 113.42 | 41.437 | 12.36 |
👤 ikawrakow commented the 2025-05-09 at 07:31:04:
on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory.
Oh, yes, I forgot about that.
In any case, the PR in ik_llama.cpp is mostly a copy of your mainline PR, so you looking at the code you wrote in my repository hopefully does not break Georgi's rules.
👤 JohannesGaessler commented the 2025-05-09 at 07:49:51:
My concern specifically is whether you would consider any of my work on mainline after looking at your code to be including a "substantial portion" of your work and could thus only be included in conjunction with the copyright notices in ik_llama.cpp. Much like you I am not a lawyer but if you tell me that you will not consider me looking at your work to be a license violation (or that in some specific case you waive the requirement of copyright notices) then there is no need for lawyers in the first place.