mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-01 03:41:53 +00:00
840 lines
49 KiB
Markdown
840 lines
49 KiB
Markdown
### 🐛 [#400](https://github.com/ikawrakow/ik_llama.cpp/pull/400) - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache
|
||
|
||
| **Author** | `ikawrakow` |
|
||
| :--- | :--- |
|
||
| **State** | ❌ **Closed** |
|
||
| **Created** | 2025-05-09 |
|
||
| **Updated** | 2025-05-09 |
|
||
|
||
---
|
||
|
||
#### Description
|
||
|
||
The implementation was assuming that the K and V cache are contiguous, and was using this assumption to dequantize to `fp16`. This is certainly wrong for the V cache, which is just a view of the K cache with rows of 512 instead of 576 elements.
|
||
|
||
@JohannesGaessler You may want to take a look at this PR. I don't think your [PR in mainline llama.cpp](https://github.com/ggml-org/llama.cpp/pull/13306) can work for DeepSeek models with quantized KV cache.
|
||
|
||
A test session with [this model](https://huggingface.co/bartowski/DeepSeek-V2.5-1210-GGUF/tree/main/DeepSeek-V2.5-1210-IQ3_XXS):
|
||
|
||
```
|
||
./bin/llama-cli -m ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf -t 32 -ngl 100 -mla 3 -fa -c 32768 -s 1234 -ot exps=CPU -cnv -ctk q8_0 -ctv q8_0
|
||
```
|
||
<details>
|
||
<code>
|
||
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
||
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
||
ggml_cuda_init: found 1 CUDA devices:
|
||
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
|
||
Log start
|
||
main: build = 3673 (4084ca73)
|
||
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
|
||
main: seed = 1234
|
||
llama_model_loader: additional 2 GGUFs metadata loaded.
|
||
llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf (version GGUF V3 (latest))
|
||
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
||
llama_model_loader: - kv 0: general.architecture str = deepseek2
|
||
llama_model_loader: - kv 1: general.type str = model
|
||
llama_model_loader: - kv 2: general.name str = DeepSeek V2.5 1210
|
||
llama_model_loader: - kv 3: general.version str = V2.5-1210
|
||
llama_model_loader: - kv 4: general.basename str = DeepSeek
|
||
llama_model_loader: - kv 5: general.size_label str = 160x14B
|
||
llama_model_loader: - kv 6: general.license str = other
|
||
llama_model_loader: - kv 7: general.license.name str = deepseek
|
||
llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe...
|
||
llama_model_loader: - kv 9: deepseek2.block_count u32 = 60
|
||
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
|
||
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 5120
|
||
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 12288
|
||
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
|
||
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
|
||
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000,000000
|
||
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0,000001
|
||
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6
|
||
llama_model_loader: - kv 18: general.file_type u32 = 23
|
||
llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 1
|
||
llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 102400
|
||
llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
|
||
llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
|
||
llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
|
||
llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
|
||
llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 1536
|
||
llama_model_loader: - kv 26: deepseek2.expert_count u32 = 160
|
||
llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 2
|
||
llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 16,000000
|
||
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
|
||
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
|
||
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40,000000
|
||
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
|
||
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0,100000
|
||
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
|
||
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm
|
||
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
||
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
||
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
|
||
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000
|
||
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001
|
||
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001
|
||
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
|
||
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
|
||
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
|
||
llama_model_loader: - kv 45: general.quantization_version u32 = 2
|
||
llama_model_loader: - kv 46: quantize.imatrix.file str = /models_out/DeepSeek-V2.5-1210-GGUF/D...
|
||
llama_model_loader: - kv 47: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
|
||
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 716
|
||
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 139
|
||
llama_model_loader: - kv 50: split.no u16 = 0
|
||
llama_model_loader: - kv 51: split.count u16 = 3
|
||
llama_model_loader: - kv 52: split.tensors.count i32 = 959
|
||
llama_model_loader: - type f32: 300 tensors
|
||
llama_model_loader: - type q5_K: 1 tensors
|
||
llama_model_loader: - type iq3_xxs: 597 tensors
|
||
llama_model_loader: - type iq3_s: 61 tensors
|
||
llm_load_vocab: special tokens cache size = 18
|
||
llm_load_vocab: token to piece cache size = 0,6411 MB
|
||
llm_load_print_meta: format = GGUF V3 (latest)
|
||
llm_load_print_meta: arch = deepseek2
|
||
llm_load_print_meta: vocab type = BPE
|
||
llm_load_print_meta: n_vocab = 102400
|
||
llm_load_print_meta: n_merges = 99757
|
||
llm_load_print_meta: vocab_only = 0
|
||
llm_load_print_meta: n_ctx_train = 163840
|
||
llm_load_print_meta: n_embd = 5120
|
||
llm_load_print_meta: n_layer = 60
|
||
llm_load_print_meta: n_head = 128
|
||
llm_load_print_meta: n_head_kv = 128
|
||
llm_load_print_meta: n_rot = 64
|
||
llm_load_print_meta: n_swa = 0
|
||
llm_load_print_meta: n_swa_pattern = 1
|
||
llm_load_print_meta: n_embd_head_k = 192
|
||
llm_load_print_meta: n_embd_head_v = 128
|
||
llm_load_print_meta: n_gqa = 1
|
||
llm_load_print_meta: n_embd_k_gqa = 24576
|
||
llm_load_print_meta: n_embd_v_gqa = 16384
|
||
llm_load_print_meta: f_norm_eps = 0,0e+00
|
||
llm_load_print_meta: f_norm_rms_eps = 1,0e-06
|
||
llm_load_print_meta: f_clamp_kqv = 0,0e+00
|
||
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
|
||
llm_load_print_meta: f_logit_scale = 0,0e+00
|
||
llm_load_print_meta: n_ff = 12288
|
||
llm_load_print_meta: n_expert = 160
|
||
llm_load_print_meta: n_expert_used = 6
|
||
llm_load_print_meta: causal attn = 1
|
||
llm_load_print_meta: pooling type = 0
|
||
llm_load_print_meta: rope type = 0
|
||
llm_load_print_meta: rope scaling = yarn
|
||
llm_load_print_meta: freq_base_train = 10000,0
|
||
llm_load_print_meta: freq_scale_train = 0,025
|
||
llm_load_print_meta: n_ctx_orig_yarn = 4096
|
||
llm_load_print_meta: rope_finetuned = unknown
|
||
llm_load_print_meta: ssm_d_conv = 0
|
||
llm_load_print_meta: ssm_d_inner = 0
|
||
llm_load_print_meta: ssm_d_state = 0
|
||
llm_load_print_meta: ssm_dt_rank = 0
|
||
llm_load_print_meta: model type = 236B
|
||
llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw
|
||
llm_load_print_meta: model params = 235,741 B
|
||
llm_load_print_meta: model size = 84,604 GiB (3,083 BPW)
|
||
llm_load_print_meta: repeating layers = 84,058 GiB (3,077 BPW, 234,693 B parameters)
|
||
llm_load_print_meta: general.name = DeepSeek V2.5 1210
|
||
llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
|
||
llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
|
||
llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
|
||
llm_load_print_meta: LF token = 126 'Ä'
|
||
llm_load_print_meta: max token length = 256
|
||
llm_load_print_meta: n_layer_dense_lead = 1
|
||
llm_load_print_meta: n_lora_q = 1536
|
||
llm_load_print_meta: n_lora_kv = 512
|
||
llm_load_print_meta: n_ff_exp = 1536
|
||
llm_load_print_meta: n_expert_shared = 2
|
||
llm_load_print_meta: expert_weights_scale = 16,0
|
||
llm_load_print_meta: expert_weights_norm = 0
|
||
llm_load_print_meta: expert_gating_func = softmax
|
||
llm_load_print_meta: rope_yarn_log_mul = 0,1000
|
||
llm_load_tensors: ggml ctx size = 0,80 MiB
|
||
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
|
||
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
|
||
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
|
||
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
|
||
llm_load_tensors: offloading 60 repeating layers to GPU
|
||
llm_load_tensors: offloading non-repeating layers to GPU
|
||
llm_load_tensors: offloaded 61/61 layers to GPU
|
||
llm_load_tensors: CPU buffer size = 37343,30 MiB
|
||
llm_load_tensors: CPU buffer size = 37866,68 MiB
|
||
llm_load_tensors: CPU buffer size = 10656,64 MiB
|
||
llm_load_tensors: CPU buffer size = 214,84 MiB
|
||
llm_load_tensors: CUDA0 buffer size = 5109,97 MiB
|
||
....................................................................................................
|
||
============ llm_load_tensors: need to compute 60 wk_b tensors
|
||
Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
|
||
llama_new_context_with_model: n_ctx = 32768
|
||
llama_new_context_with_model: n_batch = 2048
|
||
llama_new_context_with_model: n_ubatch = 512
|
||
llama_new_context_with_model: flash_attn = 1
|
||
llama_new_context_with_model: mla_attn = 3
|
||
llama_new_context_with_model: attn_max_b = 0
|
||
llama_new_context_with_model: fused_moe = 0
|
||
llama_new_context_with_model: ser = -1, 0
|
||
llama_new_context_with_model: freq_base = 10000,0
|
||
llama_new_context_with_model: freq_scale = 0,025
|
||
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
|
||
llama_kv_cache_init: CUDA0 KV buffer size = 1147,53 MiB
|
||
llama_new_context_with_model: KV self size = 1147,50 MiB, c^KV (q8_0): 1147,50 MiB, kv^T: not used
|
||
llama_new_context_with_model: CUDA_Host output buffer size = 0,39 MiB
|
||
llama_new_context_with_model: CUDA0 compute buffer size = 6346,00 MiB
|
||
llama_new_context_with_model: CUDA_Host compute buffer size = 74,01 MiB
|
||
llama_new_context_with_model: graph nodes = 3350
|
||
llama_new_context_with_model: graph splits = 179
|
||
main: chat template example: You are a helpful assistant
|
||
|
||
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
|
||
|
||
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
|
||
main: interactive mode on.
|
||
sampling:
|
||
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
|
||
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
|
||
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
|
||
sampling order:
|
||
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
|
||
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 1
|
||
|
||
|
||
== Running in interactive mode. ==
|
||
- Press Ctrl+C to interject at any time.
|
||
- Press Return to return control to the AI.
|
||
- To return control without starting a new line, end your input with '/'.
|
||
- If you want to submit another line, end your input with '\'.
|
||
|
||
</code>
|
||
> Count from 1 to 10 in French
|
||
|
||
Sure! Here's how you count from 1 to 10 in French:
|
||
|
||
1. Un
|
||
2. Deux
|
||
3. Trois
|
||
4. Quatre
|
||
5. Cinq
|
||
6. Six
|
||
7. Sept
|
||
8. Huit
|
||
9. Neuf
|
||
10. Dix
|
||
|
||
> Tell me everything you know about radiation therapy of cancer
|
||
|
||
Radiation therapy, also known as radiotherapy, is a medical treatment that uses high doses of radiation to kill cancer cells and shrink tumors. Here’s an overview of everything you need to know about radiation therapy for cancer:
|
||
|
||
### **Types of Radiation Therapy**
|
||
|
||
1. **External Beam Radiation Therapy (EBRT):**
|
||
- **Description:** Radiation is delivered from a machine outside the body, targeting the cancerous tumor.
|
||
- **Common Techniques:**
|
||
- **3D Conformal Radiation Therapy:** Uses multiple radiation beams to shape the treatment to the tumor’s 3D shape.
|
||
- **Intensity-Modulated Radiation Therapy (IMRT):** Delivers varying doses of radiation to different parts of the tumor, reducing damage to nearby healthy tissues.
|
||
- **Proton Therapy:** Uses protons instead of X-rays, allowing for precise targeting of the tumor with minimal radiation to surrounding tissues.
|
||
- **Stereotactic Radiosurgery (SRS) and Stereotactic Body Radiation Therapy (SBRT):** High-precision techniques used for small tumors or lesions, often in the brain or lung.
|
||
|
||
2. **Internal Radiation Therapy (Brachytherapy):**
|
||
- **Description:** Radioactive sources are placed inside the body, either temporarily or permanently, directly into or near the tumor.
|
||
- **Types:**
|
||
- **High Dose Rate (HDR) Brachytherapy:** Temporary placement of radioactive material for a short period.
|
||
- **Low Dose Rate (LDR) Brachytherapy:** Permanent placement of radioactive seeds, commonly used for prostate cancer.
|
||
|
||
3. **Systemic Radiation Therapy:**
|
||
- **Description:** Radioactive substances are administered through the bloodstream, targeting cancer cells throughout the body.
|
||
- **Examples:**
|
||
- **Radioactive iodine (I-131)** for thyroid cancer.
|
||
- **Lutetium-177 (Lu-177) or Yttrium-90 (Y-90)** for neuroendocrine tumors.
|
||
|
||
### **Purpose of Radiation Therapy**
|
||
|
||
1. **Cancer Treatment:**
|
||
- **Curative Intent:** To eliminate the cancer completely, often used in early-stage cancers.
|
||
- **Palliative Treatment:** To relieve symptoms and improve quality of life for advanced-stage cancers.
|
||
- **Adjuvant Therapy:** Used after surgery to eliminate any remaining cancer cells.
|
||
- **Neoadjuvant Therapy:** Used before surgery to shrink the tumor, making surgery easier and potentially reducing the extent of surgery needed.
|
||
|
||
2. **Prevention of Recurrence:**
|
||
- Radiation may be used to reduce the risk of cancer recurrence in high-risk patients.
|
||
|
||
### **Side Effects of Radiation Therapy**
|
||
|
||
1. **Acute Side Effects:**
|
||
- **Skin Reactions:** Redness, irritation, and peeling.
|
||
- **Fatigue:** Common and often temporary.
|
||
- **Swelling or Edema:** Due to fluid accumulation in treated areas.
|
||
- **Mucositis:** Inflammation of the mucous membranes, common in head and neck cancers.
|
||
|
||
2. **Late Side Effects:**
|
||
- **Tissue Fibrosis:** Scarring of the irradiated tissues.
|
||
- **Lymphedema:** Swelling due to damage to the lymphatic system.
|
||
- **Cognitive Changes:** Rare, but possible in brain radiation.
|
||
- **Secondary Cancers:** Extremely rare but possible years after treatment.
|
||
|
||
### **Factors Influencing Radiation Therapy**
|
||
|
||
1. **Tumor Characteristics:**
|
||
- Type, size, and location of the tumor.
|
||
- Whether the tumor is localized or has spread (metastatic).
|
||
|
||
2. **Patient Characteristics:**
|
||
- Age and overall health.
|
||
- Previous cancer treatments and their outcomes.
|
||
- Genetic factors and cancer predisposition.
|
||
|
||
3. **Radiation Techniques and Doses:**
|
||
- The type of radiation used and the dose administered.
|
||
- The fractionation schedule (how the dose is divided over time).
|
||
|
||
### **Advances in Radiation Therapy**
|
||
|
||
1. **Imaging and Targeting:**
|
||
- Advanced imaging techniques like PET scans help in precisely targeting the tumor.
|
||
- Image-guided radiation therapy (IGRT) uses real-time imaging to adjust the radiation beam during treatment.
|
||
|
||
2. **Personalized Medicine:**
|
||
- Customized treatment plans based on the individual patient’s tumor characteristics and genetics.
|
||
|
||
3. **Integration with Other Treatments:**
|
||
- Combination with chemotherapy, immunotherapy, or surgery for better outcomes.
|
||
|
||
### **Conclusion**
|
||
|
||
Radiation therapy is a powerful tool in the fight against cancer, offering curative potential, symptom relief, and preventive measures. Advances in technology continue to improve its precision and reduce side effects, making it an essential component of modern cancer care.
|
||
|
||
> How many r's are there in strawberry
|
||
|
||
The word "strawberry" contains **3 r's**.
|
||
|
||
> Can you solve the following system of equations for me? x + y = 2 x - y = 3
|
||
|
||
Certainly! Let's solve the system of equations step by step:
|
||
|
||
We are given two equations:
|
||
|
||
1. \( x + y = 2 \)
|
||
2. \( x - y = 3 \)
|
||
|
||
### Step 1: Add the two equations
|
||
|
||
Adding the left-hand sides:
|
||
\[ (x + y) + (x - y) = 2 + 3 \]
|
||
\[ 2x = 5 \]
|
||
|
||
### Step 2: Solve for \( x \)
|
||
|
||
\[ 2x = 5 \]
|
||
\[ x = \frac{5}{2} \]
|
||
\[ x = 2.5 \]
|
||
|
||
### Step 3: Substitute \( x = 2.5 \) into one of the original equations to find \( y \)
|
||
|
||
Let's use the first equation:
|
||
\[ x + y = 2 \]
|
||
\[ 2.5 + y = 2 \]
|
||
\[ y = 2 - 2.5 \]
|
||
\[ y = -0.5 \]
|
||
|
||
### Final Solution
|
||
|
||
The solution to the system of equations is:
|
||
\[ x = 2.5 \]
|
||
\[ y = -0.5 \]
|
||
|
||
> Can you explain the German electoral system?
|
||
|
||
Certainly! The German electoral system is characterized by a mixed-member proportional representation (MMP) system, which combines elements of both constituency-based voting and proportional representation. Here’s an overview of how it works:
|
||
|
||
### **Electoral System Overview**
|
||
|
||
1. **Electoral Districts:**
|
||
- Germany is divided into 299 single-member constituencies (Wahlkreise), each represented by one directly elected Member of Parliament (MP).
|
||
- Voters cast two votes: one for a candidate in their local constituency (Erststimme or first vote) and one for a political party (Zweitstimme or second vote).
|
||
|
||
2. **Directly Elected MPs (Erststimme):**
|
||
- The candidate with the most votes in each constituency wins a seat in the Bundestag (Federal Parliament).
|
||
|
||
3. **Party List MPs (Zweitstimme):**
|
||
- The second vote determines the proportional representation of parties in the Bundestag.
|
||
- Each party’s share of the total second votes determines how many additional seats they receive beyond the directly elected MPs.
|
||
|
||
### **Seat Allocation Process**
|
||
|
||
1. **Initial Allocation:**
|
||
- Directly elected MPs from the constituencies are first seated.
|
||
- The remaining seats are filled based on the proportion of second votes each party receives.
|
||
|
||
2. **Overhang Seats:**
|
||
- If a party wins significantly more constituency seats than its share of the second votes would allow, additional "overhang" seats are added to maintain proportionality.
|
||
|
||
3. **Compensatory Seats:**
|
||
- To ensure overall proportionality, compensatory seats are allocated to parties that did not win enough constituency seats to match their share of the second votes.
|
||
|
||
### **Formation of the Bundestag**
|
||
|
||
1. **Minimum Threshold:**
|
||
- Parties must receive at least 5% of the second votes nationwide or win at least three constituency seats to enter the Bundestag.
|
||
|
||
2. **Government Formation:**
|
||
- The party or coalition with the majority of seats forms the government.
|
||
- If no single party has a majority, a coalition government is formed between two or more parties.
|
||
|
||
### **Key Features**
|
||
|
||
1. **Stability:**
|
||
- The system promotes stable government formation by allowing for coalition-building among parties with similar political platforms.
|
||
|
||
2. **Representation:**
|
||
- Proportional representation ensures that smaller parties have a voice in the Bundestag, reflecting the diversity of political views in the electorate.
|
||
|
||
3. **Direct Representation:**
|
||
- Voters have a personal connection to their directly elected MP, enhancing local representation.
|
||
|
||
### **Conclusion**
|
||
|
||
The German electoral system is designed to balance local representation with proportional representation of political parties, ensuring a stable and representative government. This hybrid system has been successful in maintaining political stability while reflecting the diverse political landscape of Germany.
|
||
|
||
> Can you find the encoding rule for the following exmple: Encoded text: oyfjdnisdr rtqwainr acxz mynzbhhx Decoded text: Think step by step
|
||
|
||
To decode the encoded text "oyfjdnisdr rtqwainr acxz mynzbhhx" to "Think step by step," we need to find the encoding rule. Given the nature of the encoded text and the decoded text, it seems likely that the encoding involves a simple substitution cipher, where each letter in the plaintext is replaced by a different letter.
|
||
|
||
Let's break down the encoded text and compare it with the decoded text:
|
||
|
||
Encoded Text:
|
||
`oyfjdnisdr rtqwainr acxz mynzbhhx`
|
||
|
||
Decoded Text:
|
||
`Think step by step`
|
||
|
||
### Step 1: Analyze the Lengths
|
||
- "Think step by step" has 16 characters.
|
||
- "oyfjdnisdr rtqwainr acxz mynzbhhx" has 32 characters.
|
||
|
||
This suggests that each letter in the decoded text corresponds to two letters in the encoded text.
|
||
|
||
### Step 2: Identify Letter Pairs
|
||
Let's align the letters:
|
||
|
||
Decoded Text:
|
||
`T h i n k s t e p b y s t e p`
|
||
|
||
Encoded Text:
|
||
`o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x`
|
||
|
||
### Step 3: Match Letters
|
||
Now, let's match each pair of letters from the encoded text to a single letter in the decoded text:
|
||
|
||
- `T` → `o y`
|
||
- `h` → `f j`
|
||
- `i` → `d n`
|
||
- `n` → `i s`
|
||
- `k` → `d r`
|
||
- `s` → `r t`
|
||
- `t` → `q w`
|
||
- `e` → `a i`
|
||
- `p` → `n r`
|
||
- `b` → `a c`
|
||
- `y` → `x z`
|
||
- `s` → `m y`
|
||
- `t` → `n z`
|
||
- `e` → `b h`
|
||
- `p` → `h x`
|
||
|
||
### Step 4: Formulate the Encoding Rule
|
||
From the above matching, we can infer that the encoding rule is:
|
||
|
||
- Each letter in the decoded text corresponds to two letters in the encoded text.
|
||
- The mapping is not a simple substitution but rather a pairing of letters.
|
||
|
||
### Conclusion
|
||
The encoding rule for this example is that each letter in the decoded text is represented by a pair of letters in the encoded text. The specific mapping can be derived from the given example:
|
||
|
||
- `T` → `o y`
|
||
- `h` → `f j`
|
||
- `i` → `d n`
|
||
- `n` → `i s`
|
||
- `k` → `d r`
|
||
- `s` → `r t`
|
||
- `t` → `q w`
|
||
- `e` → `a i`
|
||
- `p` → `n r`
|
||
- `b` → `a c`
|
||
- `y` → `x z`
|
||
- `s` → `m y`
|
||
- `t` → `n z`
|
||
- `e` → `b h`
|
||
- `p` → `h x`
|
||
|
||
This rule can be used to decode other similarly encoded texts.
|
||
|
||
>
|
||
|
||
llama_print_timings: load time = 11641,32 ms
|
||
llama_print_timings: sample time = 135,78 ms / 2822 runs ( 0,05 ms per token, 20782,86 tokens per second)
|
||
llama_print_timings: prompt eval time = 24467,67 ms / 126 tokens ( 194,19 ms per token, 5,15 tokens per second)
|
||
llama_print_timings: eval time = 224627,46 ms / 2816 runs ( 79,77 ms per token, 12,54 tokens per second)
|
||
llama_print_timings: total time = 517022,50 ms / 2942 tokens
|
||
</details>
|
||
|
||
Here a quick `sweep-bench` performance test
|
||
|
||
### `fp16` KV cache
|
||
|
||
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|
||
|-------|--------|--------|----------|----------|----------|----------|
|
||
| 2048 | 512 | 0 | 14.243 | 143.79 | 39.607 | 12.93 |
|
||
| 2048 | 512 | 2048 | 14.741 | 138.93 | 40.155 | 12.75 |
|
||
| 2048 | 512 | 4096 | 15.250 | 134.29 | 40.546 | 12.63 |
|
||
| 2048 | 512 | 6144 | 15.778 | 129.80 | 41.711 | 12.27 |
|
||
| 2048 | 512 | 8192 | 16.303 | 125.62 | 41.891 | 12.22 |
|
||
| 2048 | 512 | 10240 | 16.847 | 121.57 | 42.925 | 11.93 |
|
||
| 2048 | 512 | 12288 | 17.497 | 117.05 | 43.123 | 11.87 |
|
||
| 2048 | 512 | 14336 | 17.874 | 114.58 | 43.521 | 11.76 |
|
||
|
||
### `Q8_0` KV cache
|
||
|
||
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|
||
|-------|--------|--------|----------|----------|----------|----------|
|
||
| 2048 | 512 | 0 | 14.284 | 143.38 | 39.549 | 12.95 |
|
||
| 2048 | 512 | 2048 | 14.795 | 138.42 | 40.182 | 12.74 |
|
||
| 2048 | 512 | 4096 | 15.379 | 133.17 | 40.770 | 12.56 |
|
||
| 2048 | 512 | 6144 | 18.119 | 113.03 | 42.032 | 12.18 |
|
||
| 2048 | 512 | 8192 | 16.466 | 124.38 | 42.423 | 12.07 |
|
||
| 2048 | 512 | 10240 | 16.945 | 120.86 | 43.506 | 11.77 |
|
||
| 2048 | 512 | 12288 | 17.601 | 116.35 | 43.925 | 11.66 |
|
||
| 2048 | 512 | 14336 | 17.987 | 113.86 | 44.597 | 11.48 |
|
||
|
||
I.e., only very slightly slower than `fp16` KV cache. The KV cache is quite small with FlashMLA-3, but if one wants to go to 160k tokens with DeepSeek-V3/R1, using `Q8_0` KV cache instead of `fp16` may make the difference between being able or not being able to run with a single 24 GB GPU.
|
||
|
||
---
|
||
|
||
#### 💬 Conversation
|
||
|
||
👤 **JohannesGaessler** commented the **2025-05-09** at **07:23:38**:<br>
|
||
|
||
Thank you for notifying me. I am aware of the defect, on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory. I can't comment on the specific code in this PR since I won't look at it unless you explicitly tell me I'm allowed to do so even without the conflict between you and Georgi first being resolved. The way I would have gone about it would have been not to use the V tensor at all, to dequantize K, and to then calculate the pointer, dimension, and strides for a pseudo V tensor from the K tensor.
|
||
|
||
---
|
||
|
||
👤 **ikawrakow** commented the **2025-05-09** at **07:25:52**:<br>
|
||
|
||
Forgot to add `-rtr` in the above performance test. Here it is with `-rtr` and `q8_0` KV cache
|
||
|
||
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|
||
|-------|--------|--------|----------|----------|----------|----------|
|
||
| 2048 | 512 | 0 | 13.348 | 153.43 | 36.662 | 13.97 |
|
||
| 2048 | 512 | 2048 | 14.637 | 139.92 | 37.208 | 13.76 |
|
||
| 2048 | 512 | 4096 | 14.478 | 141.46 | 37.720 | 13.57 |
|
||
| 2048 | 512 | 6144 | 14.880 | 137.64 | 39.034 | 13.12 |
|
||
| 2048 | 512 | 8192 | 16.081 | 127.36 | 39.282 | 13.03 |
|
||
| 2048 | 512 | 10240 | 16.240 | 126.11 | 40.409 | 12.67 |
|
||
| 2048 | 512 | 12288 | 17.001 | 120.47 | 40.805 | 12.55 |
|
||
| 2048 | 512 | 14336 | 18.056 | 113.42 | 41.437 | 12.36 |
|
||
|
||
---
|
||
|
||
👤 **ikawrakow** commented the **2025-05-09** at **07:31:04**:<br>
|
||
|
||
> on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory.
|
||
|
||
Oh, yes, I forgot about that.
|
||
|
||
In any case, the PR in `ik_llama.cpp` is mostly a copy of your mainline PR, so you looking at the code you wrote in my repository hopefully does not break Georgi's rules.
|
||
|
||
---
|
||
|
||
👤 **JohannesGaessler** commented the **2025-05-09** at **07:49:51**:<br>
|
||
|
||
My concern specifically is whether you would consider any of my work on mainline after looking at your code to be including a "substantial portion" of your work and could thus only be included in conjunction with the copyright notices in ik_llama.cpp. Much like you I am not a lawyer but if you tell me that you will not consider me looking at your work to be a license violation (or that in some specific case you waive the requirement of copyright notices) then there is no need for lawyers in the first place. |