🗣️ #266 - Benchmarking DeepSeek R1 - 16x3090

Author davidsyoung
Created 2025-03-18
Updated 2025-03-21

Description

I wanted to create a resource for anyone looking to optimise -b, -ub and -amb (with -mla 2 -fa -fmoe) when offloading DeepSeek R1 fully to CUDA with ik_llama.cpp @ dcdfad29f7.

Layers are not evenly spread over the 16 GPUs, and GPU utilisation averages only 5-10%, at under 150 W per GPU.

I'm not sure how useful this is, but I ran it overnight. It hit an OOM error at -b 4096 on pp8192, which is why the table stops short, but I still feel it's useful!
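For anyone wanting to repeat a sweep like this, here is a minimal Python sketch of the parameter grid, using the values that appear in the n_batch, n_ubatch and amb columns of the table. The binary name `llama-bench`, the model path, and the `-m`, `-ngl`, `-p` and `-n` spellings are assumptions on my part, not taken from the post; check `--help` on your build before running anything.

```python
from itertools import product

# Values read off the n_batch / n_ubatch / amb / test columns of the table.
N_BATCH = (2048, 4096)
N_UBATCH = (512, 1024, 2048, 4096)
AMB = (1024, 512, 128, 64, 32)
PP = (512, 1024, 2048, 4096, 8192)
TG = (128, 256, 512, 1024, 2048)


def sweep_commands(model_path, binary="llama-bench"):
    """Yield one argument list per (-b, -ub, -amb) combination.

    The binary name and the -m/-ngl/-p/-n flags are assumptions;
    -b, -ub, -amb, -mla 2, -fa and -fmoe come from the post itself.
    """
    for b, ub, amb in product(N_BATCH, N_UBATCH, AMB):
        yield [
            binary, "-m", model_path, "-ngl", "63",
            "-b", str(b), "-ub", str(ub),
            "-fa", "1", "-mla", "2", "-amb", str(amb), "-fmoe", "1",
            "-p", ",".join(map(str, PP)),
            "-n", ",".join(map(str, TG)),
        ]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    for cmd in sweep_commands("/path/to/model.gguf"):
        print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` one at a time, so that a single OOM does not take down the whole overnight run.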

model size params backend ngl n_batch n_ubatch fa mla amb fmoe test t/s
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp512 216.01 ± 4.70
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp1024 219.99 ± 2.45
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp2048 219.74 ± 1.46
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp4096 208.57 ± 0.58
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp8192 183.37 ± 0.73
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 tg128 17.22 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 tg256 17.84 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 tg512 18.06 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 tg1024 18.02 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 tg2048 17.74 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 pp512 238.55 ± 2.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 pp1024 235.57 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 pp2048 226.29 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 pp4096 208.86 ± 0.10
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 pp8192 182.56 ± 0.39
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 tg128 17.23 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 tg256 17.87 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 tg512 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 tg1024 18.01 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 512 1 tg2048 17.75 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 pp512 239.67 ± 1.22
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 pp1024 235.22 ± 1.85
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 pp2048 225.73 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 pp4096 207.66 ± 0.12
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 pp8192 179.22 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 tg128 17.25 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 tg256 17.85 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 tg512 18.05 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 tg1024 18.04 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 128 1 tg2048 17.77 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 pp512 239.69 ± 0.92
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 pp1024 235.48 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 pp2048 224.92 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 pp4096 205.77 ± 0.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 pp8192 176.72 ± 0.14
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 tg128 17.21 ± 0.08
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 tg256 17.85 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 tg512 18.05 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 tg1024 18.04 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 64 1 tg2048 17.77 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 pp512 236.20 ± 0.76
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 pp1024 233.43 ± 0.95
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 pp2048 222.88 ± 0.17
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 pp4096 203.34 ± 0.16
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 pp8192 173.21 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 tg128 17.27 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 tg256 17.85 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 tg512 18.06 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 tg1024 18.02 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 32 1 tg2048 17.79 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp512 238.70 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp1024 303.92 ± 1.82
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp2048 295.71 ± 0.91
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp4096 276.63 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp8192 244.18 ± 0.26
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 tg128 17.26 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 tg256 17.79 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 tg512 18.09 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 tg1024 18.04 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 tg2048 17.77 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 pp512 239.64 ± 1.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 pp1024 305.79 ± 0.40
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 pp2048 296.58 ± 0.75
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 pp4096 276.62 ± 0.54
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 pp8192 244.26 ± 0.31
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 tg128 17.27 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 tg256 17.88 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 tg512 18.09 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 tg1024 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 512 1 tg2048 17.70 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 pp512 238.73 ± 1.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 pp1024 304.83 ± 0.61
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 pp2048 295.23 ± 0.09
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 pp4096 275.28 ± 0.29
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 pp8192 239.76 ± 0.39
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 tg128 17.21 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 tg256 17.82 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 tg512 18.05 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 tg1024 18.01 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 128 1 tg2048 17.71 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 pp512 237.98 ± 0.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 pp1024 304.20 ± 0.22
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 pp2048 293.80 ± 1.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 pp4096 272.19 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 pp8192 235.64 ± 0.42
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 tg128 17.14 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 tg256 17.79 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 tg512 18.02 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 tg1024 18.00 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 64 1 tg2048 17.72 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 pp512 238.40 ± 1.47
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 pp1024 301.66 ± 1.64
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 pp2048 290.44 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 pp4096 267.12 ± 0.09
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 pp8192 229.98 ± 0.19
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 tg128 17.16 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 tg256 17.76 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 tg512 18.01 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 tg1024 17.97 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 32 1 tg2048 17.73 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp512 240.23 ± 1.70
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp1024 305.03 ± 0.60
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp2048 349.22 ± 0.37
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp4096 327.33 ± 0.82
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp8192 290.90 ± 0.26
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg128 17.21 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg256 17.84 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg512 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg1024 18.01 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg2048 17.74 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 pp512 239.12 ± 3.60
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 pp1024 305.13 ± 1.86
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 pp2048 349.84 ± 0.12
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 pp4096 328.46 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 pp8192 290.47 ± 0.23
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 tg128 17.24 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 tg256 17.81 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 tg512 18.02 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 tg1024 18.04 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 512 1 tg2048 17.79 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 pp512 238.52 ± 1.44
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 pp1024 304.77 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 pp2048 348.11 ± 0.69
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 pp4096 326.30 ± 0.69
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 pp8192 288.35 ± 0.12
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 tg128 17.24 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 tg256 17.88 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 tg512 18.07 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 tg1024 18.05 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 128 1 tg2048 17.77 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 pp512 238.42 ± 1.40
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 pp1024 304.32 ± 1.66
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 pp2048 344.70 ± 1.92
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 pp4096 323.64 ± 0.60
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 pp8192 283.02 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 tg128 17.22 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 tg256 17.86 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 tg512 18.06 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 tg1024 18.06 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 64 1 tg2048 17.79 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp512 236.64 ± 1.54
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp1024 301.44 ± 1.56
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp2048 343.13 ± 0.36
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp4096 317.60 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp8192 274.27 ± 0.22
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 tg128 17.28 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 tg256 17.89 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 tg512 18.08 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 tg1024 18.05 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 tg2048 17.78 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 pp512 238.37 ± 1.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 pp1024 304.95 ± 1.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 pp2048 349.14 ± 0.52
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 pp4096 327.89 ± 0.19
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 pp8192 291.05 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 tg128 17.25 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 tg256 17.81 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 tg512 18.06 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 tg1024 18.04 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 1024 1 tg2048 17.78 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 pp512 238.06 ± 0.70
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 pp1024 304.73 ± 0.74
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 pp2048 348.72 ± 1.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 pp4096 328.20 ± 0.51
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 pp8192 290.87 ± 0.49
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 tg128 17.27 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 tg256 17.88 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 tg512 18.09 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 tg1024 18.04 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 512 1 tg2048 17.72 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 pp512 239.80 ± 0.46
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 pp1024 306.38 ± 1.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 pp2048 348.17 ± 0.55
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 pp4096 325.50 ± 0.88
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 pp8192 288.20 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 tg128 17.25 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 tg256 17.83 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 tg512 18.10 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 tg1024 18.06 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 128 1 tg2048 17.76 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 pp512 237.92 ± 2.32
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 pp1024 304.37 ± 0.47
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 pp2048 347.09 ± 0.66
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 pp4096 323.48 ± 0.46
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 pp8192 283.28 ± 0.14
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 tg128 17.20 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 tg256 17.86 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 tg512 18.05 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 tg1024 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 64 1 tg2048 17.78 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 pp512 238.77 ± 2.73
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 pp1024 302.54 ± 0.90
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 pp2048 342.62 ± 0.56
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 pp4096 317.58 ± 0.10
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 pp8192 274.23 ± 0.40
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 tg128 17.27 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 tg256 17.88 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 tg512 18.09 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 tg1024 17.98 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 4096 1 2 32 1 tg2048 17.78 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 pp512 240.30 ± 2.99
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 pp1024 236.20 ± 1.81
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 pp2048 226.46 ± 0.49
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 pp4096 209.52 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 pp8192 183.03 ± 0.23
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 tg128 17.24 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 tg256 17.89 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 tg512 18.08 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 tg1024 18.06 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 1024 1 tg2048 17.77 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 pp512 238.21 ± 0.99
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 pp1024 236.32 ± 1.53
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 pp2048 225.41 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 pp4096 209.14 ± 0.30
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 pp8192 182.42 ± 0.08
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 tg128 17.24 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 tg256 17.86 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 tg512 18.09 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 tg1024 18.06 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 512 1 tg2048 17.78 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 pp512 239.31 ± 0.11
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 pp1024 234.58 ± 0.88
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 pp2048 224.77 ± 0.60
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 pp4096 207.35 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 pp8192 178.79 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 tg128 17.26 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 tg256 17.88 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 tg512 18.07 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 tg1024 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 128 1 tg2048 17.78 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 pp512 239.12 ± 0.21
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 pp1024 235.30 ± 1.41
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 pp2048 224.94 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 pp4096 206.20 ± 0.28
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 pp8192 176.54 ± 0.17
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 tg128 17.29 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 tg256 17.86 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 tg512 18.07 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 tg1024 17.99 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 64 1 tg2048 17.72 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 pp512 238.94 ± 0.70
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 pp1024 233.23 ± 0.45
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 pp2048 222.40 ± 0.23
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 pp4096 203.04 ± 0.51
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 pp8192 173.09 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 tg128 17.25 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 tg256 17.89 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 tg512 18.06 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 tg1024 18.04 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 512 1 2 32 1 tg2048 17.76 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 pp512 239.80 ± 0.48
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 pp1024 305.07 ± 0.33
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 pp2048 295.09 ± 0.13
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 pp4096 275.70 ± 0.25
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 pp8192 243.52 ± 0.27
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 tg128 17.25 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 tg256 17.87 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 tg512 18.03 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 tg1024 17.97 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 1024 1 tg2048 17.72 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 pp512 241.05 ± 0.59
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 pp1024 304.85 ± 1.84
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 pp2048 295.04 ± 0.48
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 pp4096 276.20 ± 0.08
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 pp8192 243.36 ± 0.27
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 tg128 17.17 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 tg256 17.79 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 tg512 18.00 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 tg1024 17.98 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 512 1 tg2048 17.76 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 pp512 238.47 ± 0.34
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 pp1024 305.42 ± 1.32
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 pp2048 295.28 ± 0.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 pp4096 274.18 ± 0.37
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 pp8192 239.55 ± 0.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 tg128 17.27 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 tg256 17.85 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 tg512 17.99 ± 0.06
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 tg1024 18.04 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 128 1 tg2048 17.77 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 pp512 239.49 ± 0.90
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 pp1024 303.09 ± 1.76
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 pp2048 292.21 ± 1.47
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 pp4096 271.27 ± 0.16
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 pp8192 234.84 ± 0.11
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 tg128 17.23 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 tg256 17.83 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 tg512 18.06 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 tg1024 18.05 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 64 1 tg2048 17.73 ± 0.05
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 pp512 238.09 ± 1.33
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 pp1024 302.10 ± 0.35
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 pp2048 289.34 ± 0.51
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 pp4096 266.76 ± 0.16
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 pp8192 229.52 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 tg128 17.29 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 tg256 17.80 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 tg512 18.07 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 tg1024 18.04 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 1024 1 2 32 1 tg2048 17.74 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 pp512 239.40 ± 0.85
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 pp1024 304.81 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 pp2048 348.47 ± 1.08
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 pp4096 327.77 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 pp8192 290.58 ± 0.18
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 tg128 17.26 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 tg256 17.86 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 tg512 18.08 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 tg1024 18.01 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 1024 1 tg2048 17.67 ± 0.11
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 pp512 239.10 ± 1.34
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 pp1024 304.24 ± 2.13
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 pp2048 348.34 ± 0.82
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 pp4096 327.32 ± 0.20
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 pp8192 290.58 ± 0.09
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 tg128 17.27 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 tg256 17.83 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 tg512 18.06 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 tg1024 18.04 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 512 1 tg2048 17.71 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 pp512 239.16 ± 0.38
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 pp1024 304.15 ± 0.87
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 pp2048 347.30 ± 0.52
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 pp4096 325.70 ± 0.67
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 pp8192 287.87 ± 0.21
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 tg128 17.20 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 tg256 17.82 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 tg512 18.04 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 tg1024 18.01 ± 0.00
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 128 1 tg2048 17.72 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 pp512 240.31 ± 3.17
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 pp1024 303.77 ± 1.31
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 pp2048 346.19 ± 0.76
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 pp4096 323.25 ± 0.24
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 pp8192 282.42 ± 0.07
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 tg128 17.18 ± 0.12
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 tg256 17.79 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 tg512 17.99 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 tg1024 18.02 ± 0.02
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 64 1 tg2048 17.78 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 pp512 237.68 ± 1.86
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 pp1024 302.20 ± 1.45
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 pp2048 342.06 ± 0.96
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 pp4096 317.32 ± 0.50
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 pp8192 273.87 ± 0.54
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 tg128 17.28 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 tg256 17.85 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 tg512 18.03 ± 0.03
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 tg1024 18.04 ± 0.04
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 2048 1 2 32 1 tg2048 17.77 ± 0.01
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 4096 1 2 1024 1 pp512 238.93 ± 0.91
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 4096 1 2 1024 1 pp1024 305.36 ± 0.21
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 4096 1 2 1024 1 pp2048 348.42 ± 0.27
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 4096 4096 1 2 1024 1 pp4096 346.42 ± 0.52

Feel free to create whatever graphs you find interesting from it. With this much data, it's quite hard to isolate individual effects:

PP

(three plot images)

TG shows no notable difference.
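Since the data is hard to isolate by eye, here is a rough Python sketch for slicing it. It parses rows in the flattened layout used above and reports the best t/s per n_ubatch for one test. The sample rows are copied from the table; the function names are my own, and the column positions are counted from the end of each row, which assumes every row has the same trailing layout.

```python
from collections import defaultdict

# A handful of rows copied verbatim from the table above.
SAMPLE_ROWS = """\
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 512 1 2 1024 1 pp8192 183.37 ± 0.73
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 1024 1 2 1024 1 pp8192 244.18 ± 0.26
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 pp8192 290.90 ± 0.26
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 32 1 pp8192 274.27 ± 0.22
deepseek2 671B Q8_0 307.20 GiB 672.05 B CUDA 63 2048 2048 1 2 1024 1 tg128 17.21 ± 0.01
"""


def parse_row(line):
    """Split one whitespace-separated row; fields are indexed from the end."""
    tok = line.split()
    return {
        "n_batch": int(tok[-10]),
        "n_ubatch": int(tok[-9]),
        "amb": int(tok[-6]),
        "test": tok[-4],
        "tps": float(tok[-3]),
        "stddev": float(tok[-1]),
    }


def best_by_ubatch(rows, test):
    """Return {n_ubatch: highest t/s} for the given test name."""
    best = defaultdict(float)
    for r in rows:
        if r["test"] == test:
            best[r["n_ubatch"]] = max(best[r["n_ubatch"]], r["tps"])
    return dict(best)


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]
print(best_by_ubatch(rows, "pp8192"))
```

On these sample rows the pp8192 figure climbs with n_ubatch (183.37, 244.18, 290.90 t/s), while the tg row stays around 17 t/s, which matches the "no notable difference" remark for TG.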


🗣️ Discussion

👤 davidsyoung replied the 2025-03-18 at 09:37:29:

Mixed quant of Q8 for attn, Q5 down / IQ4_XS up|gate for layers 3-8, and IQ4_XS down / IQ3_S up|gate for layers 9-60.

| Component | Blocks 0-2 | Blocks 3-8 | Blocks 9-60 |
|---|---|---|---|
| Attention Query/Key/Value | q8_0 | q8_0 | q8_0 |
| Attention Output | q8_0 | q8_0 | q8_0 |
| FFN Down (regular) | q8_0 | - | - |
| FFN Gate/Up (regular) | q8_0 | - | - |
| FFN Down Shared Experts | - | q5_K | q5_K |
| FFN Gate/Up Shared Experts | - | q5_K | q5_K |
| FFN Down Experts | - | q5_K | iq4_xs |
| FFN Gate/Up Experts | - | iq4_xs | iq3_s |
| Output Layer | q8_0 | q8_0 | q8_0 |

Compression Results

- Original size: 1,282,038 MB (~1.2 TB)
- Quantized size: 314,569 MB (~307 GB)
- Compression ratio: 4.1x
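
The compression figures can be checked with a few lines of arithmetic. This sketch assumes the "MB" values are mebibytes, which makes the quantized size line up with the 307.20 GiB shown in the benchmark table, and it takes the 672.05 B parameter count from that table. The bits-per-weight figure is a derived estimate, not a number reported in the thread.

```python
MIB = 1024 ** 2            # bytes per mebibyte (assumed unit of the "MB" values)

original_mib = 1_282_038   # original size, from the post
quantized_mib = 314_569    # quantized size, from the post
n_params = 672.05e9        # parameter count, from the benchmark table

ratio = original_mib / quantized_mib
quantized_gib = quantized_mib / 1024
# Average bits stored per weight across the whole mixed quant (derived estimate).
bits_per_weight = quantized_mib * MIB * 8 / n_params

print(f"compression ratio: {ratio:.1f}x")
print(f"quantized size:    {quantized_gib:.2f} GiB")
print(f"bits per weight:   {bits_per_weight:.2f}")
```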

PPL

perplexity: tokenizing the input ..
perplexity: tokenization took 1195.26 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 11.69 seconds per pass - ETA 27.32 minutes
[1]2.5779,[2]3.3447,[3]2.4073,[4]2.0140,[5]1.8352,[6]1.6862,[7]1.5895,[8]1.5208,[9]1.4715,[10]1.4284,[11]1.4147,[12]1.4406,[13]1.4529,[14]1.5824,[15]1.7144,[16]1.7752,[17]1.9408,[18]2.0703,[19]2.0333,[20]2.0250,[21]2.1305,[22]2.1021,[23]2.0764,[24]2.0880,[25]2.0581,[26]2.0330,[27]2.0797,[28]2.0888,[29]2.1391,[30]2.1698,[31]2.2044,[32]2.2227,[33]2.2626,[34]2.3049,[35]2.3566,[36]2.4115,[37]2.4463,[38]2.4930,[39]2.5346,[40]2.5926,[41]2.6353,[42]2.6458,[43]2.6948,[44]2.7107,[45]2.7909,[46]2.8420,[47]2.8003,[48]2.7549,[49]2.7298,[50]2.7498,[51]2.7964,[52]2.8105,[53]2.8597,[54]2.8734,[55]2.9047,[56]2.9384,[57]2.9550,[58]2.9926,[59]3.0027,[60]3.0502,[61]3.0906,[62]3.1475,[63]3.1812,[64]3.2262,[65]3.2360,[66]3.2179,[67]3.1954,[68]3.2271,[69]3.2225,[70]3.2377,[71]3.2562,[72]3.2726,[73]3.2860,[74]3.3095,[75]3.2881,[76]3.2396,[77]3.1959,[78]3.1931,[79]3.1728,[80]3.1563,[81]3.1190,[82]3.1220,[83]3.0918,[84]3.0554,[85]3.0218,[86]2.9995,[87]2.9958,[88]2.9686,[89]2.9537,[90]2.9261,[91]2.8966,[92]2.8704,[93]2.8441,[94]2.8196,[95]2.7964,[96]2.7947,[97]2.8024,[98]2.7882,[99]2.7728,[100]2.7752,[101]2.7671,[102]2.7843,[103]2.8105,[104]2.8288,[105]2.8261,[106]2.8486,[107]2.8737,[108]2.8953,[109]2.9296,[110]2.9637,[111]2.9837,[112]2.9567,[113]2.9436,[114]2.9207,[115]2.9047,[116]2.8905,[117]2.8672,[118]2.8450,[119]2.8235,[120]2.8040,[121]2.7884,[122]2.7698,[123]2.7532,[124]2.7334,[125]2.7156,[126]2.6981,[127]2.6840,[128]2.6757,[129]2.6662,[130]2.6551,[131]2.6472,[132]2.6548,[133]2.6649,[134]2.6714,[135]2.6822,[136]2.6990,[137]2.7145,[138]2.7231,[139]2.7348,[140]2.7353,[141]2.7368,[142]2.7356,[143]2.7359,[144]2.7320,[145]2.7228,[146]2.7211,[147]2.7254,[148]2.7248,[149]2.7265,[150]2.7210,[151]2.7192,[152]2.7157,[153]2.7114,[154]2.7119,[155]2.7159,[156]2.7180,[157]2.7237,[158]2.7322,[159]2.7339,[160]2.7428,[161]2.7509,[162]2.7605,[163]2.7660,[164]2.7863,[165]2.8095,[166]2.8270,[167]2.8399,[168]2.8647,[169]2.8872,[170]2.9083,[171]2.9311,[172]2.9150,[173]2.8980,[174]2.8843,[175]2.8712,[176]2.8589,
[177]2.8467,[178]2.8338,[179]2.8193,[180]2.8228,[181]2.8370,[182]2.8519,[183]2.8669,[184]2.8813,[185]2.8915,[186]2.9083,[187]2.9241,[188]2.9381,[189]2.9489,[190]2.9490,[191]2.9561,[192]2.9601,[193]2.9652,[194]2.9848,[195]2.9935,[196]3.0068,[197]3.0167,[198]3.0211,[199]3.0267,[200]3.0261,[201]3.0415,[202]3.0361,[203]3.0413,[204]3.0446,[205]3.0447,[206]3.0468,[207]3.0552,[208]3.0645,[209]3.0737,[210]3.0738,[211]3.0688,[212]3.0689,[213]3.0765,[214]3.0781,[215]3.0837,[216]3.0847,[217]3.0805,[218]3.0804,[219]3.0811,[220]3.0800,[221]3.0803,[222]3.0803,[223]3.0805,[224]3.0856,[225]3.0871,[226]3.0791,[227]3.0772,[228]3.0792,[229]3.0835,[230]3.0900,[231]3.0962,[232]3.0880,[233]3.0801,[234]3.0803,[235]3.0787,[236]3.0879,[237]3.0957,[238]3.1050,[239]3.1151,[240]3.1241,[241]3.1353,[242]3.1498,[243]3.1632,[244]3.1713,[245]3.1831,[246]3.1937,[247]3.1927,[248]3.1884,[249]3.1867,[250]3.1804,[251]3.1782,[252]3.1805,[253]3.1841,[254]3.1910,[255]3.1971,[256]3.2005,[257]3.2032,[258]3.2042,[259]3.2076,[260]3.2098,[261]3.2107,[262]3.2099,[263]3.2158,[264]3.2179,[265]3.2182,[266]3.2199,[267]3.2230,[268]3.2267,[269]3.2298,[270]3.2290,[271]3.2271,[272]3.2205,[273]3.2208,[274]3.2143,[275]3.2037,[276]3.1934,[277]3.1951,[278]3.2052,[279]3.2115,[280]3.2195,[281]3.2272,[282]3.2333,[283]3.2398,[284]3.2466,[285]3.2603,[286]3.2626,[287]3.2661,[288]3.2707,[289]3.2732,[290]3.2648,[291]3.2557,[292]3.2544,[293]3.2536,[294]3.2513,[295]3.2487,[296]3.2507,[297]3.2513,[298]3.2562,[299]3.2620,[300]3.2651,[301]3.2691,[302]3.2713,[303]3.2734,[304]3.2726,[305]3.2845,[306]3.2922,[307]3.3033,[308]3.2916,[309]3.2865,[310]3.2769,[311]3.2804,[312]3.2825,[313]3.2893,[314]3.2915,[315]3.2946,[316]3.2959,[317]3.2974,[318]3.2979,[319]3.2982,[320]3.3026,[321]3.3028,[322]3.3042,[323]3.3106,[324]3.3112,[325]3.3167,[326]3.3214,[327]3.3255,[328]3.3282,[329]3.3297,[330]3.3360,[331]3.3396,[332]3.3443,[333]3.3428,[334]3.3425,[335]3.3428,[336]3.3429,[337]3.3437,[338]3.3441,[339]3.3466,[340]3.3502,[341]3.3555,[342]3.3649,
[343]3.3744,[344]3.3797,[345]3.3713,[346]3.3640,[347]3.3597,[348]3.3523,[349]3.3488,[350]3.3471,[351]3.3521,[352]3.3671,[353]3.3761,[354]3.3892,[355]3.3977,[356]3.4029,[357]3.4148,[358]3.4246,[359]3.4279,[360]3.4346,[361]3.4439,[362]3.4526,[363]3.4586,[364]3.4649,[365]3.4715,[366]3.4822,[367]3.4909,[368]3.4975,[369]3.5054,[370]3.5138,[371]3.5277,[372]3.5368,[373]3.5401,[374]3.5435,[375]3.5485,[376]3.5616,[377]3.5727,[378]3.5754,[379]3.5749,[380]3.5715,[381]3.5762,[382]3.5816,[383]3.5853,[384]3.5894,[385]3.5931,[386]3.5996,[387]3.6055,[388]3.6087,[389]3.5980,[390]3.5883,[391]3.5774,[392]3.5715,[393]3.5623,[394]3.5535,[395]3.5438,[396]3.5336,[397]3.5245,[398]3.5146,[399]3.5042,[400]3.4963,[401]3.4863,[402]3.4756,[403]3.4668,[404]3.4563,[405]3.4465,[406]3.4364,[407]3.4270,[408]3.4178,[409]3.4090,[410]3.4031,[411]3.4038,[412]3.3993,[413]3.4012,[414]3.4038,[415]3.4009,[416]3.4009,[417]3.4034,[418]3.3979,[419]3.3991,[420]3.3966,[421]3.3953,[422]3.3970,[423]3.3964,[424]3.4006,[425]3.4005,[426]3.4009,[427]3.3997,[428]3.4021,[429]3.4037,[430]3.4064,[431]3.4074,[432]3.4064,[433]3.4027,[434]3.4028,[435]3.3956,[436]3.3891,[437]3.3851,[438]3.3833,[439]3.3805,[440]3.3855,[441]3.3905,[442]3.3979,[443]3.3964,[444]3.3972,[445]3.3983,[446]3.4029,[447]3.4058,[448]3.4083,[449]3.4114,[450]3.4154,[451]3.4184,[452]3.4206,[453]3.4223,[454]3.4208,[455]3.4229,[456]3.4232,[457]3.4257,[458]3.4311,[459]3.4317,[460]3.4318,[461]3.4284,[462]3.4322,[463]3.4396,[464]3.4448,[465]3.4381,[466]3.4361,[467]3.4344,[468]3.4355,[469]3.4328,[470]3.4301,[471]3.4304,[472]3.4311,[473]3.4304,[474]3.4295,[475]3.4308,[476]3.4290,[477]3.4282,[478]3.4288,[479]3.4307,[480]3.4334,[481]3.4290,[482]3.4325,[483]3.4316,[484]3.4353,[485]3.4416,[486]3.4444,[487]3.4479,[488]3.4531,[489]3.4555,[490]3.4603,[491]3.4665,[492]3.4709,[493]3.4707,[494]3.4719,[495]3.4746,[496]3.4764,[497]3.4794,[498]3.4798,[499]3.4790,[500]3.4832,[501]3.4877,[502]3.4865,[503]3.4849,[504]3.4871,[505]3.4905,[506]3.4988,[507]3.5016,[508]3.5050,[509]3.4973,
[510]3.4914,[511]3.4851,[512]3.4810,[513]3.4750,[514]3.4738,[515]3.4761,[516]3.4714,[517]3.4713,[518]3.4704,[519]3.4710,[520]3.4755,[521]3.4744,[522]3.4730,[523]3.4790,[524]3.4775,[525]3.4761,[526]3.4715,[527]3.4663,[528]3.4628,[529]3.4599,[530]3.4568,[531]3.4536,[532]3.4479,[533]3.4415,[534]3.4370,[535]3.4382,[536]3.4410,[537]3.4443,[538]3.4469,[539]3.4496,[540]3.4550,[541]3.4584,[542]3.4607,[543]3.4552,[544]3.4512,[545]3.4508,[546]3.4440,[547]3.4374,[548]3.4307,[549]3.4240,[550]3.4178,[551]3.4116,[552]3.4060,[553]3.4002,[554]3.3983,[555]3.3970,[556]3.3998,[557]3.4039,[558]3.4098,[559]3.4145,[560]3.4197,[561]3.4178,
Final estimate: PPL = 3.4178 +/- 0.01891

👤 fredlas replied the 2025-03-19 at 15:49:40:
Were you thinking of uploading this to huggingface, by any chance? I can reproduce and upload it myself if necessary, but I haven't downloaded the full R1 weights yet, and would be happy to continue avoiding that if possible!

👤 ubergarm replied the 2025-03-19 at 22:37:04:
@fredlas do you have any specific hardware configuration in mind? e.g. how much system RAM, and GPUs / VRAM? I put together rough notes on making your own custom quant in this quick-start guide discussion. I believe @davidsyoung has tailored the quant specific to his 16x3090 = 384 GB VRAM setup.

I've made a couple of quants now, including a decent one for a 256GB RAM + 24GB VRAM single-GPU configuration that has better perplexity than unsloth UD-Q2_K_XL but is just a little bit slower. I'm still experimenting to see how the various types affect generation speed vs perplexity while fitting inside the envelope of my current hardware.

You can get started with ik_llama.cpp, including -mla 2 and repacked quants, right now using an existing unsloth quant or whatever you already have. (sorry if you already know this, I'm still new here!) Cheers!

👤 davidsyoung replied the 2025-03-19 at 23:18:56:
I might be able to upload if you give me enough time. However, I actually recommend getting used to quanting, as there's a lot of tweaking you may want to do.

For example, I don't actually think this quant suits my setup best yet, and I'm actually underutilising one GPU. I just haven't found a way to split the layers that well yet.

👤 fredlas replied the 2025-03-21 at 02:37:16:
@ubergarm 307GiB happens to be right around the size I'm thinking of. 72GiB VRAM + 256GiB RAM, for queuing up jobs to run overnight with 16k context - should just fit in there, I think. Funny coincidence for an extremely different configuration! Thanks for that guide - I made my own quants of Wizard2 8x22B a while back, but long enough ago that I was probably going to have to basically relearn it.

@davidsyoung I'd say don't upload them just for my sake if you weren't already planning to - I just thought I'd check in case I could stay lazy. Plus this size range is probably pretty niche anyways; might not really be worth it in terms of helping people.


👤 ikawrakow replied the 2025-03-18 at 09:44:15:

Thank you for this. I think it can be really useful for people.


👤 saood06 replied the 2025-03-18 at 20:14:25:

@ikawrakow Can I convert this to a discussion?


👤 davidsyoung replied the 2025-03-18 at 20:19:37:

All good with me @saood06


👤 ikawrakow replied the 2025-03-18 at 20:29:32:

> @ikawrakow Can I convert this to a discussion?

Sure, go ahead