# 🔀 #494 - IQ1_M_R4 CUDA implementation
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-05 |
| Updated | 2025-06-05 |
## Description
To help the quest for the world's smallest DeepSeek model, this PR adds a CUDA implementation for `IQ1_M_R4`.
GEMM is done via dequantize+cuBLAS, so it may require building with `cmake -DGGML_CUDA_IQK_FORCE_BF16=ON`.
Performance is on par with, or even a tiny bit better than, `IQ1_M`.
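For context, here is a minimal sketch of what a dequantize+cuBLAS matrix multiplication looks like. The kernel below is a toy placeholder (the real IQ1_M_R4 unpacking, with its codebooks and row interleaving, lives in the ggml CUDA backend), and `gemm_iq1_m_r4_sketch` is an illustrative name, not the function in this PR; the bf16 data/compute types are what the `GGML_CUDA_IQK_FORCE_BF16` option selects:

```cpp
#include <cstdint>
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cuda_fp16.h>

// Toy placeholder: packed sign bits with one fp16 scale per 32 weights,
// dequantized as w = ±scale. The real IQ1_M_R4 block layout is more involved.
struct toy_block { half scale; uint32_t bits; };

__global__ void dequantize_to_bf16(const toy_block * q, __nv_bfloat16 * dst, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const toy_block b = q[i / 32];
    const float s = __half2float(b.scale);
    dst[i] = __float2bfloat16(((b.bits >> (i % 32)) & 1) ? s : -s);
}

// Sketch: C = W^T * X with W stored quantized. Dequantize W into a bf16
// scratch buffer, then hand the plain GEMM to cuBLAS.
void gemm_iq1_m_r4_sketch(cublasHandle_t handle, cudaStream_t stream,
                          const toy_block * W_q,       // quantized weights, k x m
                          __nv_bfloat16 * W_scratch,   // device scratch, k*m elements
                          const __nv_bfloat16 * X,     // activations, k x n (column-major)
                          float * C,                   // output, m x n (column-major)
                          int m, int n, int k) {
    const int64_t nelem = (int64_t) m * k;
    const int threads = 256;
    dequantize_to_bf16<<<(nelem + threads - 1) / threads, threads, 0, stream>>>(
        W_q, W_scratch, nelem);

    // bf16 inputs with fp32 accumulation; forcing bf16 instead of fp16
    // avoids fp16 range issues seen with some models (e.g. DeepSeek).
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 W_scratch, CUDA_R_16BF, k,
                 X,         CUDA_R_16BF, k,
                 &beta,
                 C,         CUDA_R_32F,  m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

This only illustrates the data flow; the real backend handles the scratch allocation and tiling details omitted here.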
Here is a sweep bench for LLaMA-3-8B on an RTX 4080.
**IQ1_M**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 0.347 | 5909.51 | 2.466 | 207.66 |
| 2048 | 512 | 2048 | 0.329 | 6216.59 | 2.657 | 192.69 |
| 2048 | 512 | 4096 | 0.356 | 5745.00 | 2.928 | 174.88 |
| 2048 | 512 | 6144 | 0.384 | 5332.11 | 3.162 | 161.91 |
| 2048 | 512 | 8192 | 0.411 | 4983.68 | 3.380 | 151.50 |
| 2048 | 512 | 10240 | 0.438 | 4678.79 | 3.634 | 140.88 |
| 2048 | 512 | 12288 | 0.466 | 4398.46 | 3.830 | 133.68 |
| 2048 | 512 | 14336 | 0.494 | 4149.40 | 4.095 | 125.03 |
**IQ1_M_R4 (PR)**
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 0.338 | 6058.78 | 2.440 | 209.81 |
| 2048 | 512 | 2048 | 0.323 | 6337.42 | 2.639 | 193.99 |
| 2048 | 512 | 4096 | 0.350 | 5859.50 | 2.914 | 175.71 |
| 2048 | 512 | 6144 | 0.379 | 5409.73 | 3.151 | 162.47 |
| 2048 | 512 | 8192 | 0.405 | 5054.63 | 3.371 | 151.90 |
| 2048 | 512 | 10240 | 0.432 | 4742.62 | 3.618 | 141.52 |
| 2048 | 512 | 12288 | 0.458 | 4471.08 | 3.804 | 134.59 |
| 2048 | 512 | 14336 | 0.486 | 4210.13 | 4.067 | 125.90 |
## 💬 Conversation
👤 **ubergarm** commented on 2025-06-05 at 15:26:27:
Amazing, you've done it! The pieces of the puzzle are in place. Congrats, ik, on the world's smallest working DeepSeek-R1-0528 quant! 🎉
With the new DDR5 2×64GB DIMM kits becoming available, an AM5 gaming-class rig + GPU can just barely fit this little beast!
I'm going to double-check that llama-perplexity still runs clean, but great speed with partial offload is now working!
### Commands and Logs
**Pull and build:**

```bash
$ git branch | grep '*'
* ik/cuda_iq1_m_r4
$ git rev-parse --short HEAD
8ed7825f
$ cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
$ cmake --build ./build --config Release -j $(nproc)
```
**llama-sweep-bench:**

```bash
model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
./build/bin/llama-sweep-bench \
    --model "$model" \
    -c 16384 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA0" \
    -ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -b 4096 -ub 4096 \
    --warmup-batch \
    --threads 24
```
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q4_0: 61 tensors
llama_model_loader: - type iq4_ks: 551 tensors
llama_model_loader: - type iq1_s_r4: 116 tensors
llama_model_loader: - type iq1_m_r4: 58 tensors
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = IQ1_S_R4 - 1.5 bpw
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 130.203 GiB (1.664 BPW)
llm_load_print_meta: repeating layers = 129.285 GiB (1.657 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek R1 0528
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 5994.06 MiB
llm_load_tensors: CPU buffer size = 44211.82 MiB
llm_load_tensors: CPU buffer size = 469.99 MiB
llm_load_tensors: CUDA0 buffer size = 42859.65 MiB
llm_load_tensors: CUDA1 buffer size = 43061.37 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 576.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 522.00 MiB
llama_new_context_with_model: KV self size = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 2824.02 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2520.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 368.05 MiB
llama_new_context_with_model: graph nodes = 5500
llama_new_context_with_model: graph splits = 111
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 9.959 | 411.28 | 70.744 | 14.47 |
| 4096 | 1024 | 4096 | 12.460 | 328.73 | 73.277 | 13.97 |
| 4096 | 1024 | 8192 | 14.947 | 274.04 | 76.418 | 13.40 |
| 4096 | 1024 | 12288 | 17.442 | 234.84 | 78.654 | 13.02 |