ik_llama.cpp/github-data/pull_requests/344 - Add GLM-4-0414 Model Support.md

🔀 #344 - Add GLM-4-0414 Model Support

Author ubergarm
State Closed
Created 2025-04-24
Updated 2025-05-08

Description

This is my second attempt, which still has some issues. The original attempt was #333. This one is based on https://github.com/ggml-org/llama.cpp/pull/12867; however, this PR does not bring over any of the Python changes.

In limited testing with bartowski/THUDM_GLM-Z1-32B-0414-GGUF on the CPU-only and CUDA backends, it seems to work as long as:

  1. Flash attention is explicitly enabled, e.g. -fa
  2. If using CUDA, fewer than 60 layers are offloaded (it works up to -ngl 59).

Example Command

This is one way to run it on CUDA that seems to work:

./build/bin/llama-server \
    --alias "bartowski/THUDM_GLM-Z1-32B-0414-IQ4_XS" \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa \
    --ctx-size 8192 \
    --n-gpu-layers 59 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080

If I increase --n-gpu-layers 60 or higher, it outputs GGGGGGGGGGGGGGG.

It also seems okay to add -amb 512 -ctk q8_0 -ctv q8_0...

FWIW, there still seem to be some possibly related issues in the mainline implementation:

So I'll mark this as draft for now and see how things are looking soon.


💬 Conversation

👤 ikawrakow commented the 2025-04-25 at 07:29:50:

If I increase --n-gpu-layers 60 or higher, it outputs GGGGGGGGGGGGGGG.

Does it also happen when you use -ctk q8_0 -ctv q8_0? There is this PR in mainline where they want to force f32 for cuBLAS matrix multiplications (those get used for attention calculations when KV cache is f16) to get meaningful results out of GLM-4. This indicates that f16 may not have enough range to accommodate the GLM-4 numerical range. In such cases using quantized cache may help.
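(Illustrative aside, not part of the original comment: a minimal toy program, assuming a build where ggml.h is on the include path, showing how values beyond the f16 range overflow to inf and then turn into NaN; this is the failure mode that typically shows up as the GGGG... output described above.)

#include <cmath>
#include <cstdio>
#include "ggml.h"  // ggml_fp32_to_fp16 / ggml_fp16_to_fp32 conversion helpers

int main() {
    // 70000 is above the f16 maximum of 65504, so the round trip should overflow to +inf
    float x = 70000.0f;
    float y = ggml_fp16_to_fp32(ggml_fp32_to_fp16(x));
    std::printf("%g -> f16 -> %g (inf: %d)\n", x, y, std::isinf(y));

    // inf - inf (e.g. subtracting the row max inside softmax) produces NaN
    float d = y - y;
    std::printf("y - y = %g (nan: %d)\n", d, std::isnan(d));
    return 0;
}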


👤 ubergarm commented the 2025-04-25 at 14:37:32:

Hrrm, unfortunately no: using -ctk q8_0 -ctv q8_0 with -ngl 60 (or higher) still throws GGGGGGGG...

Without -fa and with <60 layers offloaded, it looks like this: 转 Cuomo. кури....bl的话 the", E neuronal,,-T...<2E> l -氏 Blarnalc见<63>.flow总 in杯商house^C

I also tried city96's patch to force f32 dtype as shown below, but that didn't seem to fix this issue either:

--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9188,6 +9188,10 @@ static struct ggml_tensor * llm_build_ffn(

     if (down) {
         cur = llm_build_lora_mm(lctx, ctx, down, cur);
+        if (lctx.model.arch == LLM_ARCH_GLM4) {
+            // GLM4 seems to have numerical issues with half-precision accumulators
+            ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
+        }
     }

Could it be that I made a mistake in the build_glm4() attention cgraph? Interestingly, this invocation seems to work fine too:

./build/bin/llama-server \
    --alias "bartowski/THUDM_GLM-Z1-32B-0414-IQ4_XS" \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    -ot attn=CPU \
    -nkvo \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080

Last observations: mainline seems to work fine with or without -fa, but mainline is also much slower even when fully offloaded, e.g. 20 tok/sec PP and 5 tok/sec TG, compared to ik_llama.cpp getting 163 tok/sec PP and 17 tok/sec TG with -ot attn=CPU -nkvo, and even faster at 271 tok/sec PP and 25 tok/sec TG with -ngl 59...

Not sure what to try next other than digging deeper into how build_inp_KQ_mask() and llm_build_kv() have changed with the mainline refactors...


👤 ikawrakow commented the 2025-04-25 at 14:43:20:

Could it be that I made a mistake in the build_glm4() attention cgraph? Interestingly, this invocation seems to work fine too:

If you had made a mistake building the graph, this invocation wouldn't be working. If it works with all layers offloaded to the GPU except the attention tensors and KV cache, it means there is a precision issue in the attention calculation on CUDA (on the CPU everything is computed with fp32 precision).


👤 ubergarm commented the 2025-04-25 at 14:48:42:

I just noticed one more odd thing: trying -ot attn=CPU -ot .*=CUDA0 on ik_llama.cpp, it prints this out on startup and then crashes. There seem to be two __missing__ tensor types per layer...

Tensor token_embd.weight buffer type overriden to CPU
Tensor output_norm.weight buffer type overriden to CPU
Tensor output.weight buffer type overriden to CPU
Tensor blk.0.attn_norm.weight buffer type overriden to CPU
Tensor __missing__ buffer type overriden to CPU
Tensor __missing__ buffer type overriden to CPU
Tensor blk.0.attn_q.weight buffer type overriden to CPU
Tensor blk.0.attn_k.weight buffer type overriden to CPU
Tensor blk.0.attn_v.weight buffer type overriden to CPU
Tensor blk.0.attn_q.bias buffer type overriden to CPU
Tensor blk.0.attn_k.bias buffer type overriden to CPU
Tensor blk.0.attn_v.bias buffer type overriden to CPU
Tensor blk.0.attn_output.weight buffer type overriden to CPU
Tensor blk.0.post_attention_norm.weight buffer type overriden to CPU
Tensor blk.0.ffn_norm.weight buffer type overriden to CPU
Tensor blk.0.ffn_down.weight buffer type overriden to CPU
Tensor blk.0.ffn_up.weight buffer type overriden to CPU
Tensor blk.0.post_ffw_norm.weight buffer type overriden to CPU

👤 ikawrakow commented the 2025-04-25 at 14:50:36:

Try this: in the function llm_build_kqv(), on all lines that have

if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX etc

add || model.arch == LLM_ARCH_GLM4.

This will set the precision of the K*Q calculation to fp32, and hopefully fix the issue when all layers are offloaded to the GPU.


👤 ikawrakow commented the 2025-04-25 at 14:57:27:

I see that in mainline llama.cpp they have become tired of setting the K*Q calculation to fp32 precision for specific models, and now have this:

        ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);

        // note: this op tends to require high floating point range
        //       while for some models F16 is enough, for others it is not, so we default to F32 here
        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);

This is why mainline may be working for this model. I still refuse to set that generically for all models as this hurts performance for long contexts quite a bit. The downside is that one needs to explicitly enable fp32 precision when necessary for a model.
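(A minimal sketch of what the per-model alternative described above looks like; this is a paraphrase rather than the exact ik_llama.cpp source, using only names that already appear in this thread.)

// Instead of forcing fp32 for every model the way mainline now does, only opt
// affected architectures into higher precision for the K*Q multiplication.
struct ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
if (model.arch == LLM_ARCH_GLM4) {
    // GLM-4 activations exceed the f16 range, so request fp32 accumulation here
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
}
cb(kq, "kq", il);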


👤 ubergarm commented the 2025-04-25 at 15:01:49:

add || model.arch == LLM_ARCH_GLM4

Yes, this fixed the issue, I can fully offload now!

--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9415,7 +9415,7 @@ static struct ggml_tensor * llm_build_kqv(
         // For DeepSeek-2, it is perfectly fine with fp16 for PP, but I get gibberish when uding fp16 for TG.
         // Not sure if it is really a matter of insufficient precision, or I have made a mistake in the fattn-vec-f16 kernel.
         if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX ||
-            (model.arch == LLM_ARCH_DEEPSEEK2 && q->ne[1] <= 8)) {
+            model.arch == LLM_ARCH_GLM4 || (model.arch == LLM_ARCH_DEEPSEEK2 && q->ne[1] <= 8)) {
             ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
         }

I'll push this up.

Remaining questions:

  • Is it okay that it does not work without -fa?
  • I didn't test on other hardware, nor did I include the latest mainline patch ggml_mul_mat_set_prec(cur, GGML_PREC_F32);.

👤 ubergarm commented the 2025-04-25 at 15:35:45:

Okay, so now without -fa it no longer produces GGGGGGG but it is back to this kinda stuff:

arsTab<EFBFBD>.^rellsúng pacirc Pepper九龙每:室hlt一层avit学isi<73> cé个义项PMC\":列为<E58897> friAZalyrátolpanies<65>formanceInvoke9不足 Cornel Naz/Rkoz<6F>koz<6F>INFedomaidaporaidariantchartôaid

I'll look for a reference; I think I've seen others mention this kind of output before.


👤 ikawrakow submitted a review the 2025-04-25 at 16:58:38: 💬 COMMENTED


👤 ikawrakow commented during a code review the 2025-04-25 at 16:58:38 on src/llama.cpp:

Add

                if (model.arch == LLM_ARCH_GLM4) {
                    ggml_mul_mat_set_prec(kqv_i, GGML_PREC_F32);
                }

after line 9515


👤 ikawrakow submitted a review the 2025-04-25 at 17:01:07: 💬 COMMENTED


👤 ikawrakow commented during a code review the 2025-04-25 at 17:01:07 on src/llama.cpp:

Add

if (model.arch == LLM_ARCH_GLM4) {
    ggml_mul_mat_set_prec(kqv, GGML_PREC_F32);
}

after line 9475


👤 ikawrakow commented the 2025-04-25 at 17:07:32:

I don't think any of the suggestions you are finding around the Internet are going to help. Just think about it:

  • It works on the CPU (calculation done with fp32)
  • It works with FA after setting precision to fp32
  • You set the precision of the K*Q matrix multiplication to fp32 and it improved things, but did not fix them. Getting GGGGG... basically means there are NaNs in the result. Getting gibberish output means the values are finite but not meaningful.

The only logical conclusion from these 3 observations is that you also need to set the precision to fp32 for the kqv = V*softmax(K*Q) matrix multiplication. The other option would be that things go wrong in the softmax calculation on CUDA. But looking at the CUDA softmax implementation, it is already done using fp32 arithmetic. Hence, it must be the kqv matrix multiplication.


👤 ubergarm commented the 2025-04-25 at 18:31:20:

Thanks, I appreciate you helping me learn on this.

Just to be clear I'm getting the gibberish output without -fa on both CPU only as well as CUDA backend.

I tried setting precision to fp32 as you describe, but still get the same gibberish.

I went ahead and tried the patch you suggested above; it seems to be taking the `kqv` path and not the `kqv_i` path, but it still gives the same gibberish.

--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9473,6 +9473,9 @@ static struct ggml_tensor * llm_build_kqv(
         GGML_ASSERT(kv.size == n_ctx);

         struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
+        if (model.arch == LLM_ARCH_GLM4) {
+            ggml_mul_mat_set_prec(kqv, GGML_PREC_F32);
+        }
         cb(kqv, "kqv", il);

         struct ggml_tensor * kqv_merged = ggml_permute(ctx, kqv, 0, 2, 1, 3);

@@ -9513,6 +9516,9 @@ static struct ggml_tensor * llm_build_kqv(
             i02 = i12 / r2v;
             auto v_i = ggml_view_3d(ctx, v, v->ne[0], v->ne[1], this_ne12, v->nb[1], v->nb[2], v->nb[2]*i02);
             auto kqv_i = ggml_mul_mat(ctx, v_i, kq_i);
+            if (model.arch == LLM_ARCH_GLM4) {
+                ggml_mul_mat_set_prec(kqv_i, GGML_PREC_F32);
+            }
             if (i12 == 0) {
                 kqv = kqv_i;
             } else {

I'll dig into the differences between mainline's non-flash-attention path and this fork's non-flash-attention path to see if anything else sticks out to me.

---

👤 ikawrakow commented the 2025-04-26 at 06:02:40:

Just to be clear, I'm getting the gibberish output without -fa on both the CPU-only and CUDA backends.

Sorry, I missed the fact that it is not working on the CPU without FA. If I had paid better attention, I would have diagnosed the problem much earlier.

Simply remove the line (line 15686 in the version I just cloned from your repository):

Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

In mainline they have reorganized how attention is built. Reshaping V to 3D at this point fits with their build_attn function, but not with the way things are done here (and were formerly done in mainline). I tested and it works!
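(For context, a rough sketch of what the fixed V handling in build_glm4() could look like; the tensor and helper names here follow common llama.cpp conventions and are assumptions, not a quote of the actual patch.)

// The V projection stays 2D [n_embd_gqa, n_tokens]; llm_build_kv() handles the
// reshaping needed by the FA / non-FA attention paths in this fork.
struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv, cur);
if (model.layers[il].bv) {
    Vcur = ggml_add(ctx0, Vcur, model.layers[il].bv);
}
cb(Vcur, "Vcur", il);

// Removed: the mainline-style 3D reshape that broke the non-FA path here
// Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);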


👤 ikawrakow commented the 2025-04-26 at 07:19:17:

Here is a quick CPU-only sweep-bench performance comparison to mainline for the bartowski/THUDM_GLM-Z1-32B-0414-GGUF model you are using:

Mainline

./bin/llama-sweep-bench -m THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf -c 8192 -t 32 -fa -ctk q8_0 -ctv q8_0
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   19.729 |    25.95 |   30.953 |     4.14 |
|   512 |    128 |    512 |   20.930 |    24.46 |   31.639 |     4.05 |
|   512 |    128 |   1024 |   22.138 |    23.13 |   32.156 |     3.98 |
|   512 |    128 |   1536 |   23.310 |    21.96 |   32.627 |     3.92 |
|   512 |    128 |   2048 |   24.451 |    20.94 |   33.047 |     3.87 |
|   512 |    128 |   2560 |   25.607 |    19.99 |   33.452 |     3.83 |
|   512 |    128 |   3072 |   26.732 |    19.15 |   33.765 |     3.79 |
|   512 |    128 |   3584 |   27.819 |    18.40 |   34.119 |     3.75 |
|   512 |    128 |   4096 |   28.965 |    17.68 |   34.460 |     3.71 |
|   512 |    128 |   4608 |   30.076 |    17.02 |   34.823 |     3.68 |
|   512 |    128 |   5120 |   31.207 |    16.41 |   35.184 |     3.64 |
|   512 |    128 |   5632 |   32.371 |    15.82 |   35.544 |     3.60 |
|   512 |    128 |   6144 |   33.485 |    15.29 |   35.917 |     3.56 |
|   512 |    128 |   6656 |   34.627 |    14.79 |   36.275 |     3.53 |
|   512 |    128 |   7168 |   35.749 |    14.32 |   36.641 |     3.49 |
|   512 |    128 |   7680 |   36.891 |    13.88 |   37.006 |     3.46 |

ik_llama.cpp

./bin/llama-sweep-bench -m THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf -c 8192 -t 32 -fa -ctk q8_0 -ctv q8_0 -rtr

(but I needed the changes in PR #349 to make FA work on the CPU).

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    7.275 |    70.38 |   30.690 |     4.17 |
|   512 |    128 |    512 |    7.445 |    68.77 |   31.104 |     4.12 |
|   512 |    128 |   1024 |    7.608 |    67.30 |   31.206 |     4.10 |
|   512 |    128 |   1536 |    7.778 |    65.83 |   31.421 |     4.07 |
|   512 |    128 |   2048 |    7.929 |    64.57 |   31.559 |     4.06 |
|   512 |    128 |   2560 |    8.087 |    63.31 |   31.746 |     4.03 |
|   512 |    128 |   3072 |    8.243 |    62.11 |   31.883 |     4.01 |
|   512 |    128 |   3584 |    8.405 |    60.91 |   32.053 |     3.99 |
|   512 |    128 |   4096 |    8.545 |    59.92 |   32.169 |     3.98 |
|   512 |    128 |   4608 |    8.706 |    58.81 |   32.351 |     3.96 |
|   512 |    128 |   5120 |    8.855 |    57.82 |   32.398 |     3.95 |
|   512 |    128 |   5632 |    9.025 |    56.73 |   32.591 |     3.93 |
|   512 |    128 |   6144 |    9.164 |    55.87 |   32.655 |     3.92 |
|   512 |    128 |   6656 |    9.316 |    54.96 |   32.838 |     3.90 |
|   512 |    128 |   7168 |    9.476 |    54.03 |   32.902 |     3.89 |
|   512 |    128 |   7680 |    9.635 |    53.14 |   33.091 |     3.87 |

👤 ubergarm commented the 2025-04-26 at 14:53:00:

Sweet, that fixes up the non-flash-attention case! This model is quite efficient; I just ran it with 128k context while using only 21194 MiB of VRAM! Looking forward to some testing and benchmarking soon.

For now I'll fix up this PR, rebase it onto your recent cohere2 additions, and set it to ready for review afterwards.

Thanks again really appreciate your time looking at this! Cheers!


👤 ubergarm commented the 2025-04-26 at 15:23:46:

Okay got it rebased, gonna force push it up after quick final test!!!


👤 ikawrakow submitted a review the 2025-04-26 at 15:33:46: APPROVED


👤 ubergarm commented the 2025-04-26 at 15:41:12:

Yaay!! Feels good to finally get that model working haha... Thanks again for your patience and guidance! Have a g'night!


👤 ubergarm commented the 2025-04-26 at 20:04:37:

Here is a quick CPU-only sweep-bench performance comparison to mainline for the bartowski/THUDM_GLM-Z1-32B-0414-GGUF model you are using

I followed your lead and ran some llama-sweep-bench comparisons too. My CPU-only benchmarks line up with yours, but my GPU results surprised me and didn't look as good, assuming my quick port of @saood06's llama-sweep-bench back to mainline isn't introducing some issue.

ik's CPU-only test

thud-sweep-03-ik-CPU

my CPU-only test

thud-sweep-01-CPU

Logs

llama.cpp@558a76

Plus github.com/ubergarm/llama.cpp ug/port-sweep-bench branch.

$ ./build/bin/llama-sweep-bench \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa -ctk q8_0 -ctv q8_0 \
    -c 5120 \
    --no-mmap \
    --threads 16

build: 5192 (e59a5f1e) with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
llama_model_loader: loaded meta data with 37 key-value pairs and 613 tensors from /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GLM Z1 32B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 61
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 6144
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 23040
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 48
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 30
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/GLM-Z1-32B-0414-GGUF/THUD...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 366
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  245 tensors
llama_model_loader: - type q5_K:   61 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  306 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 16.38 GiB (4.32 BPW)
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 14
load: token to piece cache size = 0.9710 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 6144
print_info: n_layer          = 61
print_info: n_head           = 48
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 24
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 23040
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.57 B
print_info: general.name     = GLM Z1 32B 0414
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/62 layers to GPU
load_tensors:          CPU model buffer size = 16775.23 MiB
...............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 5120
llama_context: n_ctx_per_seq = 5120
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (5120) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
init: kv_size = 5120, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 61, can_shift = 1
init:        CPU KV buffer size =   162.03 MiB
llama_context: KV self size  =  162.03 MiB, K (q8_0):   81.02 MiB, V (q8_0):   81.02 MiB
llama_context:        CPU compute buffer size =   308.00 MiB
llama_context: graph nodes  = 2264
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 5120
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |


main: n_kv_max = 5120, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   25.444 |    20.12 |   25.742 |     4.97 |
|   512 |    128 |    512 |   28.640 |    17.88 |   26.082 |     4.91 |
|   512 |    128 |   1024 |   33.622 |    15.23 |   26.430 |     4.84 |
|   512 |    128 |   1536 |   39.245 |    13.05 |   27.190 |     4.71 |
|   512 |    128 |   2048 |   45.237 |    11.32 |   27.152 |     4.71 |
|   512 |    128 |   2560 |   51.249 |     9.99 |   27.521 |     4.65 |
|   512 |    128 |   3072 |   57.110 |     8.97 |   27.905 |     4.59 |
|   512 |    128 |   3584 |   62.143 |     8.24 |   28.275 |     4.53 |
|   512 |    128 |   4096 |   67.889 |     7.54 |   28.630 |     4.47 |
|   512 |    128 |   4608 |   72.920 |     7.02 |   29.034 |     4.41 |

ik_llama.cpp@baeefb47

$ ./build/bin/llama-sweep-bench \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -rtr -fa -ctk q8_0 -ctv q8_0 \
    -c 5120 \
    --threads 16

llama_model_loader: loaded meta data with 37 key-value pairs and 613 tensors from /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GLM Z1 32B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 61
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 6144
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 23040
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 48
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 30
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/GLM-Z1-32B-0414-GGUF/THUD...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 366
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  245 tensors
llama_model_loader: - type q5_K:   61 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  306 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9710 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = glm4
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 318088
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 24
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 23040
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 32.566 B
llm_load_print_meta: model size       = 16.382 GiB (4.321 BPW)
llm_load_print_meta: repeating layers = 15.210 GiB (4.255 BPW, 30.704 B parameters)
llm_load_print_meta: general.name     = GLM Z1 32B 0414
llm_load_print_meta: BOS token        = 151331 '[gMASK]'
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors:        CPU buffer size = 16775.23 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 5120
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   162.03 MiB
llama_new_context_with_model: KV self size  =  162.03 MiB, K (q8_0):   81.02 MiB, V (q8_0):   81.02 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   308.00 MiB
llama_new_context_with_model: graph nodes  = 1592
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 5120, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

============ Repacked 367 tensors
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    6.188 |    82.74 |   25.659 |     4.99 |
|   512 |    128 |    512 |    6.286 |    81.46 |   25.741 |     4.97 |
|   512 |    128 |   1024 |    6.383 |    80.21 |   25.814 |     4.96 |
|   512 |    128 |   1536 |    6.478 |    79.04 |   25.871 |     4.95 |
|   512 |    128 |   2048 |    6.559 |    78.06 |   25.941 |     4.93 |
|   512 |    128 |   2560 |    6.651 |    76.98 |   26.026 |     4.92 |
|   512 |    128 |   3072 |    6.734 |    76.03 |   26.051 |     4.91 |
|   512 |    128 |   3584 |    6.815 |    75.12 |   26.110 |     4.90 |
|   512 |    128 |   4096 |    6.902 |    74.18 |   26.160 |     4.89 |
|   512 |    128 |   4608 |    7.007 |    73.07 |   26.232 |     4.88 |

my CUDA GPU test

thud-sweep-02-GPU

Logs

llama.cpp@558a76

Plus github.com/ubergarm/llama.cpp ug/port-sweep-bench branch.

$ CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 16

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 5192 (e59a5f1e) with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 22895 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 613 tensors from /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GLM Z1 32B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 61
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 6144
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 23040
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 48
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 30
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/GLM-Z1-32B-0414-GGUF/THUD...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 366
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  245 tensors
llama_model_loader: - type q5_K:   61 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  306 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 16.38 GiB (4.32 BPW)
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 14
load: token to piece cache size = 0.9710 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 6144
print_info: n_layer          = 61
print_info: n_head           = 48
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 24
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 23040
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.57 B
print_info: general.name     = GLM Z1 32B 0414
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 16303.48 MiB
load_tensors:   CPU_Mapped model buffer size =   471.75 MiB
...............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 1
init:      CUDA0 KV buffer size =  1952.00 MiB
llama_context: KV self size  = 1952.00 MiB, K (f16):  976.00 MiB, V (f16):  976.00 MiB
llama_context:      CUDA0 compute buffer size =   353.00 MiB
llama_context:  CUDA_Host compute buffer size =    76.01 MiB
llama_context: graph nodes  = 2264
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |


main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    0.375 |  1365.61 |    3.339 |    38.33 |
|   512 |    128 |    512 |    0.377 |  1356.98 |    3.373 |    37.94 |
|   512 |    128 |   1024 |    0.383 |  1337.96 |    3.386 |    37.80 |
|   512 |    128 |   1536 |    0.389 |  1316.12 |    3.426 |    37.36 |
|   512 |    128 |   2048 |    0.395 |  1296.18 |    3.419 |    37.44 |
|   512 |    128 |   2560 |    0.400 |  1280.80 |    3.444 |    37.17 |
|   512 |    128 |   3072 |    0.405 |  1265.46 |    3.457 |    37.03 |
|   512 |    128 |   3584 |    0.410 |  1248.46 |    3.475 |    36.84 |
|   512 |    128 |   4096 |    0.416 |  1229.54 |    3.488 |    36.70 |
|   512 |    128 |   4608 |    0.422 |  1212.10 |    3.504 |    36.53 |
|   512 |    128 |   5120 |    0.428 |  1197.32 |    3.520 |    36.37 |
|   512 |    128 |   5632 |    0.433 |  1181.44 |    3.538 |    36.18 |
|   512 |    128 |   6144 |    0.438 |  1168.89 |    3.553 |    36.03 |
|   512 |    128 |   6656 |    0.444 |  1154.19 |    3.567 |    35.89 |
|   512 |    128 |   7168 |    0.449 |  1141.06 |    3.616 |    35.40 |
|   512 |    128 |   7680 |    0.454 |  1126.73 |    3.625 |    35.31 |
|   512 |    128 |   8192 |    0.460 |  1114.03 |    3.755 |    34.09 |
|   512 |    128 |   8704 |    0.466 |  1098.92 |    3.668 |    34.90 |
|   512 |    128 |   9216 |    0.471 |  1088.14 |    3.668 |    34.90 |
|   512 |    128 |   9728 |    0.476 |  1076.44 |    3.671 |    34.86 |
|   512 |    128 |  10240 |    0.482 |  1062.75 |    3.676 |    34.82 |
|   512 |    128 |  10752 |    0.487 |  1051.19 |    3.687 |    34.72 |
|   512 |    128 |  11264 |    0.491 |  1042.43 |    3.692 |    34.67 |
|   512 |    128 |  11776 |    0.505 |  1013.60 |    3.720 |    34.41 |
|   512 |    128 |  12288 |    0.504 |  1014.87 |    3.784 |    33.82 |
|   512 |    128 |  12800 |    0.539 |   950.02 |    3.833 |    33.39 |
|   512 |    128 |  13312 |    0.516 |   991.65 |    3.909 |    32.74 |
|   512 |    128 |  13824 |    0.522 |   981.09 |    3.873 |    33.05 |
|   512 |    128 |  14336 |    0.539 |   949.82 |    4.010 |    31.92 |
|   512 |    128 |  14848 |    0.569 |   899.85 |    3.995 |    32.04 |
|   512 |    128 |  15360 |    0.534 |   958.25 |    3.950 |    32.40 |
|   512 |    128 |  15872 |    0.539 |   949.11 |    3.824 |    33.47 |
|   512 |    128 |  16384 |    0.547 |   936.62 |    3.832 |    33.41 |
|   512 |    128 |  16896 |    0.555 |   922.31 |    3.827 |    33.45 |
|   512 |    128 |  17408 |    0.559 |   915.85 |    3.858 |    33.18 |
|   512 |    128 |  17920 |    0.561 |   913.36 |    3.847 |    33.27 |
|   512 |    128 |  18432 |    0.567 |   902.43 |    3.863 |    33.13 |
|   512 |    128 |  18944 |    0.571 |   895.97 |    3.864 |    33.12 |
|   512 |    128 |  19456 |    0.575 |   891.16 |    3.899 |    32.83 |
|   512 |    128 |  19968 |    0.580 |   882.92 |    3.857 |    33.18 |
|   512 |    128 |  20480 |    0.585 |   875.26 |    3.863 |    33.14 |
|   512 |    128 |  20992 |    0.590 |   867.25 |    3.871 |    33.07 |
|   512 |    128 |  21504 |    0.595 |   860.29 |    3.917 |    32.68 |
|   512 |    128 |  22016 |    0.600 |   853.53 |    3.921 |    32.64 |
|   512 |    128 |  22528 |    0.605 |   846.56 |    3.927 |    32.60 |
|   512 |    128 |  23040 |    0.609 |   840.50 |    3.931 |    32.56 |
|   512 |    128 |  23552 |    0.615 |   832.38 |    3.941 |    32.48 |
|   512 |    128 |  24064 |    0.620 |   825.45 |    3.945 |    32.44 |
|   512 |    128 |  24576 |    0.626 |   818.16 |    3.948 |    32.42 |
|   512 |    128 |  25088 |    0.630 |   812.67 |    3.956 |    32.36 |
|   512 |    128 |  25600 |    0.637 |   804.33 |    3.962 |    32.31 |
|   512 |    128 |  26112 |    0.640 |   800.21 |    3.967 |    32.26 |
|   512 |    128 |  26624 |    0.646 |   792.11 |    3.974 |    32.21 |
|   512 |    128 |  27136 |    0.650 |   787.81 |    3.984 |    32.13 |
|   512 |    128 |  27648 |    0.656 |   781.05 |    3.989 |    32.09 |
|   512 |    128 |  28160 |    0.663 |   771.82 |    4.086 |    31.33 |
|   512 |    128 |  28672 |    0.665 |   769.50 |    4.039 |    31.69 |
|   512 |    128 |  29184 |    0.671 |   763.01 |    4.043 |    31.66 |
|   512 |    128 |  29696 |    0.676 |   757.73 |    4.051 |    31.60 |
|   512 |    128 |  30208 |    0.680 |   752.57 |    4.054 |    31.58 |
|   512 |    128 |  30720 |    0.686 |   746.34 |    4.065 |    31.49 |
|   512 |    128 |  31232 |    0.690 |   741.72 |    4.067 |    31.47 |
|   512 |    128 |  31744 |    0.697 |   734.83 |    4.074 |    31.42 |
|   512 |    128 |  32256 |    0.701 |   730.49 |    4.083 |    31.35 |

ik_llama.cpp@baeefb47

CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 16

llama_model_loader: loaded meta data with 37 key-value pairs and 613 tensors from /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GLM Z1 32B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 61
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 6144
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 23040
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 48
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 30
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/GLM-Z1-32B-0414-GGUF/THUD...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 366
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  245 tensors
llama_model_loader: - type q5_K:   61 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  306 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9710 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = glm4
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 318088
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 24
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 23040
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 32.566 B
llm_load_print_meta: model size       = 16.382 GiB (4.321 BPW)
llm_load_print_meta: repeating layers = 15.210 GiB (4.255 BPW, 30.704 B parameters)
llm_load_print_meta: general.name     = GLM Z1 32B 0414
llm_load_print_meta: BOS token        = 151331 '[gMASK]'
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.56 MiB
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size =   471.75 MiB
llm_load_tensors:      CUDA0 buffer size = 16303.48 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1952.00 MiB
llama_new_context_with_model: KV self size  = 1952.00 MiB, K (f16):  976.00 MiB, V (f16):  976.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   308.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    76.01 MiB
llama_new_context_with_model: graph nodes  = 1592
llama_new_context_with_model: graph splits = 2

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    0.333 |  1538.22 |    3.212 |    39.85 |
|   512 |    128 |    512 |    0.343 |  1492.07 |    3.276 |    39.07 |
|   512 |    128 |   1024 |    0.354 |  1447.26 |    3.339 |    38.34 |
|   512 |    128 |   1536 |    0.363 |  1410.13 |    3.398 |    37.67 |
|   512 |    128 |   2048 |    0.373 |  1373.33 |    3.456 |    37.03 |
|   512 |    128 |   2560 |    0.384 |  1332.96 |    3.523 |    36.33 |
|   512 |    128 |   3072 |    0.394 |  1298.19 |    3.583 |    35.72 |
|   512 |    128 |   3584 |    0.405 |  1265.49 |    3.640 |    35.17 |
|   512 |    128 |   4096 |    0.415 |  1233.24 |    3.697 |    34.62 |
|   512 |    128 |   4608 |    0.426 |  1202.42 |    3.754 |    34.10 |
|   512 |    128 |   5120 |    0.436 |  1174.72 |    3.820 |    33.51 |
|   512 |    128 |   5632 |    0.446 |  1147.10 |    3.876 |    33.02 |
|   512 |    128 |   6144 |    0.457 |  1119.58 |    3.931 |    32.56 |
|   512 |    128 |   6656 |    0.468 |  1094.31 |    3.987 |    32.11 |
|   512 |    128 |   7168 |    0.477 |  1073.21 |    4.042 |    31.67 |
|   512 |    128 |   7680 |    0.487 |  1050.31 |    4.098 |    31.23 |
|   512 |    128 |   8192 |    0.500 |  1023.63 |    4.154 |    30.82 |
|   512 |    128 |   8704 |    0.511 |  1002.28 |    4.222 |    30.32 |
|   512 |    128 |   9216 |    0.521 |   982.66 |    4.278 |    29.92 |
|   512 |    128 |   9728 |    0.531 |   963.76 |    4.335 |    29.53 |
|   512 |    128 |  10240 |    0.541 |   946.41 |    4.391 |    29.15 |
|   512 |    128 |  10752 |    0.551 |   928.41 |    4.445 |    28.80 |
|   512 |    128 |  11264 |    0.561 |   912.12 |    4.502 |    28.43 |
|   512 |    128 |  11776 |    0.570 |   897.92 |    4.555 |    28.10 |
|   512 |    128 |  12288 |    0.579 |   883.61 |    4.612 |    27.76 |
|   512 |    128 |  12800 |    0.590 |   867.46 |    4.667 |    27.43 |
|   512 |    128 |  13312 |    0.601 |   852.49 |    4.720 |    27.12 |
|   512 |    128 |  13824 |    0.610 |   839.39 |    4.776 |    26.80 |
|   512 |    128 |  14336 |    0.621 |   824.14 |    4.828 |    26.51 |
|   512 |    128 |  14848 |    0.631 |   811.64 |    4.885 |    26.20 |
|   512 |    128 |  15360 |    0.642 |   797.72 |    4.934 |    25.94 |
|   512 |    128 |  15872 |    0.652 |   785.82 |    4.989 |    25.66 |
|   512 |    128 |  16384 |    0.662 |   773.33 |    5.043 |    25.38 |
|   512 |    128 |  16896 |    0.672 |   762.26 |    5.099 |    25.10 |
|   512 |    128 |  17408 |    0.681 |   751.45 |    5.153 |    24.84 |
|   512 |    128 |  17920 |    0.692 |   740.14 |    5.206 |    24.59 |
|   512 |    128 |  18432 |    0.702 |   729.58 |    5.260 |    24.33 |
|   512 |    128 |  18944 |    0.711 |   719.91 |    5.313 |    24.09 |
|   512 |    128 |  19456 |    0.720 |   710.66 |    5.371 |    23.83 |
|   512 |    128 |  19968 |    0.731 |   700.28 |    5.423 |    23.60 |
|   512 |    128 |  20480 |    0.740 |   691.88 |    5.482 |    23.35 |
|   512 |    128 |  20992 |    0.750 |   682.74 |    5.536 |    23.12 |
|   512 |    128 |  21504 |    0.761 |   673.13 |    5.591 |    22.89 |
|   512 |    128 |  22016 |    0.770 |   664.83 |    5.641 |    22.69 |
|   512 |    128 |  22528 |    0.781 |   655.60 |    5.699 |    22.46 |
|   512 |    128 |  23040 |    0.790 |   648.12 |    5.749 |    22.26 |
|   512 |    128 |  23552 |    0.800 |   639.76 |    5.804 |    22.05 |
|   512 |    128 |  24064 |    0.811 |   631.16 |    5.860 |    21.84 |
|   512 |    128 |  24576 |    0.820 |   624.55 |    5.915 |    21.64 |
|   512 |    128 |  25088 |    0.830 |   616.63 |    5.970 |    21.44 |
|   512 |    128 |  25600 |    0.840 |   609.34 |    6.028 |    21.24 |
|   512 |    128 |  26112 |    0.850 |   602.01 |    6.084 |    21.04 |
|   512 |    128 |  26624 |    0.860 |   595.01 |    6.139 |    20.85 |
|   512 |    128 |  27136 |    0.870 |   588.30 |    6.197 |    20.66 |
|   512 |    128 |  27648 |    0.880 |   582.14 |    6.251 |    20.48 |
|   512 |    128 |  28160 |    0.890 |   575.38 |    6.308 |    20.29 |
|   512 |    128 |  28672 |    0.900 |   569.14 |    6.361 |    20.12 |
|   512 |    128 |  29184 |    0.912 |   561.64 |    6.416 |    19.95 |
|   512 |    128 |  29696 |    0.920 |   556.31 |    6.472 |    19.78 |
|   512 |    128 |  30208 |    0.930 |   550.65 |    6.527 |    19.61 |
|   512 |    128 |  30720 |    0.940 |   544.53 |    6.586 |    19.44 |
|   512 |    128 |  31232 |    0.951 |   538.41 |    6.633 |    19.30 |
|   512 |    128 |  31744 |    0.961 |   532.89 |    6.693 |    19.12 |
|   512 |    128 |  32256 |    0.970 |   527.77 |    6.744 |    18.98 |

I haven't yet tried comparing runs without flash attention.


👤 saood06 commented the 2025-04-27 at 08:48:11:

This model is quite efficient; I just ran it with 128k context while using only 21194 MiB of VRAM!

Yes, it has a very high GQA factor of 24.

This caught my eye, and I was glad to see they had prior work dedicated to long-context training of LLMs, referenced in the GQA part of their technical report: LongAlign: A Recipe for Long Context Alignment of Large Language Models.
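(For a rough sense of scale, using only numbers already shown in the logs above: n_layer = 61, n_head = 48, n_head_kv = 2, head size 128. The per-token KV footprint at f16 works out to)

$$
\underbrace{2}_{K,\,V} \times n_{\text{layer}} \times n_{\text{head,kv}} \times d_{\text{head}} \times 2\,\text{bytes}
= 2 \times 61 \times 2 \times 128 \times 2 \approx 61\ \text{KiB per token},
$$

which matches the "KV self size = 1952.00 MiB" reported at 32k context in the logs above, and comes to roughly 7.6 GiB at 128k with an f16 cache (about half that with q8_0). The GQA factor itself is n_head / n_head_kv = 48 / 2 = 24.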


👤 saood06 commented the 2025-05-08 at 22:44:40:

I found this, where someone uses NoLiMa to test long-context performance, and they did notice lower performance (which I believe is because of the very high GQA factor).