
🔀 #328 - imatrix: collect layer influence statistics

Author ikawrakow
State Closed
Created 2025-04-14
Updated 2025-04-14

Description

@ubergarm

Here is how one can collect statistics about the change in activations caused by each layer, using cosine similarity.
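
In essence, the statistic is the mean cosine similarity between the activations entering a layer and the activations leaving it, averaged over the evaluated tokens; the lower the value, the more the layer changes the activations, and the more important it is ranked. A minimal NumPy sketch of that idea (the PR computes this incrementally in C++ inside llama-imatrix; the array shapes and names here are assumptions for illustration only):

import numpy as np

def mean_cosine_similarity(x_in: np.ndarray, x_out: np.ndarray) -> float:
    # x_in, x_out: activations of shape (n_tokens, n_embd) entering/leaving one layer
    dot  = np.sum(x_in * x_out, axis=-1)
    norm = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    return float(np.mean(dot / np.maximum(norm, 1e-12)))

def rank_layers(layer_io: dict[int, tuple[np.ndarray, np.ndarray]]) -> list[tuple[int, float]]:
    # Sort layers by <cos_sim>, lowest (= most influential) first,
    # matching the "sorted layer importances" listings in the comments below.
    scores = {l: mean_cosine_similarity(x, y) for l, (x, y) in layer_io.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])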


💬 Conversation

👤 ubergarm commented the 2025-04-14 at 14:39:20:

Holy smokes, amazing! I'm out for a couple nights, but going to pull this and try quick before leaving the house haha... Thanks!


👤 ikawrakow commented the 2025-04-14 at 16:02:02:

Does the last commit fix it? I had forgotten about having to strip the tensor name (and for whatever reason I didn't have the issue, even though I was running on CUDA).


👤 ubergarm commented the 2025-04-14 at 16:10:14:

Yep, that did the trick! Thanks! I have a chart I just graphed, will put it here with logs before heading out.


👤 ikawrakow commented the 2025-04-14 at 16:13:51:

Using this on LLaMA-4-Scout, I get the following layers sorted by importance (most important first):

======================== sorted layer importances
  0: Layer   0, <cos_sim> = 0.147234
  1: Layer   2, <cos_sim> = 0.338908
  2: Layer  47, <cos_sim> = 0.413196
  3: Layer   1, <cos_sim> = 0.626674
  4: Layer   7, <cos_sim> = 0.835974
  5: Layer   6, <cos_sim> = 0.841949
  6: Layer   4, <cos_sim> = 0.844908
  7: Layer   3, <cos_sim> = 0.849444
  8: Layer  10, <cos_sim> = 0.869448
  9: Layer  34, <cos_sim> = 0.875514
 10: Layer  22, <cos_sim> = 0.880165
 11: Layer  46, <cos_sim> = 0.881091
 12: Layer  11, <cos_sim> = 0.887115
 13: Layer  31, <cos_sim> = 0.889579
 14: Layer  35, <cos_sim> = 0.893048
 15: Layer  26, <cos_sim> = 0.897382
 16: Layer  18, <cos_sim> = 0.898017
 17: Layer  23, <cos_sim> = 0.898672
 18: Layer  21, <cos_sim> = 0.900372
 19: Layer  14, <cos_sim> = 0.902133
 20: Layer  43, <cos_sim> = 0.908545
 21: Layer  44, <cos_sim> = 0.908824
 22: Layer  38, <cos_sim> = 0.909535
 23: Layer  45, <cos_sim> = 0.909808
 24: Layer  19, <cos_sim> = 0.911718
 25: Layer   8, <cos_sim> = 0.911922
 26: Layer  30, <cos_sim> = 0.913816
 27: Layer  13, <cos_sim> = 0.916391
 28: Layer  39, <cos_sim> = 0.917897
 29: Layer  25, <cos_sim> = 0.917991
 30: Layer  24, <cos_sim> = 0.918002
 31: Layer  27, <cos_sim> = 0.918821
 32: Layer   5, <cos_sim> = 0.920709
 33: Layer  15, <cos_sim> = 0.921429
 34: Layer   9, <cos_sim> = 0.922202
 35: Layer  29, <cos_sim> = 0.923448
 36: Layer  16, <cos_sim> = 0.924396
 37: Layer  17, <cos_sim> = 0.925231
 38: Layer  42, <cos_sim> = 0.925237
 39: Layer  12, <cos_sim> = 0.926379
 40: Layer  37, <cos_sim> = 0.926797
 41: Layer  20, <cos_sim> = 0.92796
 42: Layer  28, <cos_sim> = 0.933169
 43: Layer  36, <cos_sim> = 0.936506
 44: Layer  32, <cos_sim> = 0.936671
 45: Layer  41, <cos_sim> = 0.939215
 46: Layer  33, <cos_sim> = 0.940524
 47: Layer  40, <cos_sim> = 0.948523

I had a pretty good L4-Scout recipe for IQ2_K:

./bin/llama-quantize \
    --imatrix l4_scout_imat_512.out \
    --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=iq3_k,ffn_up_exps=iq2_k,ffn_gate_exps=iq2_k" \
    ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf \
    junk1.bin iq2_k

It arrived at PPL = 9.7545, so nearly on par with Unsloth's UD-Q2_K_XL despite being 2.6 GB smaller. The recipe uses IQ4_KS for the first 6 layers of ffn_down_exps. If I instead use layers 0, 1, 2, 4, 6, 7, PPL becomes 9.7066, so we do get a small improvement from that (but using layer 47 instead of layer 4, which according to the metric would be the right thing to do, results in a worse outcome).
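
Changing which layers get the IQ4_KS ffn_down_exps override comes down to editing that one --custom-q rule. A hedged sketch of generating the rule from a layer list (it assumes the --custom-q patterns are regular expressions over tensor names, as the blk\.[0-5]\. pattern above suggests; the helper itself is hypothetical):

def ffn_down_override(layers, qtype="iq4_ks"):
    # Hypothetical helper: build a --custom-q rule that sets ffn_down_exps
    # of the given layers to qtype, assuming regex matching of tensor names.
    alternation = "|".join(str(l) for l in layers)
    return rf"blk\.({alternation})\.ffn_down_exps={qtype}"

print(ffn_down_override([0, 1, 2, 3, 4, 5]))   # the original recipe (first 6 layers)
# blk\.(0|1|2|3|4|5)\.ffn_down_exps=iq4_ks
print(ffn_down_override([0, 1, 2, 4, 6, 7]))   # the variant that gave PPL = 9.7066
# blk\.(0|1|2|4|6|7)\.ffn_down_exps=iq4_ks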


👤 ubergarm commented the 2025-04-14 at 16:28:24:

(but using layer 47 instead of layer 4, which according to the metric would be the right thing to do, results in a worse outcome)

Very interesting. Yeah, I'm curious how much the input text for imatrix affects these cosine similarities as well.

I did a quick run with llama-2-13b-chat.Q8_0.gguf and plotted the results to compare against that Layer-wise Quantization paper, which suggests that for this model the three most important layers are 1, 2, and 40, while the least important are 32, 33, and 34. Though I'm not sure how they got that final layer-40 cosine similarity.
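
If anyone wants to reproduce this kind of plot, the sorted importances blocks in the log below are easy to parse. A small sketch (the regex simply mirrors the "N: Layer M, <cos_sim> = X" lines printed below; the input file path is a placeholder for one pasted block):

import re
import matplotlib.pyplot as plt

LINE = re.compile(r"^\s*\d+:\s+Layer\s+(\d+),\s+<cos_sim>\s+=\s+([0-9.]+)")

def parse_importances(block: str) -> dict[int, float]:
    # block: the text of a single "sorted ... importances" section from the log
    scores = {}
    for line in block.splitlines():
        m = LINE.match(line)
        if m:
            scores[int(m.group(1))] = float(m.group(2))
    return scores

with open("layer_importances.txt") as f:   # placeholder path: one pasted block
    scores = parse_importances(f.read())
layers = sorted(scores)
plt.plot(layers, [scores[l] for l in layers], marker="o")
plt.xlabel("layer")
plt.ylabel("<cos_sim>")
plt.show()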

Results Graph and Log of modified llama-imatrix -lsim

[graph: lsim (plotted layer cosine similarities)]

$ git branch | grep '*'
* ik/imatrix_lsim

$ git rev-parse --short HEAD
8bff04c9

$ ./build/bin/llama-imatrix --version
version: 3638 (8bff04c9)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

$ ./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /mnt/astrodata/llm/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q8_0.gguf \
    -f calibration_data_v5_rc.txt \
    -o imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat \
    --layer-similarity \
    --output-tensor-name ffn_down.weight \
    --ctx-size 512 \
    --threads 16

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /mnt/astrodata/llm/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = [
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.016 B
llm_load_print_meta: model size       = 12.881 GiB (8.501 BPW)
llm_load_print_meta: repeating layers = 12.556 GiB (8.501 BPW, 12.688 B parameters)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 13189.86 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   248.54 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    21.01 MiB
llama_new_context_with_model: graph nodes  = 1165
llama_new_context_with_model: graph splits = 443

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 102.865 ms
compute_imatrix: computing over 277 chunks with batch_size 512
compute_imatrix: 1.61 seconds per pass - ETA 7.40 minutes
[1]8.4429,[2]9.0054,[3]6.0236,[4]5.1203,[5]5.4399,[6]4.1193,[7]3.4893,[8]3.0374,[9]2.7789,
save_imatrix: stored collected data after 10 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[10]2.5790,[11]2.4134,[12]2.2756,[13]2.2770,[14]2.1808,[15]2.3420,[16]2.5229,[17]2.6719,[18]2.6744,[19]2.7924,
save_imatrix: stored collected data after 20 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[20]2.8546,[21]2.8265,[22]2.8108,[23]2.8379,[24]2.8256,[25]2.8160,[26]2.7995,[27]2.8185,[28]2.8211,[29]2.7253,
save_imatrix: stored collected data after 30 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[30]2.7067,[31]2.7767,[32]2.8058,[33]2.8023,[34]2.8008,[35]2.8470,[36]2.9396,[37]2.9690,[38]3.0215,[39]3.0308,
save_imatrix: stored collected data after 40 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[40]3.0702,[41]3.1383,[42]3.1813,[43]3.2972,[44]3.3731,[45]3.3880,[46]3.3992,[47]3.4017,[48]3.4401,[49]3.4639,
save_imatrix: stored collected data after 50 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[50]3.5158,[51]3.5482,[52]3.5576,[53]3.5790,[54]3.6064,[55]3.6440,[56]3.6855,[57]3.6976,[58]3.7151,[59]3.7365,
save_imatrix: stored collected data after 60 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[60]3.7198,[61]3.7105,[62]3.6868,[63]3.6517,[64]3.6643,[65]3.6569,[66]3.6335,[67]3.6463,[68]3.6364,[69]3.6098,
save_imatrix: stored collected data after 70 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[70]3.5796,[71]3.5663,[72]3.5423,[73]3.5180,[74]3.4853,[75]3.4602,[76]3.4389,[77]3.4079,[78]3.4590,[79]3.4885,
save_imatrix: stored collected data after 80 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[80]3.5384,[81]3.5655,[82]3.5703,[83]3.6146,[84]3.6383,[85]3.6433,[86]3.6712,[87]3.6529,[88]3.6616,[89]3.6659,
save_imatrix: stored collected data after 90 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[90]3.6578,[91]3.6563,[92]3.7242,[93]3.7772,[94]3.8348,[95]3.8650,[96]3.9093,[97]3.9215,[98]3.9316,[99]3.9614,
save_imatrix: stored collected data after 100 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[100]3.9870,[101]3.9926,[102]4.0022,[103]4.0078,[104]4.0241,[105]4.0021,[106]4.0216,[107]4.0284,[108]4.0321,[109]4.0764,
save_imatrix: stored collected data after 110 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[110]4.1078,[111]4.1195,[112]4.1347,[113]4.1305,[114]4.1078,[115]4.1262,[116]4.1317,[117]4.1305,[118]4.1626,[119]4.1574,
save_imatrix: stored collected data after 120 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[120]4.1461,[121]4.1457,[122]4.1450,[123]4.1398,[124]4.1433,[125]4.1565,[126]4.1668,[127]4.1812,[128]4.1865,[129]4.1768,
save_imatrix: stored collected data after 130 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[130]4.1734,[131]4.2002,[132]4.2067,[133]4.2000,[134]4.1810,[135]4.2081,[136]4.2197,[137]4.2454,[138]4.2620,[139]4.2528,
save_imatrix: stored collected data after 140 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[140]4.2720,[141]4.2953,[142]4.3222,[143]4.3403,[144]4.3690,[145]4.3883,[146]4.4270,[147]4.4502,[148]4.4468,[149]4.4334,
save_imatrix: stored collected data after 150 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[150]4.4587,[151]4.4729,[152]4.4900,[153]4.4847,[154]4.5262,[155]4.5341,[156]4.5600,[157]4.5479,[158]4.5551,[159]4.5649,
save_imatrix: stored collected data after 160 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[160]4.5906,[161]4.5990,[162]4.6071,[163]4.5763,[164]4.5561,[165]4.5295,[166]4.5200,[167]4.5107,[168]4.5148,[169]4.5286,
save_imatrix: stored collected data after 170 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[170]4.5453,[171]4.5400,[172]4.5458,[173]4.5576,[174]4.5648,[175]4.5852,[176]4.6067,[177]4.6488,[178]4.6855,[179]4.7140,
save_imatrix: stored collected data after 180 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[180]4.7434,[181]4.7628,[182]4.7820,[183]4.7710,[184]4.7853,[185]4.8189,[186]4.8460,[187]4.8477,[188]4.8348,[189]4.8479,
save_imatrix: stored collected data after 190 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[190]4.8627,[191]4.8802,[192]4.9172,[193]4.9458,[194]4.9610,[195]4.9765,[196]4.9902,[197]5.0011,[198]4.9910,[199]4.9894,
save_imatrix: stored collected data after 200 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[200]4.9818,[201]4.9788,[202]4.9866,[203]4.9945,[204]4.9993,[205]5.0029,[206]5.0112,[207]5.0217,[208]5.0205,[209]5.0324,
save_imatrix: stored collected data after 210 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[210]5.0529,[211]5.0635,[212]5.0756,[213]5.0723,[214]5.0873,[215]5.0975,[216]5.1073,[217]5.1171,[218]5.1213,[219]5.1426,
save_imatrix: stored collected data after 220 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[220]5.1445,[221]5.1374,[222]5.1562,[223]5.1764,[224]5.1933,[225]5.1982,[226]5.2087,[227]5.2195,[228]5.2394,[229]5.2261,
save_imatrix: stored collected data after 230 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[230]5.2269,[231]5.2197,[232]5.2361,[233]5.2403,[234]5.2375,[235]5.2346,[236]5.2321,[237]5.2252,[238]5.2216,[239]5.2124,
save_imatrix: stored collected data after 240 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[240]5.2077,[241]5.2022,[242]5.1967,[243]5.1920,[244]5.1865,[245]5.1891,[246]5.1968,[247]5.2214,[248]5.2460,[249]5.2682,
save_imatrix: stored collected data after 250 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[250]5.2993,[251]5.3283,[252]5.3306,[253]5.3429,[254]5.3461,[255]5.3590,[256]5.3653,[257]5.3726,[258]5.3645,[259]5.3569,
save_imatrix: stored collected data after 260 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[260]5.3674,[261]5.3848,[262]5.3862,[263]5.3887,[264]5.3941,[265]5.4030,[266]5.4143,[267]5.4201,[268]5.4234,[269]5.4354,
save_imatrix: stored collected data after 270 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat
[270]5.4386,[271]5.4424,[272]5.4521,[273]5.4564,[274]5.4639,[275]5.4791,[276]5.4830,[277]5.4973,
save_imatrix: stored collected data after 277 chunks in imatrix-calibration_data_v5_rc-llama-2-13b-chat.dat

Final estimate: PPL = 5.4973 +/- 0.05449

llama_print_timings:        load time =    2147.62 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  424676.05 ms / 141824 tokens (    2.99 ms per token,   333.96 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  428646.74 ms / 141825 tokens

======================== sorted layer importances
  0: Layer   0, <cos_sim> = 0.0804587
  1: Layer   1, <cos_sim> = 0.816333
  2: Layer   3, <cos_sim> = 0.855579
  3: Layer   2, <cos_sim> = 0.870939
  4: Layer   5, <cos_sim> = 0.882884
  5: Layer   7, <cos_sim> = 0.886822
  6: Layer   6, <cos_sim> = 0.891157
  7: Layer   4, <cos_sim> = 0.897281
  8: Layer   8, <cos_sim> = 0.898462
  9: Layer   9, <cos_sim> = 0.900521
 10: Layer  10, <cos_sim> = 0.910075
 11: Layer  11, <cos_sim> = 0.912746
 12: Layer  12, <cos_sim> = 0.916058
 13: Layer  13, <cos_sim> = 0.918256
 14: Layer  15, <cos_sim> = 0.921156
 15: Layer  16, <cos_sim> = 0.922013
 16: Layer  17, <cos_sim> = 0.923089
 17: Layer  14, <cos_sim> = 0.923667
 18: Layer  18, <cos_sim> = 0.935129
 19: Layer  21, <cos_sim> = 0.935497
 20: Layer  38, <cos_sim> = 0.938946
 21: Layer  20, <cos_sim> = 0.939555
 22: Layer  19, <cos_sim> = 0.939993
 23: Layer  22, <cos_sim> = 0.949833
 24: Layer  37, <cos_sim> = 0.952011
 25: Layer  23, <cos_sim> = 0.955484
 26: Layer  36, <cos_sim> = 0.956569
 27: Layer  24, <cos_sim> = 0.96045
 28: Layer  25, <cos_sim> = 0.963482
 29: Layer  35, <cos_sim> = 0.96357
 30: Layer  26, <cos_sim> = 0.963717
 31: Layer  34, <cos_sim> = 0.966742
 32: Layer  27, <cos_sim> = 0.967312
 33: Layer  33, <cos_sim> = 0.967905
 34: Layer  28, <cos_sim> = 0.96873
 35: Layer  32, <cos_sim> = 0.969066
 36: Layer  30, <cos_sim> = 0.969155
 37: Layer  29, <cos_sim> = 0.969895
 38: Layer  31, <cos_sim> = 0.969988

======================== sorted attention importances
  0: Layer   0, <cos_sim> = 0.253426
  1: Layer   1, <cos_sim> = 0.38511
  2: Layer   2, <cos_sim> = 0.568119
  3: Layer   3, <cos_sim> = 0.70009
  4: Layer   4, <cos_sim> = 0.753275
  5: Layer   5, <cos_sim> = 0.783473
  6: Layer   7, <cos_sim> = 0.822807
  7: Layer   6, <cos_sim> = 0.833536
  8: Layer   8, <cos_sim> = 0.85773
  9: Layer   9, <cos_sim> = 0.869933
 10: Layer  10, <cos_sim> = 0.870238
 11: Layer  11, <cos_sim> = 0.876139
 12: Layer  12, <cos_sim> = 0.880516
 13: Layer  15, <cos_sim> = 0.883828
 14: Layer  14, <cos_sim> = 0.890839
 15: Layer  13, <cos_sim> = 0.891501
 16: Layer  17, <cos_sim> = 0.892781
 17: Layer  16, <cos_sim> = 0.897206
 18: Layer  20, <cos_sim> = 0.90434
 19: Layer  19, <cos_sim> = 0.905305
 20: Layer  21, <cos_sim> = 0.905376
 21: Layer  18, <cos_sim> = 0.910555
 22: Layer  23, <cos_sim> = 0.921951
 23: Layer  26, <cos_sim> = 0.926056
 24: Layer  25, <cos_sim> = 0.927626
 25: Layer  24, <cos_sim> = 0.928499
 26: Layer  28, <cos_sim> = 0.936632
 27: Layer  22, <cos_sim> = 0.936688
 28: Layer  27, <cos_sim> = 0.939766
 29: Layer  29, <cos_sim> = 0.946173
 30: Layer  31, <cos_sim> = 0.950643
 31: Layer  39, <cos_sim> = 0.951655
 32: Layer  30, <cos_sim> = 0.952739
 33: Layer  32, <cos_sim> = 0.955543
 34: Layer  36, <cos_sim> = 0.955873
 35: Layer  34, <cos_sim> = 0.957643
 36: Layer  33, <cos_sim> = 0.958336
 37: Layer  38, <cos_sim> = 0.960393
 38: Layer  37, <cos_sim> = 0.960471
 39: Layer  35, <cos_sim> = 0.962264

======================== sorted ffn importances
  0: Layer   0, <cos_sim> = 0.562579
  1: Layer   1, <cos_sim> = 0.580676
  2: Layer   2, <cos_sim> = 0.616983
  3: Layer   3, <cos_sim> = 0.706686
  4: Layer   4, <cos_sim> = 0.731208
  5: Layer   6, <cos_sim> = 0.756786
  6: Layer   5, <cos_sim> = 0.757354
  7: Layer   7, <cos_sim> = 0.796257
  8: Layer   8, <cos_sim> = 0.815461
  9: Layer  10, <cos_sim> = 0.824589
 10: Layer   9, <cos_sim> = 0.826519
 11: Layer  11, <cos_sim> = 0.846745
 12: Layer  13, <cos_sim> = 0.859737
 13: Layer  14, <cos_sim> = 0.86228
 14: Layer  12, <cos_sim> = 0.866246
 15: Layer  16, <cos_sim> = 0.866582
 16: Layer  15, <cos_sim> = 0.868753
 17: Layer  18, <cos_sim> = 0.870342
 18: Layer  19, <cos_sim> = 0.870973
 19: Layer  17, <cos_sim> = 0.874143
 20: Layer  20, <cos_sim> = 0.886187
 21: Layer  22, <cos_sim> = 0.892857
 22: Layer  21, <cos_sim> = 0.902702
 23: Layer  23, <cos_sim> = 0.902868
 24: Layer  24, <cos_sim> = 0.904163
 25: Layer  25, <cos_sim> = 0.904319
 26: Layer  27, <cos_sim> = 0.914438
 27: Layer  26, <cos_sim> = 0.917688
 28: Layer  28, <cos_sim> = 0.926051
 29: Layer  38, <cos_sim> = 0.927326
 30: Layer  29, <cos_sim> = 0.92942
 31: Layer  30, <cos_sim> = 0.932488
 32: Layer  35, <cos_sim> = 0.934298
 33: Layer  31, <cos_sim> = 0.934668
 34: Layer  37, <cos_sim> = 0.935018
 35: Layer  33, <cos_sim> = 0.936569
 36: Layer  32, <cos_sim> = 0.938647
 37: Layer  36, <cos_sim> = 0.938813
 38: Layer  34, <cos_sim> = 0.94036

Really appreciate your implementing this for further experimentation! Gotta run for now but will dig in more later this week! Thanks!