🔀 #338 - BitNet adjustments
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-04-22 |
| Updated | 2025-04-22 |
Description
Two small tweaks to #337:
- Use `create_tensor` instead of `ml.create_tensor`. This is necessary for tensor overrides to work (in case one would ever want to use tensor overrides with a BitNet model).
- Use `output.weight` instead of `token_embd.weight` for the final matrix multiplication. This improves CUDA performance quite a bit, as `token_embd.weight` is on the host, so it needs to be copied to the GPU each time it is needed (or the matrix multiplication is done on the CPU when running TG). I see that Microsoft have decided to store `output.weight` in the model, even though it is identical to `token_embd.weight` (in the initial BitNet models one simply reused `token_embd.weight`). This makes the model quite a bit larger than it needs to be. Go figure.
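For context, a minimal sketch of what the two tweaks amount to, assuming llama.cpp-style loading and graph-building code; the helper names (`create_tensor`, `tn`, `LLM_TENSOR_*`) follow the usual llama.cpp conventions, and the lines below are illustrative rather than the PR's literal diff:

```cpp
// Sketch only: illustrates the two tweaks under llama.cpp-style conventions.

// (1) Go through the create_tensor() wrapper so per-tensor overrides
//     (e.g. --override-tensor buffer-type rules) are honored; calling
//     ml.create_tensor() directly bypasses that logic.
model.tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
model.output   = create_tensor(tn(LLM_TENSOR_OUTPUT,     "weight"), {n_embd, n_vocab});

// (2) In the compute graph, do the final projection with output.weight
//     instead of token_embd.weight: token_embd.weight stays on the host,
//     so reusing it would force a host->GPU copy (or a CPU matmul) on
//     every generated token.
cur = ggml_mul_mat(ctx0, model.output, cur);
```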
💬 Conversation
👤 saood06 commented the 2025-04-22 at 07:01:54:
* Use `create_tensor` instead of `ml.create_tensor`. This is necessary for tensor overrides to work (in case one would ever want to use tensor overrides with a BitNet model)
Yes I noticed that, I just didn't want to change until I tested if it worked first.
* Use `output.weight` instead of `token_embd.weight` for the final matrix multiplication. This improves CUDA performance quite a bit as `token_embd.weight` is on the host, so needs to be copied to the GPU each time it is needed (or the matrix multiplication is done on the CPU when running TG). I see that Microsoft have decided to have `output.weight` stored in the model, even though it is identical to `token_embd.weight` (in the initial BitNet models one simply reused `token_embd.weight`). This makes the model quite a bit larger than it needs to be. Go figure.
Interesting. There is a discussion on Hugging Face about the model being larger than it has to be. Can we change this to get a smaller model size, or is the performance benefit worth it (if the tensor can't be duplicated at runtime for CUDA)?
I also noticed that when converting, the two tensors ended up with different quants:
```
[ 1/ 333] output.weight - [ 2560, 128256, 1, 1], type = f16, converting to q6_K .. size = 626.25 MiB -> 256.86 MiB
[ 2/ 333] token_embd.weight - [ 2560, 128256, 1, 1], type = f16, converting to iq4_nl .. size = 626.25 MiB -> 176.13 MiB
```