With this ggml_mul_mat_ext, he hit PP-512 = 209 t/s (iq1_bn) and
PP-512 = 246 t/s (iq2_bn) on the M2 Max CPU.
On the Ryzen-7950X we are at PP-512 = 447 t/s (iq1_bn, 32 threads)
and PP-512 = 530 t/s (iq2_bn, 16 threads).
i.e., as in your typical GEMM interface.
For Bitnet this gives ~1% speedup for PP, no effect for TG.
Very easy to implement for the CPU using iqk_mul_mat.
But given that every other backend would require a lot of changes,
and given the mere 1% speedup (which only applies to Bitnet),
it does not look like it is worth putting in the effort.
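
For readers unfamiliar with the "typical GEMM interface" mentioned above: the BLAS-style convention computes C = alpha * A * B + beta * C, so scaling and accumulation are folded into the matrix multiplication. Below is a minimal reference sketch of that convention, assuming row-major storage. The name gemm_ref and its parameter order are hypothetical and chosen for illustration only; this is not the signature of ggml_mul_mat_ext, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM (illustration only, not a ggml API).
// Row-major layout: A is M x K, B is K x N, C is M x N.
// Computes C = alpha * (A * B) + beta * C in place.
void gemm_ref(std::size_t M, std::size_t N, std::size_t K,
              float alpha, const float* A, const float* B,
              float beta, float* C) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Fold the scale (alpha) and the accumulation (beta) into the store.
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

With alpha = 1 and beta = 0 this reduces to a plain matrix multiplication. The extra parameters matter only when a scale or an accumulation would otherwise need a separate pass over the result.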