* Use mmq_id in mul_mat_id

* Better

* Also use it in the fused up+gate op

* Better -no-fmoe TG on CUDA

Still much slower than -fmoe, but about 20-25% faster than what we had before.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>