* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>