* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Not working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but performance gain is very minor