* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
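The clamp fusion in the two commits above can be pictured as folding the clamp into the elementwise multiply itself, so each value is clamped as it is produced instead of in a second pass over memory. This is only an illustrative sketch, not the project's actual kernel; the function name and signature are invented here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch of a fused multiply + clamp (names are hypothetical,
// not the actual CPU/CUDA kernels): instead of writing a * b to memory and
// clamping in a separate pass, clamp each product as it is computed, so the
// data is traversed once instead of twice.
std::vector<float> mul_clamp_fused(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   float lo, float hi) {
    std::vector<float> out(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        out[i] = std::clamp(a[i] * b[i], lo, hi);
    }
    return out;
}
```

The same idea carries over to the CUDA path: the clamp becomes a `min`/`max` on the register holding the product, avoiding an extra kernel launch and global-memory round trip.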
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Non-working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but the performance gain is very minor
* Fix graph parallel when ngl < n_layers
* Fix using ffn_norm
When using graph parallel with ngl < n_layers, the ffn_norm tensor
may end up split across devices while the ffn tensors are on the CPU.
In that case we get a crash because we attempt to use the non-split
buffer of ffn_norm, which is invalid. This commit fixes that.
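The fix described above boils down to a placement check before choosing which copy of ffn_norm to use. The sketch below is purely illustrative (the enum, struct, and function names are invented, not the actual llama.cpp structures): when the FFN tensors live on the CPU, fall back to the full (non-split) ffn_norm instead of the split buffer.

```cpp
#include <cstddef>

// Hypothetical sketch of the guard described in the commit message above:
// with ngl < n_layers, ffn_norm may have been split across GPUs while the
// layer's ffn tensors stayed on the CPU. Using the split ffn_norm buffer
// together with CPU ffn tensors is invalid, so select the split copy only
// when the FFN itself is on the GPU.
enum class Buffer { CPU, GPU_SPLIT };

struct Tensor {
    Buffer buf;
};

const Tensor* pick_ffn_norm(const Tensor* ffn_norm_split,
                            const Tensor* ffn_norm_full,
                            const Tensor& ffn_up) {
    // Only the GPU path may legally consume the split buffer.
    if (ffn_up.buf == Buffer::GPU_SPLIT && ffn_norm_split != nullptr) {
        return ffn_norm_split;
    }
    return ffn_norm_full;
}
```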
* Cleanup
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.
New Endpoints:
- GET /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)
Technical Notes:
- Centralizes vector aggregation logic to share implementation between
load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
vectors.
Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
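The dimension validation mentioned above can be sketched as a simple shape check performed when handling POST /control-vectors/load. The struct layout and field names below are assumptions for illustration, not the server's actual types: a control vector supplies one direction of width n_embd per targeted layer, so anything with the wrong embedding width or ragged data is rejected before it can be applied to the model context.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of load-time validation for a control vector
// (representation-engineering direction). Field names are illustrative.
struct ControlVector {
    std::string name;
    int n_embd;               // embedding width of the vector's directions
    std::vector<float> data;  // n_layers * n_embd values, layer-major
};

bool validate_control_vector(const ControlVector& cv, int model_n_embd) {
    if (cv.n_embd <= 0 || cv.n_embd != model_n_embd) {
        return false;  // width mismatch with the loaded model
    }
    if (cv.data.size() % static_cast<size_t>(cv.n_embd) != 0) {
        return false;  // ragged data: not a whole number of layers
    }
    return true;
}
```

Centralizing this check in the shared aggregation path means load, unload, and apply all see the same invariant: every resident vector already matches the model's embedding dimension.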
* adaptive_p: fix history update + use current probability for high temp
* adaptive_p: fix history update bug, update with current probability if temp is high
* replace temp-as-signal with server argument
* adaptive_p: rename ema_w_cur_p to updt_w_cur
* delete test code
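The adaptive_p history-update commits above describe blending the current pick's probability into a running average, with updt_w_cur controlling the blend weight. The sketch below is a generic exponential-moving-average update under that reading; the state layout and seeding behavior are assumptions, not the sampler's actual code.

```cpp
// Hypothetical sketch of the adaptive_p history update (names illustrative):
// keep an EMA of the probability of each picked token, where updt_w_cur is
// the weight given to the current observation on every update.
struct AdaptiveState {
    float ema_p  = 0.0f;  // running average of picked-token probability
    bool  primed = false; // the first update seeds the average directly
};

void update_history(AdaptiveState& st, float cur_p, float updt_w_cur) {
    if (!st.primed) {
        st.ema_p  = cur_p;  // seed with the first observation
        st.primed = true;
        return;
    }
    st.ema_p = (1.0f - updt_w_cur) * st.ema_p + updt_w_cur * cur_p;
}
```

Making the weight an explicit server argument, as the commits do, replaces the earlier heuristic of inferring the update behavior from the temperature setting.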