Implements per-guide attention attenuation via log-space additive bias
in self-attention. Each guide reference tracks its own strength and
optional spatial mask in conditioning metadata (guide_attention_entries).
* utils: dont use comfy sft loader in aimdo fallback
This was going to the raw command line switch and should respect main.py
probe of whether aimdo actually loaded successfully.
* ops: dont use deferred linear load in Aimdo fallback
Avoid changes of behaviour on --fast dynamic_vram when aimdo doesnt work.
* mp: attach re-construction arguments to model patcher
When making a model-patcher from a unet or ckpt, attach a callable
function that can be called to replay the model construction. This
can be used to deep clone model patcher WRT the actual model.
Originally written by Kosinkadink
f4b99bc623
* mp: Add disable_dynamic clone argument
Add a clone argument that lets a caller clone a ModelPatcher but disable
dynamic to demote the clone to regular MP. This is useful for legacy
features where dynamic_vram support is missing or TBD.
* torch_compile: disable dynamic_vram
This is a bigger feature. Disable for the interim to preserve
functionality.
Integrate comfy-aimdo 0.2 which takes a different approach to
installing the memory allocator hook. Instead of using the complicated
and buggy pytorch MemPool+CudaPluggableAlloctor, cuda is directly hooked
making the process much more transparent to both comfy and pytorch. As
far as pytorch knows, aimdo doesnt exist anymore, and just operates
behind the scenes.
Remove all the mempool setup stuff for dynamic_vram and bump the
comfy-aimdo version. Remove the allocator object from memory_management
and demote its use as an enablment check to a boolean flag.
Comfy-aimdo 0.2 also support the pytorch cuda async allocator, so
remove the dynamic_vram based force disablement of cuda_malloc and
just go back to the old settings of allocators based on command line
input.
This check was far too broad and the dtype is not a reliable indicator
of wanting the requant (as QT returns the compute dtype as the dtype).
So explictly plumb whether fp8mm wants the requant or not.
* lora: add weight shape calculations.
This lets the loader know if a lora will change the shape of a weight
so it can take appropriate action.
* MPDynamic: force load flux img_in weight
This weight is a bit special, in that the lora changes its geometry.
This is rather unique, not handled by existing estimate and doesn't
work for either offloading or dynamic_vram.
Fix for dynamic_vram as a special case. Ideally we can fully precalculate
these lora geometry changes at load time, but just get these models
working first.
Get rid of the cat and unary negation and inplace add-cmul the two
halves of the rope. Precompute -sin once at the start of the model
rather than every transformer block.
This is slightly faster on both GPU and CPU bound setups.
The current behaviour of the default ModelPatcher is to .to a model
only if its fully loaded, which is how random non-leaf weights get
loaded in non-LowVRAM conditions.
The however means they never get loaded in dynamic_vram. In the
dynamic_vram case, force load them to the GPU.
* model_management: lazy-cache aimdo_tensor
These tensors cosntructed from aimdo-allocations are CPU expensive to
make on the pytorch side. Add a cache version that will be valid with
signature match to fast path past whatever torch is doing.
* dynamic_vram: Minimize fast path CPU work
Move as much as possible inside the not resident if block and cache
the formed weight and bias rather than the flat intermediates. In
extreme layer weight rates this adds up.
* Fix bypass dtype/device moving
* Force offloading mode for training
* training context var
* offloading implementation in training node
* fix wrong input type
* Support bypass load lora model, correct adapter/offloading handling
This was missing the stochastic rounding required for fp8 downcast
to be consistent with model_patcher.patch_weight_to_device.
Missed in testing as I spend too much time with quantized tensors
and overlooked the simpler ones.
If there are non-trivial python objects nested in the model_options, this
causes all sorts of issues. Traverse lists and dicts so clones can safely
overide settings and BYO objects but stop there on the deepclone.
* revert threaded model loader change
This change was only needed to get around the pytorch 2.7 mempool bugs,
and should have been reverted along with #12260. This fixes a different
memory leak where pytorch gets confused about cache emptying.
* load non comfy weights
* MPDynamic: Pre-generate the tensors for vbars
Apparently this is an expensive operation that slows down things.
* bump to aimdo 1.8
New features:
watermark limit feature
logging enhancements
-O2 build on linux