Torch has alignment enforcement when viewing with data type changes
but only relative to itself. Do all tensor constructions straight
off the memory-view individually so PyTorch doesn't see an alignment
problem.
This is needed for handling misaligned safetensors weights, which are
reasonably common in third party models.
This limits usage of this safetensors loader to GPU compute only
as CPU kernels are very likely to bus error. But it works for
dynamic_vram, where we really don't want to take a deep copy and we
always use GPU copy_ which disentangles the misalignment.
This is using a different layer's weight with .to(). Change it to use
the ops caster if the original layer is a comfy weight so that it picks
up dynamic_vram and async_offload functionality in full.
Co-authored-by: Rattus <rattus128@gmail.com>
* mp: fix full dynamic unloading
This was not unloading dynamic models when requesting a full unload via
the unpatch() code path.
This was OK if your workflow was all dynamic models, but fails with big
VRAM leaks if you need to fully unload something for a regular ModelPatcher.
It also fixes the "unload models" button.
* mm: load models outside of Aimdo Mempool
In dynamic_vram mode, escape the Aimdo mempool and load into the regular
mempool. Use a dummy thread to do it.
This function has a dtype argument that allows the caller to set the
dtype in the cast. TIL some models override this on weight casts, which
means it's the highest priority.
Priority scheme is: argument > model dtype > state dict dtype
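
The priority order above can be sketched as a small resolver. This is only
an illustration: `resolve_cast_dtype` and the string dtype names are
hypothetical stand-ins, not the function this commit touches.

```python
def resolve_cast_dtype(arg_dtype=None, model_dtype=None, state_dict_dtype=None):
    """Pick the dtype for a weight cast (hypothetical helper).

    Priority: explicit argument > model dtype > state dict dtype.
    """
    for candidate in (arg_dtype, model_dtype, state_dict_dtype):
        if candidate is not None:
            return candidate
    return None


# Dtypes are plain strings here to keep the sketch dependency-free.
print(resolve_cast_dtype("bf16", "fp16", "fp32"))  # argument wins
print(resolve_cast_dtype(None, "fp16", "fp32"))    # falls back to model dtype
print(resolve_cast_dtype(None, None, "fp32"))      # falls back to state dict
```
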
Pinned memory was converted back to pinning the CPU side weight without
any changes. Fix the pinner to use the CPU weight and not the model defined
geometry. This will either save RAM or stop buffer overruns when the types
mismatch.
Fix the model defined weight caster to use the [ s.weight, s.bias ]
interpretation, as xfer_dest might be the flattened pin now. Fix the detection
of needing to cast to not be conditional on !pin.
When a node is declared as dev-only, it doesn't show in the default UI
unless the dev mode is enabled in the settings. The intention is to
allow nodes related to unit testing to be included in ComfyUI
distributions without confusing the average user.
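
The visibility rule described above amounts to a simple filter. The sketch
below assumes a hypothetical `dev_only` flag and `visible_nodes` helper; the
real node schema and settings lookup may be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    name: str
    dev_only: bool = False  # hypothetical flag name


def visible_nodes(nodes, dev_mode_enabled):
    """Return the nodes the default UI should list for the current settings."""
    return [n for n in nodes if dev_mode_enabled or not n.dev_only]


# Node names are made up for the example.
nodes = [NodeInfo("ExampleNode"), NodeInfo("ExampleUnitTestNode", dev_only=True)]
print([n.name for n in visible_nodes(nodes, dev_mode_enabled=False)])
print([n.name for n in visible_nodes(nodes, dev_mode_enabled=True)])
```
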
The code throughout is None-safe and just skips the feature cache saving
step if the cache is None. Set it to None in single-frame use so qwen
doesn't burn VRAM on the unused cache.
* ops: introduce autopad for conv3d
This works around PyTorch's missing ability to causal pad as part of the
kernel and avoids massive weight duplications for padding.
* wan-vae: rework causal padding
This currently uses F.pad which takes a full deep copy and is liable to
be the VRAM peak. Instead, kick spatial padding back to the op and
consolidate the temporal padding with the cat for the cache.
* wan-vae: implement zero pad fast path
The WAN VAE is also QWEN where it is used single-image. These
convolutions are however zero padded 3d convolutions, which means the
VAE is actually just 2D down the last element of the conv weight in
the temporal dimension. Fast path this, to avoid adding zeros that
then just evaporate in convolution math but cost computation.
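
Why the fast path is valid can be checked on a toy case. The sketch below
reduces each frame to a single scalar (spatial dimensions omitted) and uses
hypothetical helper names; it only shows that with one input frame and
causal zero padding, every temporal tap except the last multiplies a zero.

```python
def causal_temporal_conv(frames, weight):
    """1D causal conv over time: zero-pad len(weight)-1 frames in front."""
    pad = len(weight) - 1
    padded = [0.0] * pad + list(frames)
    return [
        sum(w * padded[t + k] for k, w in enumerate(weight))
        for t in range(len(frames))
    ]


def single_frame_fast_path(frame, weight):
    """With one frame, only the last temporal tap sees non-zero input."""
    return [weight[-1] * frame]


weight = [0.25, -0.5, 2.0]  # temporal taps only; spatial dims omitted
frame = 3.0
full = causal_temporal_conv([frame], weight)
fast = single_frame_fast_path(frame, weight)
assert full == fast
```

In the real 3D case the same argument means the last temporal slice of the
conv weight can be applied as a 2D convolution.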
* Disable timestep embed compression when inpainting
Spatial inpainting is not compatible with the compression.
* Reduce crossattn peak VRAM
* LTX2: Refactor forward function for better VRAM efficiency
* causal_video_ae: Remove attention ResNet
This attention_head_dim argument does not exist on this constructor so
this is dead code. Remove as generic attention mid VAE conflicts with
temporal roll.
* ltx-vae: consolidate causal/non-causal code paths
* ltx-vae: add cache rolling adder
* ltx-vae: use cached adder for resnet
* ltx-vae: Implement rolling VAE
Implement a temporal rolling VAE for the LTX2 VAE.
Usually when doing temporal rolling VAEs you can just chunk on time relying
on causality and cache behind you as you go. The LTX VAE is however
non-causal.
So go whole hog and implement per layer run ahead and backpressure between
the decoder layers using recursive state between the layers.
Operations are amended with temporal_cache_state{} which they can use to
hold any state they need for partial execution. Convolutions cache up to
N-1 frames of their input behind them, and skip connections need to cache the
mismatch between convolution input and output that happens due to missing
future (non-causal) input.
Each call to run_up() processes a layer across a range of input that
may or may not be complete. It goes depth first to process as much as
possible to try and digest frames to the final output ASAP. If layers run
out of input due to convolution losses, they simply return without action
effectively applying back-pressure to the earlier layers. As the earlier
layers do more work and call deeper, the partial states are reconciled
and output continues to digest depth first as much as possible.
Chunking is done using a size quota rather than a fixed frame length and
any layer can initiate chunking, and multiple layers can chunk at different
granularities. This removes the old limitation of always having to process
1 latent frame to entirety and having to hold 8 full decoded frames as
the VRAM peak.
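
The input caching and back-pressure idea can be illustrated for a single
convolution. This is a minimal 1D sketch with scalar "frames", not the LTX
decoder: `StreamingConv` and `valid_conv` are hypothetical names, and the
real code coordinates several layers recursively rather than one op.

```python
def valid_conv(xs, weight):
    """Unpadded 1D convolution: yields len(xs) - len(weight) + 1 outputs."""
    n = len(weight)
    return [
        sum(w * xs[t + k] for k, w in enumerate(weight))
        for t in range(len(xs) - n + 1)
    ]


class StreamingConv:
    """Processes input in chunks, caching up to N-1 trailing input frames."""

    def __init__(self, weight):
        self.weight = weight
        self.cache = []  # stands in for a per-op temporal cache state

    def push(self, chunk):
        xs = self.cache + list(chunk)
        n = len(self.weight)
        # Keep the last N-1 frames so the next chunk can complete its windows.
        self.cache = xs[-(n - 1):] if n > 1 else []
        if len(xs) < n:
            return []  # not enough input yet: return without action
        return valid_conv(xs, self.weight)


weight = [1.0, 0.5, -1.0]
frames = [1, 2, 3, 4, 5, 6, 7]
conv = StreamingConv(weight)
streamed = []
for chunk in ([1], [2, 3, 4], [5, 6], [7]):  # uneven chunk sizes
    streamed += conv.push(chunk)
assert streamed == valid_conv(frames, weight)
```

The first `push` returns nothing because a 3-tap kernel cannot produce
output from one frame; later chunks pick up the cached frames, so the
concatenated result matches a single full-length pass.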
* re-init
* Update model_multitalk.py
* whitespace...
* Update model_multitalk.py
* remove print
* this is redundant
* remove import
* Restore preview functionality
* Move block_idx to transformer_options
* Remove LoopingSamplerCustomAdvanced
* Remove looping functionality, keep extension functionality
* Update model_multitalk.py
* Handle ref_attn_mask with separate patch to avoid having to always return q and k from self_attn
* Chunk attention map calculation for multiple speakers to reduce peak VRAM usage
* Update model_multitalk.py
* Add ModelPatch type back
* Fix for latest upstream
* Use DynamicCombo for cleaner node
Basically just so that single_speaker mode hides mask inputs and 2nd audio input
* Update nodes_wan.py
For LTX Audio VAE, remove normalization of audio during MEL spectrogram creation.
This aligns inference with training and prevents loud audio from being attenuated.