* ops: introduce autopad for conv3d
This works around pytorch missing ability to causal pad as part of the
kernel and avoids massive weight duplications for padding.
* wan-vae: rework causal padding
This currently uses F.pad which takes a full deep copy and is liable to
be the VRAM peak. Instead, kick spatial padding back to the op and
consolidate the temporal padding with the cat for the cache.
* wan-vae: implement zero pad fast path
The WAN VAE is also QWEN where it is used single-image. These
convolutions are however zero padded 3d convolutions, which means the
VAE is actually just 2D down the last element of the conv weight in
the temporal dimension. Fast path this, to avoid adding zeros that
then just evaporate in convoluton math but cost computation.
* Disable timestep embed compression when inpainting
Spatial inpainting not compatible with the compression
* Reduce crossattn peak VRAM
* LTX2: Refactor forward function for better VRAM efficiency
* causal_video_ae: Remove attention ResNet
This attention_head_dim argument does not exist on this constructor so
this is dead code. Remove as generic attention mid VAE conflicts with
temporal roll.
* ltx-vae: consoldate causal/non-causal code paths
* ltx-vae: add cache rolling adder
* ltx-vae: use cached adder for resnet
* ltx-vae: Implement rolling VAE
Implement a temporal rolling VAE for the LTX2 VAE.
Usually when doing temporal rolling VAEs you can just chunk on time relying
on causality and cache behind you as you go. The LTX VAE is however
non-causal.
So go whole hog and implement per layer run ahead and backpressure between
the decoder layers using recursive state beween the layers.
Operations are ammended with temporal_cache_state{} which they can use to
hold any state then need for partial execution. Convolutions cache their
inputs behind the up to N-1 frames, and skip connections need to cache the
mismatch between convolution input and output that happens due to missing
future (non-causal) input.
Each call to run_up() processes a layer accross a range on input that
may or may not be complete. It goes depth first to process as much as
possible to try and digest frames to the final output ASAP. If layers run
out of input due to convolution losses, they simply return without action
effectively applying back-pressure to the earlier layers. As the earlier
layers do more work and caller deeper, the partial states are reconciled
and output continues to digest depth first as much as possible.
Chunking is done using a size quota rather than a fixed frame length and
any layer can initiate chunking, and multiple layers can chunk at different
granulatiries. This remove the old limitation of always having to process
1 latent frame to entirety and having to hold 8 full decoded frames as
the VRAM peak.
* re-init
* Update model_multitalk.py
* whitespace...
* Update model_multitalk.py
* remove print
* this is redundant
* remove import
* Restore preview functionality
* Move block_idx to transformer_options
* Remove LoopingSamplerCustomAdvanced
* Remove looping functionality, keep extension functionality
* Update model_multitalk.py
* Handle ref_attn_mask with separate patch to avoid having to always return q and k from self_attn
* Chunk attention map calculation for multiple speakers to reduce peak VRAM usage
* Update model_multitalk.py
* Add ModelPatch type back
* Fix for latest upstream
* Use DynamicCombo for cleaner node
Basically just so that single_speaker mode hides mask inputs and 2nd audio input
* Update nodes_wan.py
For LTX Audio VAE, remove normalization of audio during MEL spectrogram creation.
This aligs inference with training and prevents loud audio from being attenuated.
* Brought over minimal elements from PR 10045 to reproduce seed_assets and register_assets_system without adding anything to the DB or server routes yet, for now making everything sync (can introduce async once everything is cleaned up and brought over)
* Added db script to insert assets stuff, cleaned up some code; assets (models) now get added/rescanned
* Added support for 5 http endpoints for assets
* Replaced Optional with | None in schemas_in.py and schemas_out.py
* Remove two routes that will not be relevant yet in this PR: HEAD /api/assets/hash/<hash> and PUT /api/assets/<id>/preview
* Remove some functions the two deleted endpoints were using
* Don't show assets scan message upon calling /object_info endpoint
* removed unsued import to satisfy ruff
* Simplified hashing function tpye hint and _hash_file_obj
* Satisfied ruff
This logic was checking comfy_cast_weights, and going straight to
to the forward_comfy_cast_weights implementation without
attempting to downscale input to fp8 in the event comfy_cast_weights
is set.
The main reason comfy_cast_weights would be set would be for async
offload, which is not a good reason to nix FP8MM.
So instead, and together the underlying exclusions for FP8MM which
are:
* having a weight_function (usually LowVramPatch)
* force_cast_weights (compute dtype override)
* the weight is not Quantized
* the input is already quantized
* the model or layer has MM explictily disabled.
If you get past all of those exclusions, quantize the input tensor.
Then hand the new input, quantized or not off to
forward_comfy_cast_weights to handle it. If the weight is offloaded
but input is quantized you will get an offloaded MM8.