Torch enforces alignment when viewing with a data type change,
but only relative to itself. Do all tensor constructions straight
off the memory-view individually so PyTorch doesn't see an alignment
problem.
This is needed for handling misaligned safetensors weights, which are
reasonably common in third party models.
This limits usage of this safetensors loader to GPU compute only,
as CPU kernels are very likely to bus error. But it works for
dynamic_vram, where we really don't want to take a deep copy and we
always use GPU copy_, which disentangles the misalignment.
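To illustrate what "misaligned" means here, the sketch below scans a safetensors-style buffer (8-byte little-endian header length, JSON header, then raw tensor bytes) and reports tensors whose absolute byte offset is not a multiple of their element size. The helper names `misaligned_tensors` and `make_blob` and the reduced dtype table are hypothetical and only for illustration; this is not the loader's code, and it assumes the buffer itself starts at an aligned address.

```python
import json
import struct

# Element sizes for a few safetensors dtype tags (subset, illustration only).
ITEMSIZE = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I64": 8, "U8": 1}


def misaligned_tensors(blob: bytes) -> list:
    """Names of tensors whose absolute offset is not a multiple of itemsize."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    data_start = 8 + header_len  # data_offsets are relative to this point
    bad = []
    for name, info in header.items():
        if name == "__metadata__":
            continue
        begin, _end = info["data_offsets"]
        if (data_start + begin) % ITEMSIZE[info["dtype"]] != 0:
            bad.append(name)
    return bad


def make_blob(header: dict, payload: bytes, pad_to: int) -> bytes:
    """Hypothetical test helper: space-pad the JSON header to pad_to bytes."""
    raw = json.dumps(header).encode("utf-8").ljust(pad_to, b" ")
    return struct.pack("<Q", len(raw)) + raw + payload
```

A header padded to 128 bytes puts the data at byte 136, so an F32 tensor at relative offset 0 is aligned; padding to 129 bytes shifts it to byte 137 and the same tensor becomes misaligned.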
This is using a different layer's weight with .to(). Change it to use
the ops caster if the original layer is a comfy weight so that it picks
up dynamic_vram and async_offload functionality in full.
Co-authored-by: Rattus <rattus128@gmail.com>
* mp: fix full dynamic unloading
This was not unloading dynamic models when requesting a full unload via
the unpatch() code path.
This was OK if your workflow was all dynamic models, but fails with big
VRAM leaks if you need to fully unload something for a regular ModelPatcher.
It also fixes the "unload models" button.
* mm: load models outside of Aimdo Mempool
In dynamic_vram mode, escape the Aimdo mempool and load into the regular
mempool. Use a dummy thread to do it.
This function has a dtype argument that allows the caller to set the
dtype in the cast. TIL some models override this on weight casts, which
means it's the highest priority.
Priority scheme is: argument > model dtype > state dict dtype
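A minimal sketch of the priority scheme above, assuming each source is either a dtype or None. The function name `resolve_cast_dtype` and its parameters are hypothetical and are not the actual caster's signature; strings stand in for dtypes to keep the example dependency-free.

```python
def resolve_cast_dtype(arg_dtype=None, model_dtype=None, state_dict_dtype=None):
    """Pick the first dtype that is set: argument > model dtype > state dict dtype."""
    for candidate in (arg_dtype, model_dtype, state_dict_dtype):
        if candidate is not None:
            return candidate
    return None  # nothing specified anywhere


# An explicit argument wins over everything else.
chosen = resolve_cast_dtype("float16", "bfloat16", "float32")
# Without an argument, the model dtype beats the state dict dtype.
fallback = resolve_cast_dtype(None, "bfloat16", "float32")
```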
Pinned memory was converted back to pinning the CPU side weight without
any changes. Fix the pinner to use the CPU weight and not the model defined
geometry. This will either save RAM or stop buffer overruns when the types
mismatch.
Fix the model defined weight caster to use the [ s.weight, s.bias ]
interpretation, as xfer_dest might be the flattened pin now. Fix the detection
of needing to cast to not be conditional on !pin.
When a node is declared as dev-only, it doesn't show in the default UI
unless the dev mode is enabled in the settings. The intention is to
allow nodes related to unit testing to be included in ComfyUI
distributions without confusing the average user.
The code throughout is None-safe and just skips the feature cache saving
step if it is None. Set it to None in single frame use so qwen doesn't burn
VRAM on the unused cache.
* ops: introduce autopad for conv3d
This works around PyTorch's missing ability to causal pad as part of the
kernel and avoids massive weight duplications for padding.
* wan-vae: rework causal padding
This currently uses F.pad which takes a full deep copy and is liable to
be the VRAM peak. Instead, kick spatial padding back to the op and
consolidate the temporal padding with the cat for the cache.
* wan-vae: implement zero pad fast path
The WAN VAE is also the QWEN VAE, where it is used single-image. These
convolutions are however zero padded 3D convolutions, which means the
VAE is actually just 2D down the last element of the conv weight in
the temporal dimension. Fast path this, to avoid adding zeros that
then just evaporate in the convolution math but cost computation.
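The reasoning above can be checked along the temporal axis alone. The sketch below is a simplified 1D stand-in in pure Python, not the VAE code: with a single frame and k-1 leading zeros of causal padding, only the last kernel tap ever touches real data. The names `causal_conv1d` and `single_frame_fast_path` are hypothetical.

```python
def causal_conv1d(frames, weight):
    """Cross-correlate along time with k-1 zeros padded in front (causal)."""
    k = len(weight)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(weight[j] * padded[i + j] for j in range(k))
        for i in range(len(frames))
    ]


def single_frame_fast_path(frame, weight):
    """For one frame, every tap but the last multiplies a padded zero."""
    return [weight[-1] * frame]


weight = [0.5, -2.0, 3.0]
slow = causal_conv1d([4.0], weight)        # pads two zeros, then convolves
fast = single_frame_fast_path(4.0, weight)  # uses only weight[-1]
```

In the 3D case the same argument would apply per spatial position, which is why selecting the last temporal slice of the weight leaves an ordinary 2D convolution.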
* Disable timestep embed compression when inpainting
Spatial inpainting is not compatible with the compression.
* Reduce crossattn peak VRAM
* LTX2: Refactor forward function for better VRAM efficiency