* Add support for Sage Attention 3 in ComfyUI, enabled via the new CLI arg
--use-sage-attiention3
* Fix some bugs found in PR review. The N dimension threshold at which Sage
Attention 3 takes effect is reduced to 1024 (although the improvement is
not significant at that scale).
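A minimal sketch of how such a gate might look; the constant and helper name are hypothetical, only the 1024 threshold comes from the change itself:

```python
SAGE3_MIN_N = 1024  # below this the Sage Attention 3 speedup is negligible

def can_use_sage_attention3(n: int) -> bool:
    # Hypothetical gate: only switch kernels once N reaches the threshold.
    return n >= SAGE3_MIN_N
```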
* Remove the Sage Attention 3 switch, but retain the attention function
registration.
* Fix a ruff check issue in attention.py
Pretty much every error cudaHostRegister can throw also queues the same
error on the async GPU queue. This was fixed for the repinning error case,
but the bad-mmap and plain ENOMEM cases are harder to detect.
Do some dummy GPU work to clear the error state.
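A rough sketch of that recovery path, assuming PyTorch is used to issue the dummy work; the helper name and its exact placement are hypothetical:

```python
import torch

def _flush_async_cuda_error(device) -> None:
    # Hypothetical helper: after a failed cudaHostRegister, a matching error
    # may already be queued on the async GPU queue. Launch a trivial kernel
    # and synchronize so the stale error surfaces (and is swallowed) here
    # instead of being blamed on the next real operation.
    try:
        dummy = torch.zeros(1, device=device)
        dummy.add_(1)                   # some dummy GPU work
        torch.cuda.synchronize(device)  # drain the queue, consuming the error
    except RuntimeError:
        pass  # stale error consumed; subsequent launches start clean
```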
index_timestep_zero can now be selected in the
FluxKontextMultiReferenceLatentMethod node, whose display name is set to
the more generic "Edit Model Reference Method".
The inpaint part is currently missing and will be implemented later.
I think they messed up this model pretty badly. They added some
control_noise_refiner blocks but don't actually use them. There is a typo
in their code, so instead of doing control_noise_refiner -> control_layers
it runs the whole control_layers stack twice.
Unfortunately they trained with this typo, so the model works but is kind
of slow, and it would probably perform a lot better if they corrected their
code and trained it again.
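A toy illustration of the wiring described above, with stand-in callables rather than the model's real blocks:

```python
def control_noise_refiner(ctrl):
    return ctrl  # stand-in for the refiner blocks that never get used

def control_layers(x, ctrl):
    return x + ctrl  # stand-in for the full control stack

def intended_forward(x, ctrl):
    # Presumably intended: refine the control signal, then one control pass.
    return control_layers(x, control_noise_refiner(ctrl))

def shipped_forward(x, ctrl):
    # What the typo produces: control_layers runs twice, the refiner never runs.
    return control_layers(control_layers(x, ctrl), ctrl)
```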
* Make setattr safe for non-existent attributes
Handle the case where the attribute doesn't exist by returning a static
sentinel (distinct from None). If the sentinel is passed in as the set
value, delete the attr.
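A small sketch of the sentinel pattern being described; the helper names are illustrative, not the actual functions:

```python
_MISSING = object()  # static sentinel, deliberately distinct from None

def get_attr_or_sentinel(obj, name):
    # Missing attributes come back as the sentinel instead of raising.
    return getattr(obj, name, _MISSING)

def set_attr_or_delete(obj, name, value):
    # Setting the sentinel back means "restore the original missing state".
    if value is _MISSING:
        if hasattr(obj, name):
            delattr(obj, name)
    else:
        setattr(obj, name, value)
```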
* Account for dequantization and type-casts in offload costs
When measuring the cost of offload, identify weights that need a type
change or dequantization and add the size of the conversion result
to the offload cost.
This is mutually exclusive with lowvram patches, which already carry
a large conservative estimate that won't overlap the dequant cost, so
don't double count.
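A rough sketch of that accounting, assuming per-weight bookkeeping like this; the function itself is hypothetical:

```python
import torch

def offload_cost_bytes(weight: torch.Tensor, compute_dtype: torch.dtype) -> int:
    # Base cost: the stored (possibly quantized) weight itself.
    cost = weight.numel() * weight.element_size()
    # If compute needs a cast or dequantization, the converted copy exists
    # transiently too, so its size is counted as part of the offload cost.
    if weight.dtype != compute_dtype:
        cost += weight.numel() * torch.empty((), dtype=compute_dtype).element_size()
    return cost
```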
* Set the compute type on CLIP MPs
So that the loader can know the size of the weights for dequant accounting.
In the lowvram case, the math is now done in the model dtype, i.e. in the
post-dequantization domain; account for that. The patching was also
put back on the compute stream, taking it off the memory peak, so relax
MATH_FACTOR to only 2x and drop the worst-case assumption that
everything peaks at once.
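Roughly, the estimate now looks like this sketch; the names are illustrative, and only the 2x factor and the model-dtype sizing come from the change:

```python
import torch

MATH_FACTOR = 2  # relaxed from the previous worst-case multiplier

def lowvram_math_budget_bytes(weight_numel: int, model_dtype: torch.dtype) -> int:
    # Size the working memory in the model dtype (post-dequantization),
    # with 2x headroom instead of assuming every peak lands at once.
    elem = torch.empty((), dtype=model_dtype).element_size()
    return weight_numel * elem * MATH_FACTOR
```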