Commit Graph

24 Commits

Author SHA1 Message Date
layerdiffusion
d1d0ec46aa Maintain patching-related code
1. Fix several problems related to layerdiffuse not being unloaded.
2. Fix several problems related to Fooocus inpaint.
3. Slightly speed up on-the-fly LoRAs by precomputing them in the computation dtype (sketched below).
2024-08-30 15:18:21 -07:00
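A minimal sketch of the dtype precomputation in item 3, assuming a simple additive LoRA patch; the class and method names here are illustrative, not Forge's actual API:

```python
import torch

class OnTheFlyLoRA:
    """Illustrative LoRA patch whose factors are pre-cast to the compute dtype."""

    def __init__(self, down: torch.Tensor, up: torch.Tensor,
                 compute_dtype: torch.dtype = torch.float16):
        # Cast once at load time instead of calling .to(dtype) on every step.
        self.down = down.to(compute_dtype)
        self.up = up.to(compute_dtype)

    def apply(self, weight: torch.Tensor) -> torch.Tensor:
        # Both factors already match the computation dtype, so the hot path
        # is a single matmul and add with no dtype conversions.
        return weight + self.up @ self.down
```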
layerdiffusion
4c9380c46a Speed up quant model loading and inference ...
... based on three observations:
1. torch.Tensor.view on one big tensor is slightly faster than calling torch.Tensor.to on multiple small tensors.
2. torch.Tensor.to with a dtype change is significantly slower than torch.Tensor.view.
3. "Baking" the model on the GPU is significantly faster than computing on the CPU at model-load time.

This mainly affects inference of Q8_0 and Q4_0/1/K, and loading of all quants. A micro-benchmark sketch follows below.
2024-08-30 00:49:05 -07:00
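A minimal micro-benchmark sketching observations 1 and 2 (tensor shapes and iteration counts are arbitrary; this is not Forge's actual benchmark):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def bench(fn, iters=200):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# The same raw fp16 payload stored as one big byte buffer vs. 256 small tensors.
big = torch.empty(256 * 64 * 64 * 2, device=device, dtype=torch.uint8)
small_f32 = [torch.empty(64, 64, device=device, dtype=torch.float32) for _ in range(256)]

# Observation 1: one reinterpreting view on the big tensor is zero-copy,
# while per-tensor .to() conversions each allocate and copy.
print("one big view:        ", bench(lambda: big.view(torch.float16)))
print(".to on small tensors:", bench(lambda: [t.to(torch.float16) for t in small_f32]))
# Observation 2: a dtype-changing .to() on the big tensor is far slower than a view.
print("dtype .to on big:    ", bench(lambda: big.view(torch.float16).to(torch.float32)))
```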
layerdiffusion
0abb6c4686 Second Attempt for #1502 2024-08-28 08:08:40 -07:00
layerdiffusion
13d6f8ed90 revise GGUF by precomputing some parameters
rather than recomputing them in every diffusion iteration (sketched below)
2024-08-25 14:30:09 -07:00
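A minimal sketch of the idea for a Q8_0-style layout (34 bytes per block: one fp16 scale followed by 32 int8 values); the class and field names are illustrative, not Forge's actual GGUF code:

```python
import torch

class Q8_0Weight:
    """Illustrative Q8_0 holder: scale parameters are decoded once at load."""

    def __init__(self, raw_blocks: torch.Tensor, shape):
        # raw_blocks: (n_blocks, 34) uint8 -- 2 scale bytes + 32 quant bytes.
        self.shape = shape
        # Precompute once at load: decode the fp16 scales and split off the
        # int8 quants, instead of re-slicing raw bytes every diffusion step.
        self.scales = raw_blocks[:, :2].contiguous().view(torch.float16)  # (n_blocks, 1)
        self.quants = raw_blocks[:, 2:].contiguous().view(torch.int8)     # (n_blocks, 32)

    def dequantize(self, dtype=torch.float16) -> torch.Tensor:
        # The per-step hot path is now just a broadcast multiply and reshape.
        return (self.quants.to(dtype) * self.scales.to(dtype)).reshape(self.shape)
```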
layerdiffusion
1096c708cc revise swap module name 2024-08-20 21:18:53 -07:00
layerdiffusion
5452bc6ac3 All Forge Spaces Now Pass 4GB VRAM
and they all exactly reproduce the authors' results
2024-08-20 08:01:10 -07:00
layerdiffusion
6f411a4940 fix LoRAs on NF4 models when "loras in fp16" is activated 2024-08-20 01:29:52 -07:00
layerdiffusion
d03fc5c2b1 speed up a bit 2024-08-19 05:06:46 -07:00
layerdiffusion
d38e560e42 Implement some rethinking of the LoRA system
1. Add an option that lets users run the UNet in fp8/gguf but LoRAs in fp16.
2. FP16 LoRAs need no patching at all; others are re-patched only when LoRA weights change.
3. FP8 UNet + fp16 LoRA is now available in Forge (and, for the moment, essentially only in Forge). This also solves some “LoRA too subtle” problems.
4. Significantly speed up all gguf models (in Async mode) by using an independent thread (CUDA stream) to compute and dequantize at the same time, even when low-bit weights are already on the GPU; a sketch of this pattern follows below.
5. Treat “online lora” as a module similar to ControlLoRA, so that it is moved to the GPU together with the model when sampling, achieving a significant speedup and clean low-VRAM management at the same time.
2024-08-19 04:31:59 -07:00
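A minimal sketch of the compute/dequantize overlap in item 4, assuming a CUDA device and a hypothetical dequantize() method on each layer's quantized weight (as in the Q8_0 sketch above); this is the general side-stream pattern, not Forge's actual scheduler:

```python
import torch

# Side stream dedicated to dequantizing the *next* layer's weight while
# the default stream is still computing the current layer.
dequant_stream = torch.cuda.Stream()

def forward_layers(x, layers):
    # Start dequantizing the first layer's weight on the side stream.
    with torch.cuda.stream(dequant_stream):
        next_w = layers[0].dequantize()

    for i in range(len(layers)):
        # Block the default stream only until this layer's weight is ready.
        torch.cuda.current_stream().wait_stream(dequant_stream)
        w = next_w
        # Tell the caching allocator that w is now used on the current
        # stream, so its memory is not reused too early.
        w.record_stream(torch.cuda.current_stream())

        # Kick off dequantization of the next layer on the side stream;
        # it runs concurrently with the matmul below.
        if i + 1 < len(layers):
            with torch.cuda.stream(dequant_stream):
                next_w = layers[i + 1].dequantize()

        # Compute the current layer on the default stream.
        x = x @ w.t()

    return x
```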
layerdiffusion
8a04293430 fix some gguf loras 2024-08-17 01:15:37 -07:00
layerdiffusion
04e7f05769 speed up swap/loading of all quant types 2024-08-16 08:30:11 -07:00
layerdiffusion
c74f603ea2 remove super call 2024-08-15 00:23:31 -07:00
layerdiffusion
d8b83a9501 gguf preview 2024-08-15 00:03:32 -07:00
lllyasviel
cfa5242a75 forge 2.0.0
see also discussions
2024-08-10 19:24:19 -07:00
layerdiffusion
60c5aea11b revise stream logic 2024-08-08 18:45:36 -07:00
layerdiffusion
a91a81d8e6 revise structure 2024-08-07 20:44:34 -07:00
lllyasviel
a6baf4a4b5 revise kernel
and add unused files
2024-08-07 16:51:24 -07:00
layerdiffusion
b57573c8da Implement many kernels from scratch 2024-08-06 20:19:03 -07:00
lllyasviel
71c94799d1 diffusion in fp8 landed 2024-08-06 16:47:39 -07:00
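A minimal sketch of what fp8 weight storage for diffusion can look like, assuming PyTorch >= 2.1 for the float8_e4m3fn dtype; FP8Linear is an illustrative name, not Forge's actual module:

```python
import torch

class FP8Linear(torch.nn.Module):
    """Illustrative fp8 weight storage: roughly half the VRAM of fp16."""

    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        # One-time downcast at load; float8_e4m3fn keeps 4 exponent and
        # 3 mantissa bits, which is often enough for diffusion weights.
        self.weight = torch.nn.Parameter(
            weight_fp16.to(torch.float8_e4m3fn), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # General fp8 matmuls are not broadly available, so upcast the
        # weight to the activation dtype just for this layer's compute.
        return x @ self.weight.to(x.dtype).t()
```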
layerdiffusion
78e6933dcb fix cast 2024-08-05 05:26:04 -07:00
layerdiffusion
430482d1a0 fix cast 2024-08-03 16:08:22 -07:00
layerdiffusion
e722991752 control rework 2024-08-02 22:17:27 -07:00
layerdiffusion
1b2610db3e implement stream in new backend 2024-07-29 11:16:59 -06:00
layerdiffusion
9793b4be0f implement operations from scratch 2024-07-29 10:59:16 -06:00