Commit Graph

24 Commits

Author SHA1 Message Date
layerdiffusion
d1d0ec46aa Maintain patching-related code
1. Fix several problems related to layerdiffuse not being unloaded.
2. Fix several problems related to Fooocus inpaint.
3. Slightly speed up on-the-fly LoRAs by precomputing them in the computation dtype (sketched below).
2024-08-30 15:18:21 -07:00
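A minimal sketch of the dtype precomputation in item 3, assuming a simple additive LoRA patch; the class and method names here are illustrative, not Forge's actual API:

```python
import torch

class OnTheFlyLoRA:
    """Illustrative LoRA patch whose factors are pre-cast to the compute dtype."""

    def __init__(self, down: torch.Tensor, up: torch.Tensor,
                 compute_dtype: torch.dtype = torch.float16):
        # Cast once at load time instead of calling .to(dtype) on every step.
        self.down = down.to(compute_dtype)
        self.up = up.to(compute_dtype)

    def apply(self, weight: torch.Tensor) -> torch.Tensor:
        # Both factors already match the computation dtype, so the hot path
        # is a single matmul and add with no dtype conversions.
        return weight + self.up @ self.down
```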
layerdiffusion
4c9380c46a Speed up quant model loading and inference ...
... based on three observations:
1. torch.Tensor.view on one big tensor is slightly faster than calling torch.Tensor.to on multiple small tensors.
2. torch.Tensor.to with a dtype change is significantly slower than torch.Tensor.view.
3. "Baking" the model on the GPU is significantly faster than computing on the CPU at model-load time.

This mainly affects inference of Q8_0 and Q4_0/1/K, and loading of all quants. A micro-benchmark sketch follows below.
2024-08-30 00:49:05 -07:00
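A minimal micro-benchmark sketching observations 1 and 2 (tensor shapes and iteration counts are arbitrary; this is not Forge's actual benchmark):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def bench(fn, iters=200):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# The same raw fp16 payload stored as one big byte buffer vs. 256 small tensors.
big = torch.empty(256 * 64 * 64 * 2, device=device, dtype=torch.uint8)
small_f32 = [torch.empty(64, 64, device=device, dtype=torch.float32) for _ in range(256)]

# Observation 1: one reinterpreting view on the big tensor is zero-copy,
# while per-tensor .to() conversions each allocate and copy.
print("one big view:        ", bench(lambda: big.view(torch.float16)))
print(".to on small tensors:", bench(lambda: [t.to(torch.float16) for t in small_f32]))
# Observation 2: a dtype-changing .to() on the big tensor is far slower than a view.
print("dtype .to on big:    ", bench(lambda: big.view(torch.float16).to(torch.float32)))
```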
layerdiffusion
0abb6c4686 Second Attempt for #1502 2024-08-28 08:08:40 -07:00
layerdiffusion
13d6f8ed90 revise GGUF by precomputing some parameters
rather than recomputing them in every diffusion iteration (sketched below)
2024-08-25 14:30:09 -07:00
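A minimal sketch of the idea for a Q8_0-style layout (34 bytes per block: one fp16 scale followed by 32 int8 values); the class and field names are illustrative, not Forge's actual GGUF code:

```python
import torch

class Q8_0Weight:
    """Illustrative Q8_0 holder: scale parameters are decoded once at load."""

    def __init__(self, raw_blocks: torch.Tensor, shape):
        # raw_blocks: (n_blocks, 34) uint8 -- 2 scale bytes + 32 quant bytes.
        self.shape = shape
        # Precompute once at load: decode the fp16 scales and split off the
        # int8 quants, instead of re-slicing raw bytes every diffusion step.
        self.scales = raw_blocks[:, :2].contiguous().view(torch.float16)  # (n_blocks, 1)
        self.quants = raw_blocks[:, 2:].contiguous().view(torch.int8)     # (n_blocks, 32)

    def dequantize(self, dtype=torch.float16) -> torch.Tensor:
        # The per-step hot path is now just a broadcast multiply and reshape.
        return (self.quants.to(dtype) * self.scales.to(dtype)).reshape(self.shape)
```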
layerdiffusion
1096c708cc revise swap module name 2024-08-20 21:18:53 -07:00
layerdiffusion
5452bc6ac3 All Forge Spaces Now Pass 4GB VRAM
and they all exactly reproduce the authors' results
2024-08-20 08:01:10 -07:00
layerdiffusion
6f411a4940 fix LoRAs on NF4 models when "loras in fp16" is activated 2024-08-20 01:29:52 -07:00
layerdiffusion
d03fc5c2b1 speed up a bit 2024-08-19 05:06:46 -07:00
layerdiffusion
d38e560e42 Implement some rethinking of the LoRA system
1. Add an option that lets users run the UNet in fp8/gguf but LoRAs in fp16.
2. FP16 LoRAs need no patching at all; others are re-patched only when LoRA weights change.
3. FP8 UNet + fp16 LoRA is now available in Forge (and, for the moment, essentially only in Forge). This also solves some “LoRA too subtle” problems.
4. Significantly speed up all gguf models (in Async mode) by using an independent thread (CUDA stream) to compute and dequantize at the same time, even when low-bit weights are already on the GPU; a sketch of this pattern follows below.
5. Treat “online lora” as a module similar to ControlLoRA, so that it is moved to the GPU together with the model when sampling, achieving a significant speedup and clean low-VRAM management at the same time.
2024-08-19 04:31:59 -07:00
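A minimal sketch of the compute/dequantize overlap in item 4, assuming a CUDA device and a hypothetical dequantize() method on each layer's quantized weight (as in the Q8_0 sketch above); this is the general side-stream pattern, not Forge's actual scheduler:

```python
import torch

# Side stream dedicated to dequantizing the *next* layer's weight while
# the default stream is still computing the current layer.
dequant_stream = torch.cuda.Stream()

def forward_layers(x, layers):
    # Start dequantizing the first layer's weight on the side stream.
    with torch.cuda.stream(dequant_stream):
        next_w = layers[0].dequantize()

    for i in range(len(layers)):
        # Block the default stream only until this layer's weight is ready.
        torch.cuda.current_stream().wait_stream(dequant_stream)
        w = next_w
        # Tell the caching allocator that w is now used on the current
        # stream, so its memory is not reused too early.
        w.record_stream(torch.cuda.current_stream())

        # Kick off dequantization of the next layer on the side stream;
        # it runs concurrently with the matmul below.
        if i + 1 < len(layers):
            with torch.cuda.stream(dequant_stream):
                next_w = layers[i + 1].dequantize()

        # Compute the current layer on the default stream.
        x = x @ w.t()

    return x
```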
layerdiffusion
8a04293430 fix some gguf loras 2024-08-17 01:15:37 -07:00
layerdiffusion
04e7f05769 speed up swap/loading of all quant types 2024-08-16 08:30:11 -07:00
layerdiffusion
c74f603ea2 remove super call 2024-08-15 00:23:31 -07:00
layerdiffusion
d8b83a9501 gguf preview 2024-08-15 00:03:32 -07:00
lllyasviel
cfa5242a75 forge 2.0.0
see also discussions
2024-08-10 19:24:19 -07:00
layerdiffusion
60c5aea11b revise stream logic 2024-08-08 18:45:36 -07:00
layerdiffusion
a91a81d8e6 revise structure 2024-08-07 20:44:34 -07:00
lllyasviel
a6baf4a4b5 revise kernel
and add unused files
2024-08-07 16:51:24 -07:00
layerdiffusion
b57573c8da Implement many kernels from scratch 2024-08-06 20:19:03 -07:00
lllyasviel
71c94799d1 diffusion in fp8 landed 2024-08-06 16:47:39 -07:00
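A minimal sketch of what fp8 weight storage for diffusion can look like, assuming PyTorch >= 2.1 for the float8_e4m3fn dtype; FP8Linear is an illustrative name, not Forge's actual module:

```python
import torch

class FP8Linear(torch.nn.Module):
    """Illustrative fp8 weight storage: roughly half the VRAM of fp16."""

    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        # One-time downcast at load; float8_e4m3fn keeps 4 exponent and
        # 3 mantissa bits, which is often enough for diffusion weights.
        self.weight = torch.nn.Parameter(
            weight_fp16.to(torch.float8_e4m3fn), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # General fp8 matmuls are not broadly available, so upcast the
        # weight to the activation dtype just for this layer's compute.
        return x @ self.weight.to(x.dtype).t()
```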
layerdiffusion
78e6933dcb fix cast 2024-08-05 05:26:04 -07:00
layerdiffusion
430482d1a0 fix cast 2024-08-03 16:08:22 -07:00
layerdiffusion
e722991752 control rework 2024-08-02 22:17:27 -07:00
layerdiffusion
1b2610db3e implement stream in new backend 2024-07-29 11:16:59 -06:00
layerdiffusion
9793b4be0f implement operations from scratch 2024-07-29 10:59:16 -06:00