Commit Graph

31 Commits

Author SHA1 Message Date
layerdiffusion
68bf7f85aa speed up nf4 lora in offline patching mode 2024-08-22 10:35:11 -07:00
layerdiffusion
95d04e5c8f fix 2024-08-22 10:08:21 -07:00
layerdiffusion
14eac6f2cf add a way to empty cuda cache on the fly 2024-08-22 10:06:39 -07:00
layerdiffusion
909ad6c734 fix prints 2024-08-21 22:24:54 -07:00
layerdiffusion
4e3c78178a [revised] change some dtype behaviors based on community feedbacks
This only affects older devices such as the 1080/1070/1060/1050.
If you are on a 1080/1070/1060/1050 and previously used many cmd flags to tune performance, please remove those flags.
2024-08-21 10:23:38 -07:00
layerdiffusion
1419ef29aa Revert "change some dtype behaviors based on community feedbacks"
This reverts commit 31bed671ac.
2024-08-21 10:10:49 -07:00
layerdiffusion
31bed671ac change some dtype behaviors based on community feedbacks
This only affects older devices such as the 1080/1070/1060/1050.
If you are on a 1080/1070/1060/1050 and previously used many cmd flags to tune performance, please remove those flags.
2024-08-21 08:46:52 -07:00
layerdiffusion
475524496d revise 2024-08-19 18:54:54 -07:00
layerdiffusion
d7151b4dcd add low vram warning 2024-08-19 11:08:01 -07:00
layerdiffusion
d38e560e42 Implement some rethinking about LoRA system
1. Add an option that lets users run the UNet in fp8/gguf while keeping the LoRA in fp16.
2. FP16 LoRAs need no patching. Other LoRAs are patched again only when the LoRA weights change.
3. FP8 UNet + fp16 LoRA is now available in Forge (and, for now, more or less only in Forge). This also solves some “LoRA too subtle” problems.
4. Significantly speed up all gguf models (in Async mode) by using an independent thread (CUDA stream) so that compute and dequantization run at the same time, even when the low-bit weights are already on the GPU.
5. Treat “online lora” as a module similar to ControlLoRA, so that it is moved to the GPU together with the model when sampling, achieving a significant speedup and perfect low-VRAM management simultaneously.
2024-08-19 04:31:59 -07:00
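
The "only patch again when lora weight change" idea in point 2 of the commit above can be illustrated with a small cache. This is a minimal sketch in plain Python, not Forge's actual code: the `PatchedLayer` class, its `apply` method, and the `patch_count` counter are hypothetical names introduced for illustration, and plain lists of floats stand in for real tensors.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PatchedLayer:
    """Hypothetical layer that merges LoRA deltas into its weights."""

    base: list[float]                      # original weights, never modified
    patched: list[float] = field(default_factory=list)
    applied_key: Optional[tuple] = None    # signature of the last merged LoRA set
    patch_count: int = 0                   # how many real merges happened

    def apply(self, loras: dict[str, tuple[float, list[float]]]) -> list[float]:
        """Return merged weights, re-merging only if the LoRA set changed."""
        # Signature of the request: which LoRAs, at which strengths.
        # (Assumption: a LoRA's delta does not change under the same name.)
        key = tuple(sorted((name, strength) for name, (strength, _) in loras.items()))
        if key == self.applied_key:
            return self.patched            # nothing changed, skip the merge

        merged = list(self.base)
        for strength, delta in loras.values():
            merged = [w + strength * d for w, d in zip(merged, delta)]

        self.patched = merged
        self.applied_key = key
        self.patch_count += 1
        return merged


layer = PatchedLayer(base=[1.0, 2.0])
loras = {"style": (0.5, [2.0, 4.0])}
layer.apply(loras)                         # first call performs the merge
layer.apply(loras)                         # same LoRA set, served from the cache
loras["style"] = (1.0, [2.0, 4.0])
layer.apply(loras)                         # strength changed, merged again
```

Keeping `base` untouched means a changed LoRA set is always merged from the original weights rather than on top of an earlier patch.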
layerdiffusion
ab4b0d5b58 fix some mem leak 2024-08-17 00:19:43 -07:00
layerdiffusion
394da01959 simplify 2024-08-16 04:55:01 -07:00
layerdiffusion
e36487ffa5 tune 2024-08-16 04:49:25 -07:00
lllyasviel
6e6e5c2162 do some profile on 3090 2024-08-16 04:43:19 -07:00
layerdiffusion
7c0f78e424 reduce cast 2024-08-16 03:59:59 -07:00
layerdiffusion
d8b83a9501 gguf preview 2024-08-15 00:03:32 -07:00
layerdiffusion
59790f2cb4 simplify codes 2024-08-14 20:48:39 -07:00
layerdiffusion
b31f81628f Revert "simplify codes"
This reverts commit 2cc5aa7a3e.
2024-08-14 20:39:00 -07:00
layerdiffusion
2cc5aa7a3e simplify codes 2024-08-14 20:35:28 -07:00
layerdiffusion
aff742b597 speed up lora using cuda profile 2024-08-14 19:09:35 -07:00
lllyasviel
61f83dd610 support all flux models 2024-08-13 05:42:17 -07:00
layerdiffusion
f6ef105cb3 fix wrong print 2024-08-12 03:58:58 -07:00
layerdiffusion
a16ca5d057 fix amd 2024-08-11 17:53:08 -07:00
lllyasviel
cfa5242a75 forge 2.0.0
see also discussions
2024-08-10 19:24:19 -07:00
layerdiffusion
6f254f3599 revise stream 2024-08-08 20:18:56 -07:00
layerdiffusion
60c5aea11b revise stream logics 2024-08-08 18:45:36 -07:00
layerdiffusion
e1df7a1bae revise kernel 2024-08-07 17:24:22 -07:00
layerdiffusion
b57573c8da Implement many kernels from scratch 2024-08-06 20:19:03 -07:00
lllyasviel
71c94799d1 diffusion in fp8 landed 2024-08-06 16:47:39 -07:00
layerdiffusion
318219bc9d move file 2024-08-02 03:37:20 -07:00
layerdiffusion
bc9977a305 UNet from Scratch
The backend rewrite is now about 50% finished.
Estimated completion is in 72 hours.
After that, many newer features will land.
2024-08-01 21:19:41 -07:00