mirror of
https://github.com/lllyasviel/stable-diffusion-webui-forge.git
synced 2026-02-24 08:43:57 +00:00
by precomputing all possible 4-bit dequantization results into a lookup table and using PyTorch indexing to fetch the dequantized values, rather than actually computing the bit operations. This should give performance very similar to native CUDA kernels, while being LoRA-friendly and more flexible.
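As a rough illustration of the lookup-table idea, here is a minimal sketch and not the repository's actual code. It assumes two 4-bit codes are packed per uint8 byte with the high nibble first, a made-up evenly spaced 16-level codebook, and a single scalar scale; the names `CODEBOOK`, `build_byte_table`, and `dequantize` are hypothetical.

```python
import torch

# Hypothetical 16-level codebook (evenly spaced in [-1, 1]); real 4-bit
# formats define their own levels and usually use per-block scales.
CODEBOOK = torch.linspace(-1.0, 1.0, 16)


def build_byte_table(codebook: torch.Tensor) -> torch.Tensor:
    """Precompute the two dequantized values for each of the 256 possible bytes."""
    byte_values = torch.arange(256)
    high = byte_values >> 4    # upper nibble -> first code (assumed order)
    low = byte_values & 0x0F   # lower nibble -> second code
    # Bit operations happen only here, once, when the table is built.
    return torch.stack([codebook[high], codebook[low]], dim=1)  # shape (256, 2)


def dequantize(packed: torch.Tensor, scale: float, table: torch.Tensor) -> torch.Tensor:
    """Dequantize packed uint8 data purely by indexing into the table."""
    values = table[packed.long()]  # shape (..., 2), one pair per byte
    # Merge each pair into the last dimension, then apply the scale.
    return values.reshape(*packed.shape[:-1], -1) * scale


TABLE = build_byte_table(CODEBOOK)
packed = torch.tensor([0x0F, 0xF0, 0x12], dtype=torch.uint8)
weights = dequantize(packed, scale=2.0, table=TABLE)  # six dequantized values
```

Because the output is an ordinary floating-point tensor produced by standard indexing, it can be combined with other tensors (for example a LoRA delta) using regular PyTorch operations.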
Please follow the standard of 315f85d4f4/3rdparty when opening a PR or modifying files.