Talkinghead
Introduction
This module renders a live, AI-based custom anime avatar for your AI character.
The end result is similar to that generated by VTuber software such as Live2D, but this works differently. We use the THA3 AI posing engine, which takes a single static image of the character as input. It can vary the character's expression, and pose some joints by up to 15 degrees. Modern GPUs have enough compute to do this in realtime.
This has some implications:
- You can produce new characters in a fast and agile manner.
- One expression is enough. No need to make 28 manually.
- If you need to modify some details in the character's outfit, just edit the image (either manually, or by Stable Diffusion/ControlNet).
- We can produce parametric animation on the fly, just like from a traditional 2D or 3D model - but the model is a generative AI.
As with any AI technology, there are limitations. The AI-generated output image may not be perfect, and in particular the model does not support characters wearing large hats or props. For details (and example outputs), refer to the original author's tech report.
Still images do not do the system justice; the realtime animation is a large part of its appeal. Preferences vary here, but if you have the hardware, try it; you might like it. If you prefer still images, and don't create new characters often, you may get better results by inpainting expression sprites in Stable Diffusion.
Live mode
The live mode is activated by:
- Loading the talkinghead module in SillyTavern-extras, and
- In SillyTavern settings, checking the checkbox Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras).
- Your character must have a SillyTavern/public/characters/yourcharacternamehere/talkinghead.png for this to work. You can upload one in the settings.
CUDA (SillyTavern-extras option --talkinghead-gpu) is very highly recommended. As of late 2023, a recent GPU is also recommended. For example, on a laptop with an RTX 3070 Ti mobile GPU, and the separable_half THA3 model (fastest and smallest; default when running on GPU), you can expect ≈40-50 FPS render performance. VRAM usage in this case is about 520 MB. CPU mode exists, but is very slow, about ≈2 FPS on an i7-12700H.
We rate-limit the output to 25 FPS (maximum) to avoid DoSing the SillyTavern GUI, and attempt to reach a constant 25 FPS. If the renderer runs faster, the average GPU usage will be lower, because the animation engine only generates as many frames as are actually consumed. If the renderer runs slower, the latest available frame will be re-sent as many times as needed, to isolate the client side from any render hiccups.
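As an illustration of that pacing logic (this is not the module's actual code), a minimal sketch might look like the following; poll_new_frame and send_frame are hypothetical placeholders for the renderer and the client-facing output.

```python
import time

TARGET_FPS = 25
FRAME_INTERVAL = 1.0 / TARGET_FPS

def serve_frames(poll_new_frame, send_frame):
    """Pace frame delivery at a steady 25 FPS.

    poll_new_frame() is assumed to return the newest rendered frame, or None
    if the renderer has not produced a new one yet; send_frame() pushes a
    frame to the client. Both are hypothetical placeholders.
    """
    latest = None
    next_deadline = time.monotonic()
    while True:
        frame = poll_new_frame()
        if frame is not None:          # renderer keeping up: use the fresh frame
            latest = frame
        if latest is not None:         # renderer behind: re-send the latest available frame
            send_frame(latest)
        next_deadline += FRAME_INTERVAL
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)          # renderer faster than 25 FPS: idle, so average GPU usage drops
        else:
            next_deadline = time.monotonic()  # we overran the deadline; reset the schedule
```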
To customize which THA3 model to use, and where to install the THA3 models from, see the --talkinghead-model=... and --talkinghead-models=... options, respectively.
If the directory talkinghead/tha3/models/ (under the top level of SillyTavern-extras) does not exist, the model files are automatically downloaded from HuggingFace and installed there.
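Conceptually, the first-run download behaves roughly like the sketch below. This is not the actual implementation; the repository ID shown is a placeholder, and the real source is controlled by the --talkinghead-models option.

```python
from pathlib import Path

from huggingface_hub import snapshot_download  # pip install huggingface_hub

MODELS_DIR = Path("talkinghead/tha3/models")   # under the SillyTavern-extras top level
THA3_REPO = "some-namespace/tha3-models"       # placeholder; the real source is set by --talkinghead-models

def ensure_models() -> Path:
    """Download the THA3 model files on first run if they are not already present."""
    if not MODELS_DIR.exists():
        snapshot_download(repo_id=THA3_REPO, local_dir=MODELS_DIR)
    return MODELS_DIR
```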
Manual poser
This is a standalone wxPython app that you can run locally on the machine where you installed SillyTavern-extras. It is based on the original manual poser app in the THA3 tech demo, but this version has some important new convenience features and usability improvements.
It uses the same models as the live mode. If the directory talkinghead/tha3/models/ (under the top level of SillyTavern-extras) does not exist, the model files are automatically downloaded from HuggingFace and installed there.
With this app, you can:
- Graphically edit the emotion templates used by the live mode.
  - They are JSON files, found in talkinghead/emotions/ under your SillyTavern-extras folder. Since they are plain JSON, you can also edit them with a script; see the sketch after this list.
  - The GUI also has a dropdown to quickload any preset.
  - NEVER delete or modify _defaults.json. That file stores the factory settings, and the app will not run without it.
  - For blunder recovery: to reset an emotion back to its factory setting, see the --factory-reset=EMOTION option, which will use the factory settings to overwrite the corresponding emotion preset JSON. To reset all emotion presets to factory settings, see --factory-reset-all. Careful, these operations cannot be undone!
    - Currently, these options do NOT regenerate the example images also provided in talkinghead/emotions/.
- Batch-generate the 28 static expression sprites for a character.
  - Input is the same single static image format as used by the live mode.
  - You can then use the generated images as the static expression sprites for your AI character. No need to run the live mode.
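Since the emotion templates are plain JSON, you can also tweak a preset outside the GUI. A minimal sketch follows; the morph name in the example is purely illustrative (open an existing preset to see the real keys), and remember to leave _defaults.json alone.

```python
import json
from pathlib import Path

EMOTIONS_DIR = Path("talkinghead/emotions")  # under your SillyTavern-extras folder

def tweak_emotion(name: str, overrides: dict) -> None:
    """Load an emotion preset, apply some overrides, and write it back.

    Do NOT do this to _defaults.json; edit the per-emotion files only.
    """
    path = EMOTIONS_DIR / f"{name}.json"
    preset = json.loads(path.read_text(encoding="utf-8"))
    preset.update(overrides)  # assumes a flat JSON object; check your preset files
    path.write_text(json.dumps(preset, indent=4), encoding="utf-8")

# Example (hypothetical morph name and value):
# tweak_emotion("joy", {"eye_happy_wink_left": 1.0})
```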
To run the manual poser:
- Open a terminal in your talkinghead subdirectory.
- conda activate extras
- python -m tha3.app.manual_poser
- For systems with bash, a convenience wrapper ./start_manual_poser.sh is included.

Run the poser with the --help option for a description of its command-line options. The command-line options of the manual poser are completely independent from the options of SillyTavern-extras itself.
Currently, you can choose the device to run on (GPU or CPU), and which THA3 model to use. By default, the manual poser uses GPU and the separable_float model.
GPU mode gives the best response, but CPU mode (~2 FPS) is useful at least for batch-exporting static sprites when your VRAM is already full of AI.
To load a PNG image or emotion JSON, you can either use the buttons, their hotkeys, or drag'n'drop a PNG or JSON file from your favorite file manager into the source image pane.
Troubleshooting
Missing model at startup
The separable_float variant of the THA3 models was previously included in the SillyTavern-extras repository. However, this was recently (December 2023) changed to download these models from HuggingFace if necessary, so a local copy of the model is no longer provided.
Therefore, if you updated your SillyTavern-extras installation from git, it is likely that git deleted your local copy of that particular model, leading to an error message like:
FileNotFoundError: Model file /home/xxx/SillyTavern-extras/talkinghead/tha3/models/separable_float/eyebrow_decomposer.pt not found, please check the path.
The solution is to remove (or rename) your SillyTavern-extras/talkinghead/tha3/models directory, and try again. If that directory does not exist, talkinghead will download the models at the first run.
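If you would rather keep the old files around while testing, renaming works just as well as deleting; for example (a minimal sketch, assuming you run it from the SillyTavern-extras top level):

```python
from pathlib import Path

models_dir = Path("talkinghead/tha3/models")
if models_dir.exists():
    # Move the stale directory out of the way; talkinghead will re-download on next run.
    models_dir.rename(models_dir.with_name("models.bak"))
```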
The models are shared between the live mode and the manual poser, so it doesn't matter which one you run first.
Creating a character
To create an AI avatar that talkinghead understands:
- The image must be of size 512x512, in PNG format.
- The image must have an alpha channel.
- Any pixel with nonzero alpha is part of the character.
- If the edges of the silhouette look like a cheap photoshop job, check them manually for background bleed.
- Using any method you prefer, create a front view of your character within these specifications.
- In practice, you can create an image of the character in the correct pose first, and align it as a separate step.
- If you use Stable Diffusion, see separate section below.
- To add an alpha channel to an image where the character itself is fine, but it is sitting on a background:
- In Stable Diffusion, you can try the rembg extension for Automatic1111 to get a rough first approximation.
- Also, you can try the Fuzzy Select (magic wand) tool in traditional image editors such as GIMP or Photoshop.
- Manual pixel-by-pixel editing of the edges is recommended for best results. It takes about 20 minutes per character.
- If you rendered the character on a light background, use a dark background layer when editing the edges, and vice versa.
- This makes it much easier to see which pixels have background bleed and need to be erased.
- Finally, align the character on the canvas.
- We recommend using the THA3 example character as an alignment template.
- IMPORTANT: Export the final edited image, without any background layer, as a PNG with an alpha channel.
- Load up the result into SillyTavern as a talkinghead.png, and see how well it performs.
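Before uploading, you can sanity-check the file against the specifications above with a short script. A minimal sketch using Pillow (the alpha-coverage printout is just informational):

```python
from PIL import Image  # pip install pillow

def check_talkinghead_png(path: str) -> None:
    """Check the basic requirements: a 512x512 PNG with an alpha channel."""
    img = Image.open(path)
    assert img.format == "PNG", f"expected PNG, got {img.format}"
    assert img.size == (512, 512), f"expected 512x512, got {img.size}"
    img = img.convert("RGBA")
    alpha = img.getchannel("A")
    opaque_pixels = sum(1 for a in alpha.getdata() if a > 0)
    coverage = opaque_pixels / (img.width * img.height)
    print(f"{path}: OK, character covers {coverage:.0%} of the canvas")

check_talkinghead_png("talkinghead.png")
```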
Tips for Stable Diffusion
It is possible to create a suitable character render with Stable Diffusion. We assume that you already have a local installation of the Automatic1111 webui.
- Don't initially worry about the alpha channel. You can add that after you have generated the image.
- Try the various VTuber checkpoints floating around the Internet.
- These are trained on talking anime heads in particular, so it's much easier getting a pose that works as input for THA3.
- Many human-focused SD checkpoints render best quality at 512x768 (portrait). You can always crop the image later.
- I've had good results with meina-pro-mistoon-hll3.
  - It can produce good quality anime art (that looks like it came from an actual anime), and it knows how to pose a talking head.
  - It's capable of NSFW, so be careful. Use the negative prompt appropriately.
- As the VAE, the standard vae-ft-mse-840000-ema-pruned.ckpt is fine.
- Settings: 512x768, 20 steps, DPM++ 2M Karras, CFG scale 7.
- Optionally, you can use the Dynamic Thresholding (CFG Scale Fix) extension for Automatic1111 to render the image at CFG 15 (to increase the chances of SD following the prompt correctly), but make the result look as if it had been rendered at CFG 7.
- Recommended settings: Half Cosine Up, minimum CFG scale 3, mimic CFG scale 7, all else at default values.
- Expect to render upwards of a hundred txt2img gens to get one result good enough for further refinement. (At least you can produce and triage them quickly.)
- Make it easy for yourself to find and fix the edges.
- If your character's outline consists mainly of dark colors, ask for a light background, and vice versa.
- As always with SD, some unexpected words may generate undesirable elements that are impossible to get rid of.
- For example, I wanted an AI character wearing a "futuristic track suit", but SD interpreted the "futuristic" to mean that the character should be posed on a background containing unrelated scifi tech greebles, or worse, that the result should look something like the female lead of Saikano (2002). Removing that word solved it, but did change the outfit style, too.
Prompt:
(front view, symmetry:1.2), ...character description here..., standing, arms at sides, open mouth, smiling,
simple white background, single-color white background, (illustration, 2d, cg, masterpiece:1.2)
The front view and symmetry, appropriately weighted and placed at the beginning, greatly increase the chances of actually getting a direct front view.
Negative prompt:
(three quarters view, detailed background:1.2), full body shot, (blurry, sketch, 3d, photo:1.2),
...character-specific negatives here..., negative_hand-neg, verybadimagenegative_v1.3
As usual, the negative embeddings can be found on Civitai (negative_hand-neg, verybadimagenegative_v1.3).
Then just test it, and equip the negative prompt with NSFW terms if needed.
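If you prefer to batch your gens through the Automatic1111 API (start the webui with the --api flag), the settings above map to a request payload roughly like the sketch below; this assumes the webui is listening on its default local port, and on newer webui versions the sampler may be specified as "DPM++ 2M" plus a separate "Karras" scheduler.

```python
import base64

import requests

payload = {
    "prompt": ("(front view, symmetry:1.2), ...character description here..., standing, arms at sides, "
               "open mouth, smiling, simple white background, single-color white background, "
               "(illustration, 2d, cg, masterpiece:1.2)"),
    "negative_prompt": ("(three quarters view, detailed background:1.2), full body shot, "
                        "(blurry, sketch, 3d, photo:1.2), negative_hand-neg, verybadimagenegative_v1.3"),
    "width": 512,
    "height": 768,
    "steps": 20,
    "sampler_name": "DPM++ 2M Karras",  # older webui naming; newer versions split sampler and scheduler
    "cfg_scale": 7,
    "batch_size": 4,
}
r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
r.raise_for_status()
for i, img_b64 in enumerate(r.json()["images"]):
    # Strip a possible "data:image/png;base64," prefix before decoding.
    with open(f"gen_{i:03d}.png", "wb") as f:
        f.write(base64.b64decode(img_b64.split(",", 1)[-1]))
```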
The camera angle terms in the prompt may need some experimentation. Above, we put full body shot in the negative prompt, because in SD 1.5, at least with many anime models, full body shots often get a garbled face. However, a full body shot can actually be useful here, because it has the legs available so you can crop them at whatever point they need to be cropped to align the character's face with the template.
One possible solution is to ask for a full body shot, and txt2img for a good pose and composition only, no matter the face. Then img2img the result, using the ADetailer extension for Automatic1111 (0.75 denoise, with ControlNet inpaint enabled) to fix the face.
ADetailer notes
- Some versions of ADetailer may fail to render anything into the final output image if the main denoise is set to 0, no matter the ADetailer denoise setting.
- To work around this, use a small value for the main denoise (0.05) to force it to render, without changing the rest of the image too much.
- When inpainting, the inpaint mask must cover the whole area that contains the features to be detected. Otherwise ADetailer will start to process correctly, but since the inpaint mask doesn't cover the area to be edited, it can't write there in the final output image.
- This makes sense in hindsight: when inpainting, the area to be edited must be masked. It doesn't matter how the inpainted image data is produced.
Acknowledgements
This software incorporates the THA3 AI-based anime posing engine developed by Pramook Khungurn. The THA3 code is used under the MIT license, and the THA3 AI models are used under the Creative Commons Attribution 4.0 International license. The THA3 example character is used under the Creative Commons Attribution-NonCommercial 4.0 International license.
The manual poser code has been mostly rewritten, and the live mode code is original to this software.