Squashed commit of the following:

- Add experimental visual postproc chain (lo-fi scifi hologram)
  - TODO: not configurable yet, it's always on until I fix this
- Improve emotion preset loading logic
  - Even if an emotion preset JSON is missing, load the emotion from _defaults.json.
- Add blunder recovery (emotion preset factory reset) options to manual poser
- Fix factory-default preset name angry -> anger
- Manual poser: return nonzero exit code on init error
- Manual poser too now auto-installs THA3 models if needed
- Move TODO list into its own file, dump everything there
- Add a README for the new revised talkinghead
2026-05-01 03:41:24 +00:00 · 2023-12-25 02:08:10 +02:00
parent 08a18dc506
commit 6558edb97f
7 changed files with 388 additions and 46 deletions
--- a/talkinghead/README.md
+++ b/talkinghead/README.md
@@ -0,0 +1,163 @@
+## Talkinghead
+
+<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
+**Table of Contents**
+
+- [Talkinghead](#talkinghead)
+    - [Introduction](#introduction)
+    - [Live mode](#live-mode)
+    - [Manual poser](#manual-poser)
+    - [Creating a character](#creating-a-character)
+    - [Tips for Stable Diffusion](#tips-for-stable-diffusion)
+    - [Acknowledgements](#acknowledgements)
+
+<!-- markdown-toc end -->
+
+### Introduction
+
+This module renders a **live, AI-based custom anime avatar for your AI character**.
+
+The end result is similar to that generated by VTuber software such as *Live2D*, but this works differently. We use the THA3 AI posing engine, which takes **a single static image** of the character as input. It can vary the character's expression, and pose some joints by up to 15 degrees. Modern GPUs have enough compute to do this in realtime.
+
+This has some implications:
+
+- You can produce new characters in a fast and agile manner.
+  - One expression is enough. No need to make 28 manually.
+  - If you need to modify some details in the character's outfit, just edit the image (either manually, or by Stable Diffusion/ControlNet).
+- We can produce parametric animation on the fly, just like from a traditional 2D or 3D model - but the model is a generative AI.
+
+As with any AI technology, there are limitations. The AI-generated output image may not be perfect, and in particular the model does not support characters wearing large hats or props. For details (and example outputs), refer to the original author's [tech report](https://web.archive.org/web/20220606125507/https://pkhungurn.github.io/talking-head-anime-3/full.html).
+
+Still images do not do the system justice; the realtime animation is a large part of its appeal. Preferences vary, but if you have the hardware, try it; you might like it. If you prefer still images, and don't create new characters often, you may get better results by inpainting expression sprites in Stable Diffusion.
+
+
+### Live mode
+
+To activate the live mode:
+
+- Load the `talkinghead` module in *SillyTavern-extras*.
+- In *SillyTavern* settings, check the checkbox *Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras)*.
+- Make sure your character has a `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png`. You can upload one in the settings.
+
+CUDA (*SillyTavern-extras option* `--talkinghead-gpu`) is very highly recommended. As of late 2023, a recent GPU is also recommended. For example, on a laptop with an RTX 3070 Ti mobile GPU, and the `separable_half` THA3 model (fastest and smallest; default when running on GPU), you can expect ≈40-50 FPS render performance. VRAM usage in this case is about 520 MB. CPU mode exists, but is very slow, ≈2 FPS on an i7-12700H.
+
+We rate-limit the output to 25 FPS (maximum) to avoid DoSing the SillyTavern GUI, and attempt to reach a constant 25 FPS. If the renderer runs faster, the average GPU usage will be lower, because the animation engine only generates as many frames as are actually consumed. If the renderer runs slower, the latest available frame will be re-sent as many times as needed, to isolate the client side from any render hiccups.
+
+To customize which THA3 model to use, and where to install the THA3 models from, see the `--talkinghead-model=...` and `--talkinghead-models=...` options, respectively.
+
+If the directory `talkinghead/tha3/models/` (under the top level of *SillyTavern-extras*) does not exist, the model files are automatically downloaded from HuggingFace and installed there.
+
+
+### Manual poser
+
+This is a standalone wxPython app that you can run locally on the machine where you installed *SillyTavern-extras*. It is based on the original manual poser app in the THA3 tech demo, but this version has some important new convenience features and usability improvements.
+
+It uses the same models as the live mode. If the directory `talkinghead/tha3/models/` (under the top level of *SillyTavern-extras*) does not exist, the model files are automatically downloaded from HuggingFace and installed there.
+
+With this app, you can:
+
+- **Graphically edit the emotion templates** used by the live mode.
+  - They are JSON files, found in `talkinghead/emotions/` under your *SillyTavern-extras* folder.
+    - The GUI also has a dropdown to quickload any preset.
+  - **NEVER** delete or modify `_defaults.json`. That file stores the factory settings, and the app will not run without it.
+  - For blunder recovery: to reset an emotion back to its factory setting, see the `--factory-reset=EMOTION` option, which will use the factory settings to overwrite the corresponding emotion preset JSON. To reset **all** emotion presets to factory settings, see `--factory-reset-all`. Careful, these operations **cannot** be undone!
+    - Currently, these options do **NOT** regenerate the example images also provided in `talkinghead/emotions/`.
+- **Batch-generate the 28 static expression sprites** for a character.
+  - Input is the same single static image format as used by the live mode.
+  - You can then use the generated images as the static expression sprites for your AI character. No need to run the live mode.
+
+To run the manual poser:
+
+- Open a terminal in your `talkinghead` subdirectory
+- `conda activate extras`
+- `python -m tha3.app.manual_poser` (on systems with `bash`, you can use the included convenience wrapper `./start_manual_poser.sh` instead)
+
+Run the poser with the `--help` option for a description of its command-line options. The command-line options of the manual poser are **completely independent** from the options of *SillyTavern-extras* itself.
+
+Currently, you can choose the device to run on (GPU or CPU), and which THA3 model to use. By default, the manual poser uses GPU and the `separable_float` model.
+
+GPU mode gives the best response, but CPU mode (~2 FPS) is useful at least for batch-exporting static sprites when your VRAM is already full of AI.
+
+To load a PNG image or emotion JSON, you can either use the buttons, their hotkeys, or **drag'n'drop a PNG or JSON** file from your favorite file manager into the source image pane.
+
+
+### Creating a character
+
+To create an AI avatar that `talkinghead` understands:
+
+- The image must be of size 512x512, in PNG format.
+- **The image must have an alpha channel**.
+  - Any pixel with nonzero alpha is part of the character.
+  - If the edges of the silhouette look like a cheap photoshop job, check them manually for background bleed.
+- Using any method you prefer, create a front view of your character within [these specifications](Character_Card_Guide.png).
+  - In practice, you can create an image of the character in the correct pose first, and align it as a separate step.
+  - If you use Stable Diffusion, see separate section below.
+- To add an alpha channel to an image that has the character otherwise fine, but on a background:
+  - In Stable Diffusion, you can try the [rembg](https://github.com/AUTOMATIC1111/stable-diffusion-webui-rembg) extension for Automatic1111 to get a rough first approximation.
+  - Also, you can try the *Fuzzy Select* (magic wand) tool in traditional image editors such as GIMP or Photoshop.
+  - Manual pixel-by-pixel editing of edges is recommended for best results. This takes about 20 minutes per character.
+    - If you rendered the character on a light background, use a dark background layer when editing the edges, and vice versa.
+    - This makes it much easier to see which pixels have background bleed and need to be erased.
+- Finally, align the character on the canvas.
+  - We recommend using [the THA3 example character](tha3/images/example.png) as an alignment template.
+  - **IMPORTANT**: Export the final edited image, *without any background layer*, as a PNG with an alpha channel.
+- Load up the result into *SillyTavern* as a `talkinghead.png`, and see how well it performs.
+
+### Tips for Stable Diffusion
+
+It is possible to create a suitable character render with Stable Diffusion. We assume that you already have a local installation of the [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) webui.
+
+- Don't initially worry about the alpha channel. You can add that after you have generated the image.
+- Try the various VTuber checkpoints floating around the Internet.
+  - These are trained on talking anime heads in particular, so it's much easier to get a pose that works as input for THA3.
+  - Many human-focused SD checkpoints render best quality at 512x768 (portrait). You can always crop the image later.
+- I've had good results with `meina-pro-mistoon-hll3`.
+  - It can produce good quality anime art (that looks like it came from an actual anime), and it knows how to pose a talking head.
+  - It's capable of NSFW so be careful. Use the negative prompt appropriately.
+  - As the VAE, the standard `vae-ft-mse-840000-ema-pruned.ckpt` is fine.
+  - Settings: *512x768, 20 steps, DPM++ 2M Karras, CFG scale 7*.
+  - Optionally, you can use the [Dynamic Thresholding (CFG Scale Fix)](https://github.com/mcmonkeyprojects/sd-dynamic-thresholding) extension for Automatic1111 to render the image at CFG 15 (to increase the chances of SD following the prompt correctly), but make the result look as if it had been rendered at CFG 7.
+    - Recommended settings: *Half Cosine Up, minimum CFG scale 3, mimic CFG scale 7*, all else at default values.
+- Expect to render **upwards of a hundred** *txt2img* gens to get **one** result good enough for further refinement. (At least you can produce and triage them quickly.)
+- **Make it easy for yourself to find and fix the edges.**
+  - If your character's outline consists mainly of dark colors, ask for a light background, and vice versa.
+- As always with SD, some unexpected words may generate undesirable elements that are impossible to get rid of.
+  - For example, I wanted an AI character wearing a *"futuristic track suit"*, but SD interpreted the *"futuristic"* to mean that the character should be posed on a background containing unrelated scifi tech greebles, or worse, that the result should look something like the female lead of [*Saikano* (2002)](https://en.wikipedia.org/wiki/Saikano). Removing that word solved it, but did change the outfit style, too.
+
+**Prompt**:
+
+```
+(front view, symmetry:1.2), ...character description here..., standing, arms at sides, open mouth, smiling,
+simple white background, single-color white background, (illustration, 2d, cg, masterpiece:1.2)
+```
+
+The `front view` and `symmetry`, appropriately weighted and placed at the beginning, greatly increase the chances of actually getting a direct front view.
+
+**Negative prompt**:
+
+```
+(three quarters view, detailed background:1.2), full body shot, (blurry, sketch, 3d, photo:1.2),
+...character-specific negatives here..., negative_hand-neg, verybadimagenegative_v1.3
+```
+
+As usual, the negative embeddings can be found on [Civitai](https://civitai.com/) ([negative_hand-neg](https://civitai.com/models/56519), [verybadimagenegative_v1.3](https://civitai.com/models/11772)).
+
+Then just test it, and equip the negative prompt with NSFW terms if needed.
+
+The camera angle terms in the prompt may need some experimentation. Above, we put `full body shot` in the negative prompt, because in SD 1.5, at least with many anime models, full body shots often get a garbled face. However, a full body shot can actually be useful here, because it has the legs available so you can crop them at whatever point they need to be cropped to align the character's face with the template.
+
+One possible solution is to ask for a `full body shot`, and *txt2img* for a good pose and composition only, no matter the face. Then *img2img* the result, using the [ADetailer](https://github.com/Bing-su/adetailer) extension for Automatic1111 (0.75 denoise, with [ControlNet inpaint](https://stable-diffusion-art.com/controlnet/#ControlNet_Inpainting) enabled) to fix the face.
+
+**ADetailer notes**
+
+- Some versions of ADetailer may fail to render anything into the final output image if the main denoise is set to 0, no matter the ADetailer denoise setting.
+  - To work around this, use a small value for the main denoise (0.05) to force it to render, without changing the rest of the image too much.
+- When inpainting, **the inpaint mask must cover the whole area that contains the features to be detected**. Otherwise ADetailer will start to process correctly, but since the inpaint mask doesn't cover the area to be edited, it can't write there in the final output image.
+  - This makes sense in hindsight: when inpainting, the area to be edited must be masked. It doesn't matter how the inpainted image data is produced.
+
+
+### Acknowledgements
+
+This software incorporates the [THA3](https://github.com/pkhungurn/talking-head-anime-3-demo) AI-based anime posing engine developed by Pramook Khungurn. The THA3 code is used under the MIT license, and the THA3 AI models are used under the Creative Commons Attribution 4.0 International license. The THA3 example character is used under the Creative Commons Attribution-NonCommercial 4.0 International license.
+
+The manual poser code has been mostly rewritten, and the live mode code is original to this software.