Mirror of https://github.com/SillyTavern/SillyTavern-Extras.git (synced 2026-05-01 03:41:24 +00:00)

Commit: update talkinghead README
- [Complete example: animator and postprocessor settings](#complete-example-animator-and-postprocessor-settings)
- [Manual poser](#manual-poser)
- [Troubleshooting](#troubleshooting)
    - [Low framerate](#low-framerate)
    - [Low VRAM - what to do?](#low-vram---what-to-do)
    - [Missing model at startup](#missing-model-at-startup)
- [Creating a character](#creating-a-character)
    - [Tips for Stable Diffusion](#tips-for-stable-diffusion)
- [Acknowledgements](#acknowledgements)

<!-- markdown-toc end -->
This module renders a **live, AI-based custom anime avatar for your AI character**.

In contrast to VTubing software, `talkinghead` is an **AI-based** character animation technology, which produces animation from just **one static 2D image**. This makes creating new characters accessible and cost-effective. All you need to get started is Stable Diffusion and an image editor. Additionally, you can experiment with your character's appearance in an agile way, animating each revision of your design.

The animator is built on top of a deep learning model, so optimal performance requires a fast GPU. The model can vary the character's expression, and pose some joints by up to 15 degrees. This allows producing parametric animation on the fly, just like from a traditional 2D or 3D model - but from a small generative AI. Modern GPUs have enough compute to do this in realtime.

You only need to provide **one** expression for your character. The model automatically generates the rest of the 28 expressions, and seamlessly animates between them. The expressions are based on *emotion templates*, which are essentially just morph settings. To make the templates convenient to edit, we provide a GUI editor (the manual poser), where you can see how the resulting expression looks on your character.

As with any AI technology, there are limitations. The AI-generated animation frames may not look perfect, and in particular the model does not support characters wearing large hats or props. For details (and many example outputs), refer to the [tech report](https://web.archive.org/web/20220606125507/https://pkhungurn.github.io/talking-head-anime-3/full.html) by the model's original author.

Still images do not do the system justice; the realtime animation is a large part of its appeal. Preferences vary here. If you have the hardware, try it - you might like it. Especially if you like to make new characters, or to tweak your character design often, this is the animator for you. On the other hand, if you prefer still images and focus on one particular design, you may get more aesthetically pleasing results by inpainting static expression sprites in Stable Diffusion.

Currently, `talkinghead` is focused on providing 1-on-1 interactions with your AI character, so support for group chats and visual novel mode is not included, nor planned. However, as a community-driven project, we appreciate feedback, and especially code or documentation contributions, towards the growth and development of this extension.

||||
### Live mode

To activate the live mode:

- Configure your *SillyTavern-extras* installation so that it loads the `talkinghead` module. This makes the backend available.
- Ensure that your character has a `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png`. This is the input image for the animator.
    - You can upload one in the *SillyTavern* settings, in *Extensions ⊳ Character Expressions*.
- To enable **talkinghead mode** in *Character Expressions*, check the checkbox *Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras)*.
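Concretely, loading the module is a matter of the backend's command-line options. The invocation below is illustrative, not authoritative; check `python server.py --help` in your installed version for the exact flags:

```shell
# Start SillyTavern-extras with the talkinghead module enabled.
# --talkinghead-gpu runs the poser on CUDA (very highly recommended; see below).
python server.py --enable-modules=talkinghead --talkinghead-gpu
```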

CUDA (*SillyTavern-extras* option `--talkinghead-gpu`) is very highly recommended. As of late 2023, a recent GPU is also recommended. For example, on a laptop with an RTX 3070 Ti mobile GPU, and the `separable_half` THA3 poser model (fastest and smallest; the default when running on GPU), you can expect ≈40-50 FPS render performance. VRAM usage in this case is about 520 MB. CPU mode exists, but is very slow, about ≈2 FPS on an i7-12700H.

We rate-limit the output to a maximum of 25 FPS to avoid DoSing the SillyTavern GUI, and attempt to hold a constant 25 FPS. If the renderer runs faster, the average GPU usage will be lower, because the animation engine only generates as many frames as are actually consumed. If the renderer runs slower, the latest available frame is re-sent as many times as needed, to isolate the client side from any render hiccups.

To customize which model variant of the THA3 poser to use, and where to install the models from, see the `--talkinghead-model=...` and `--talkinghead-models=...` options, respectively.

If the directory `talkinghead/tha3/models/` (under the top level of *SillyTavern-extras*) does not exist, the model files are automatically downloaded from HuggingFace and installed there.

#### Configuration

The live mode is configured per-character, via files **at the client end**:

- `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png`: required. The input image for the animator.
    - The `talkinghead` extension does not use or even see the other `.png` files. Those are used by *Character Expressions* when *talkinghead mode* is disabled.
- `SillyTavern/public/characters/yourcharacternamehere/_animator.json`: optional. Animator and postprocessor settings. (Note the leading underscore in the filename.)
    - If a character does not have this file, default settings are used.
- `SillyTavern/public/characters/yourcharacternamehere/_emotions.json`: optional. Custom emotion templates.
    - If a character does not have this file, default settings are used. Most of the time, there is no need to customize the emotion templates per-character.
    - At the client end, this is the only file that is needed (or even supported) for customizing the emotion templates.

#### Emotion templates

Emotion templates use the same format as the factory settings in `SillyTavern-extras/talkinghead/emotions/_defaults.json`. The manual poser app included with `talkinghead` is a GUI editor for these templates.

The batch export of the manual poser produces a set of static expression images (and corresponding emotion templates), but also an `_emotions.json`, in your chosen output folder. You can use this file at the client end as `SillyTavern/public/characters/yourcharacternamehere/_emotions.json`. This is convenient if you have customized your emotion templates and wish to share one of your characters with other users, making the character automatically use your version of the templates.
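For orientation, such a file is, roughly, a JSON object mapping emotion names to morph settings. The sketch below is hypothetical: the morph names and values are made up for illustration. The authoritative key list is `posedict_keys` (see below), and the authoritative format is `_defaults.json`:

```json
{
    "curiosity": {
        "eyebrow_raised_left": 0.8,
        "eye_surprised_left": 0.4
    },
    "joy": {
        "eye_happy_wink_left": 1.0,
        "mouth_smile": 0.5
    }
}
```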
Emotion template lookup order is:

- The set of custom templates sent by the ST client, read from `SillyTavern/public/characters/yourcharacternamehere/_emotions.json` if it exists.
    - This is completely optional - if you think the defaults are fine, there is no need to create this file.
- Server defaults, from the individual files `SillyTavern-extras/talkinghead/emotions/emotionnamehere.json`.
    - These are customizable. You can e.g. overwrite `curiosity.json` to change the default template for the *"curiosity"* emotion.
    - **IMPORTANT**: *However, updating SillyTavern-extras from git may overwrite your changes to the server-side default emotion templates.*
- Factory settings, from `SillyTavern-extras/talkinghead/emotions/_defaults.json`.
    - **IMPORTANT**: Never overwrite or remove this file.

Any emotion that is missing from a particular level in the lookup order falls through to be looked up at the next level.
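In effect, this is a chained dictionary lookup. A minimal sketch (the emotion and morph names here are made up for illustration; this is not the actual server code):

```python
from collections import ChainMap

# First hit wins: client-sent custom templates, then server defaults,
# then factory settings. An emotion missing at one level falls through.
client_templates = {"curiosity": {"eyebrow_raised": 0.8}}     # from _emotions.json (optional)
server_defaults = {"anger": {"eyebrow_angry": 1.0}}           # per-emotion JSON files
factory_settings = {"curiosity": {}, "anger": {}, "joy": {}}  # _defaults.json

emotions = ChainMap(client_templates, server_defaults, factory_settings)
print(emotions["curiosity"])  # the client's custom template wins
print(emotions["joy"])        # falls through to factory settings
```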

If you want to edit the emotion templates manually (without using the GUI) for some reason, the following may be useful sources of information:

- `posedict_keys` in [`talkinghead/tha3/app/util.py`](tha3/app/util.py) lists the morphs available in THA3.
- [`talkinghead/tha3/poser/modes/pose_parameters.py`](tha3/poser/modes/pose_parameters.py) contains some more detail.

Any morph that is not mentioned for a particular emotion defaults to zero.

*The available settings keys and examples are kept up-to-date on a best-effort basis, but there is a risk of this documentation being out of date. When in doubt, refer to the actual source code, which comes with extensive docstrings and comments. The final authoritative source is the implementation itself.*

The file `SillyTavern/public/characters/yourcharacternamehere/_animator.json` contains the animator and postprocessor settings. For any setting not specified in the file, the default value is used.

The idea is that this allows giving some personality to different characters; for example, they may sway by different amounts, the breathing cycle duration may be different, and importantly, the postprocessor settings may be different - which allows e.g. making a specific character into a scifi hologram, while others render normally.

Here is a complete example of `_animator.json`, showing the default values:
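The full listing is abridged in this excerpt. As a rough sketch of the shape of the file (the values shown here are hypothetical, not the authoritative defaults):

```json
{
    "target_fps": 25,
    "pose_interpolator_step": 0.1,
    "blink_interval_min": 2.0,
    "blink_interval_max": 5.0,
    "postprocessor_chain": []
}
```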
where:

- `target_fps`: Desired output frames per second. Note this only affects smoothness of the output (provided that the hardware is fast enough). The speed at which the animation evolves is based on wall time. Snapshots are rendered at the target FPS, or if the hardware is slower, then as often as the hardware allows. *Recommendation*: For smooth animation, set the FPS lower than what your hardware could produce, so that some compute remains untapped, available to smooth over the occasional hiccup from other running programs.
- `pose_interpolator_step`: A value such that `0 < step <= 1`. Applied at each frame at a reference of 25 FPS (to standardize the meaning of the setting), with automatic internal FPS-correction to the actual output FPS. Note that the animation is nonlinear: the step controls how much of the *remaining distance* to the current target pose is covered in 1/25 seconds.
- `blink_interval_min`: seconds. Lower limit for the random wait time after a blink, before the next blink is allowed.
- `blink_interval_max`: seconds. Upper limit for the random wait time after a blink, before the next blink is allowed.
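The `pose_interpolator_step` rule can be sketched in a few lines. The FPS-correction formula below is this sketch's own guess at how the internal correction could work (it preserves the wall-time speed of the curve), not necessarily the actual implementation:

```python
def interpolate_pose(current: float, target: float, step: float,
                     output_fps: float = 25.0) -> float:
    """One frame of nonlinear pose interpolation.

    At the 25 FPS reference, each frame covers `step` of the remaining
    distance to the target. At other output FPS, the exponent rescales
    the per-frame step so the pose evolves at the same wall-time speed.
    """
    corrected_step = 1.0 - (1.0 - step) ** (25.0 / output_fps)
    return current + corrected_step * (target - current)

# With step=0.5, one frame at 25 FPS covers half the remaining distance...
pose = interpolate_pose(0.0, 1.0, 0.5)  # -> 0.5
# ...and two frames at 50 FPS cover the same ground in the same wall time.
pose2 = interpolate_pose(0.0, 1.0, 0.5, output_fps=50.0)
pose2 = interpolate_pose(pose2, 1.0, 0.5, output_fps=50.0)  # -> 0.5
```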

The postprocessor configuration is part of `_animator.json`, stored under the key `"postprocessor_chain"`.

Postprocessing requires some additional compute, depending on the filters used and their settings. When `talkinghead` runs on the GPU, the postprocessor filters also run on the GPU. In gaming technology terms, they are essentially fragment shaders, implemented in PyTorch.

The filters in the postprocessor chain are applied to the image in the order in which they appear in the list; any order is supported. However, for best results, it is useful to keep in mind the process a real physical signal would travel through:

*Light* ⊳ *Camera* ⊳ *Transport* ⊳ *Display*

Currently, we provide some filters that simulate a lo-fi analog video look.

- `translucency`: Makes the character translucent, as if a scifi hologram.
- `banding`: Simulates the look of a CRT display as it looks when filmed on video without syncing. Brighter and darker bands travel through the image.
- `scanlines`: Simulates CRT TV like scanlines. Optionally dynamic (flipping the dimmed field at each frame).
    - From my experiments with the Phosphor deinterlacer in VLC, which implements the same effect, dynamic mode for `scanlines` would look *absolutely magical* when synchronized with display refresh, closely reproducing the look of an actual CRT TV. However, that is not possible here. Thus, it looks best at a low but reasonable FPS and a very high display refresh rate, so that small timing variations will not make much of a difference in how long a given field is actually displayed on the physical monitor.
    - If the timing is too uneven, the illusion breaks. In that case, consider using the static mode (`"dynamic": false`).
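A hypothetical sketch of such a chain, ordered along the light-to-display path above. The precise JSON shape of a chain entry and the exact parameter names are assumptions here; check them against the complete `_animator.json` example:

```json
{
    "postprocessor_chain": [
        ["analog_lowres", {}],
        ["alphanoise", {"magnitude": 0.2, "sigma": 1.0}],
        ["banding", {}],
        ["scanlines", {"dynamic": false}]
    ]
}
```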
**General use**:

The `banding` and `scanlines` filters suit this look, so we apply them here, too.

#### Postprocessor example: HDR, cheap video camera, 1980s VHS tape

After capturing the light with a cheap video camera (just like in the previous example), we simulate the effects of transporting the signal on a 1980s VHS tape. First, we blur the image with `analog_lowres`. Then we apply `alphanoise` with a nonzero `sigma`, to make the noise blobs larger than a single pixel, and a rather high `magnitude`. This simulates the brightness noise on a VHS tape. Then we make the image ripple horizontally with `analog_badhsync`, and finally add a bad VHS tracking effect to complete the look.

Then we again render the output on a simulated CRT TV, as appropriate for the 1980s time period.

To load a PNG image or emotion JSON, you can either use the buttons or their hotkeys.

### Troubleshooting

#### Low framerate

The poser is a deep-learning model. Each animation frame requires an inference pass, which takes a lot of compute.

Thus, if you have a CUDA-capable GPU, enable GPU support with the `--talkinghead-gpu` option of *SillyTavern-extras*.

CPU mode is very slow, and without a redesign of the AI model (or distillation, as in the newer [THA4 paper](https://arxiv.org/abs/2311.17409)), there is not much that can be done. It already runs as fast as PyTorch can go, and the performance impact of everything except the posing engine is almost negligible.

#### Low VRAM - what to do?

Observe that the `--talkinghead-gpu` setting is independent of the CUDA device setting of the rest of *SillyTavern-extras*.

So in a low-VRAM environment such as a gaming laptop, you can run just `talkinghead` on the GPU (VRAM usage about 520 MB) to get acceptable animation performance, while running all other extras modules on the CPU. The `classify` and `summarize` AI modules do not require realtime performance, whereas `talkinghead` does.
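For example (the module list here is illustrative; the key point is passing `--talkinghead-gpu` while leaving the rest of the server in CPU mode):

```shell
# talkinghead renders on the GPU; classify and summarize stay on the CPU,
# since the global CUDA option is not enabled.
python server.py --enable-modules=talkinghead,classify,summarize --talkinghead-gpu
```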
#### Missing model at startup

The `separable_float` variant of the THA3 models was previously included in the *SillyTavern-extras* repository. However, `talkinghead` was recently (December 2023) changed to download the models from HuggingFace if necessary, so a local copy of the models is no longer provided in the repository.

Therefore, if you updated your *SillyTavern-extras* installation from *git*, it is likely that *git* deleted your local copy of that particular model, leading to an error message like:

```
FileNotFoundError: Model file /home/xxx/SillyTavern-extras/talkinghead/tha3/models/separable_float/eyebrow_decomposer.pt not found, please check the path.
```

The solution is to remove (or rename) your `SillyTavern-extras/talkinghead/tha3/models` directory, and restart *SillyTavern-extras*. If the model directory does not exist, `talkinghead` will download the models at the first run.

The models are shared between the live mode and the manual poser, so it does not matter which one you run first.

### Creating a character

To create an AI avatar that `talkinghead` understands:

- The image must be of size 512x512, in PNG format.
- **The image must have an alpha channel.**
    - Any pixel with nonzero alpha is part of the character.
    - If the edges of the silhouette look like a cheap photoshop job (especially when ST renders the character on a different background), check them manually for background bleed.
- Using any method you prefer, create a front view of your character within [these specifications](Character_Card_Guide.png).
    - In practice, you can create an image of the character in the correct pose first, and align it as a separate step.
    - If you use Stable Diffusion, see the separate section below.
- **IMPORTANT**: *The character's eyes and mouth must be open*, so that the model sees what they look like when open.
    - See [the THA3 example character](tha3/images/example.png).
    - If that is easier to produce, an open-mouth smile also works.
- To add an alpha channel to an image that has the character otherwise fine, but on a background:
    - In Stable Diffusion, you can try the [rembg](https://github.com/AUTOMATIC1111/stable-diffusion-webui-rembg) extension for Automatic1111 to get a rough first approximation.
    - Also, you can try the *Fuzzy Select* (magic wand) tool in traditional image editors such as GIMP or Photoshop.
    - Manual pixel-per-pixel editing of the edges is recommended for best results. This takes about 20 minutes per character.
        - If you rendered the character on a light background, use a dark background layer when editing the edges, and vice versa. This makes it much easier to see which pixels have background bleed and need to be erased.
- Finally, align the character on the canvas to conform to the placement the THA3 posing engine expects.
    - We recommend using [the THA3 example character](tha3/images/example.png) as an alignment template.
- **IMPORTANT**: Export the final edited image, *without any background layer*, as a PNG with an alpha channel.
- Load up the result into *SillyTavern* as a `talkinghead.png`, and see how well it performs.

#### Tips for Stable Diffusion

It is possible to create a suitable character render with Stable Diffusion. We assume that you already have a local installation of the [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) webui.

### Acknowledgements

This software incorporates the [THA3](https://github.com/pkhungurn/talking-head-anime-3-demo) AI-based anime posing engine developed by Pramook Khungurn. The THA3 code is used under the MIT license, and the THA3 AI models are used under the Creative Commons Attribution 4.0 International license. The THA3 example character is used under the Creative Commons Attribution-NonCommercial 4.0 International license. The trained models are currently mirrored [on HuggingFace](https://huggingface.co/OktayAlpk/talking-head-anime-3).

In this software, the manual poser code has been mostly rewritten, and the live mode code is original to `talkinghead`.