diff --git a/README.md b/README.md index ba50a00..b281eb0 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ There are some folks in the community having success running Extras on their pho We will NOT provide any support for running this on Android. Direct all your questions to the creator of this guide. -#### Talkinghead module on Linux +### Talkinghead module on Linux The manual poser app of `talkinghead` requires the installation of an additional package because it's not installed automatically due to incompatibility with Colab. Run this after you install other requirements: @@ -75,6 +75,8 @@ The manual poser app of `talkinghead` requires the installation of an additional If you only run `talkinghead` in the live mode (i.e. as a SillyTavern-extras module), `wxpython` is no longer required. +The manual poser has two uses. First, it is a GUI editor for the `talkinghead` emotion templates. Secondly, it can batch-generate static emotion sprites from a single talkinghead image (if you want the convenience of AI-powered posing, but don't want to run the live mode). + A fast GPU is heavily recommended. For more information, see the [`talkinghead` README](talkinghead/README.md). ### 💻 Locally @@ -155,7 +157,7 @@ cd SillyTavern-extras | `sd` | Stable Diffusion image generation (remote A1111 server by default) | | `silero-tts` | [Silero TTS server](https://github.com/ouoertheo/silero-api-server) | | `chromadb` | Vector storage server | -| `talkinghead` | Talking Head Sprites | +| `talkinghead` | AI-powered character animation | | `edge-tts` | [Microsoft Edge TTS client](https://github.com/rany2/edge-tts) | | `coqui-tts` | [Coqui TTS server](https://github.com/coqui-ai/TTS) | | `rvc` | Real-time voice cloning | @@ -173,7 +175,9 @@ cd SillyTavern-extras | `--mps` or `--m1` | Run the models on Apple Silicon. Only for M1 and M2 processors. | | `--cuda` | Uses CUDA (GPU+VRAM) to run modules if it is available. Otherwise, falls back to using CPU. 
| | `--cuda-device` | Specifies a CUDA device to use. Defaults to `cuda:0` (first available GPU). | -| `--talkinghead-gpu` | Uses GPU for talkinghead (10x FPS increase in animation). | +| `--talkinghead-gpu` | Uses GPU for talkinghead (10x FPS increase in animation). | +| `--talkinghead-model` | Load a specific THA3 model variant for talkinghead.
Default: `auto` (which is `separable_half` on GPU, `separable_float` on CPU). | +| `--talkinghead-models` | If THA3 models are not yet installed, downloads and installs them.
Expects a HuggingFace model ID.
Default: [OktayAlpk/talking-head-anime-3](https://huggingface.co/OktayAlpk/talking-head-anime-3) | | `--coqui-gpu` | Uses GPU for coqui TTS (if available). | | `--coqui-model` | If provided, downloads and preloads a coqui TTS model. Default: none.
Example: `tts_models/multilingual/multi-dataset/bark` | | `--summarization-model` | Load a custom summarization model.
Expects a HuggingFace model ID.
Default: [Qiliang/bart-large-cnn-samsum-ChatGPT_v3](https://huggingface.co/Qiliang/bart-large-cnn-samsum-ChatGPT_v3) |
@@ -533,34 +537,86 @@ _progress (string, Optional): Show progress bar in terminal.
 #### **Output**
 MP3 audio file.
-### Loads a talkinghead character by specifying the character's image URL.
-`GET /api/talkinghead/load`
-#### **Parameters**
-loadchar (string, required): The URL of the character's image. The URL should point to a PNG image.
-{ "loadchar": "http://localhost:8000/characters/Aqua.png" }
+### Load a talkinghead character
+`POST /api/talkinghead/load`
+#### **Input**
+A `FormData` payload with an image file in a field named `"file"`. The posted file should be a PNG image in RGBA format. The optimal resolution is 512x512. See the [`talkinghead` README](talkinghead/README.md) for details.
 #### **Example**
-'http://localhost:5100/api/talkinghead/load?loadchar=http://localhost:8000/characters/Aqua.png'
+'http://localhost:5100/api/talkinghead/load'
 #### **Output**
 'OK'
-### Animates the talkinghead sprite to start talking.
+### Load talkinghead emotion templates (or reset them to defaults)
+`POST /api/talkinghead/load_emotion_templates`
+#### **Input**
+```
+{"anger": {"eyebrow_angry_left_index": 1.0,
+           ...},
+ "curiosity": {"eyebrow_lowered_left_index": 0.5895,
+               ...},
+ ...}
+```
+For details, see `Animator.load_emotion_templates` in [`talkinghead/tha3/app/app.py`](talkinghead/tha3/app/app.py). This is essentially the format used by [`talkinghead/emotions/_defaults.json`](talkinghead/emotions/_defaults.json).
+
+Any emotion NOT supplied in the posted JSON reverts to its server default. Within any supplied emotion, any morph NOT supplied defaults to zero. This allows keeping the templates short.
+
+To reset all emotion templates to their server defaults, send a blank JSON.
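For instance, a partial override could be built and posted from Python as in the following sketch. The URL assumes a local Extras server on the default port 5100, and the morph values are purely illustrative (not taken from `_defaults.json`); the actual request is commented out so the snippet does not require a running server:

```python
import json

# Partial override: only "anger" is customized here. All other emotions
# revert to server defaults, and any morph not listed defaults to zero.
templates = {
    "anger": {
        "eyebrow_angry_left_index": 1.0,
        "eyebrow_angry_right_index": 1.0,
    }
}

body = json.dumps(templates)
print(body)

# Against a running Extras server, this would be posted as:
# import requests
# requests.post("http://localhost:5100/api/talkinghead/load_emotion_templates",
#               json=templates)
```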
+#### **Output**
+"OK"
+
+### Load talkinghead animator/postprocessor settings (or reset them to defaults)
+`POST /api/talkinghead/load_animator_settings`
+#### **Input**
+```
+{"target_fps": 25,
+ "breathing_cycle_duration": 4.0,
+ "postprocessor_chain": [["bloom", {}],
+                         ["chromatic_aberration", {}],
+                         ["vignetting", {}],
+                         ["translucency", {"alpha": 0.9}],
+                         ["alphanoise", {"magnitude": 0.1, "sigma": 0.0}],
+                         ["banding", {}],
+                         ["scanlines", {"dynamic": true}]],
+ ...}
+```
+For a full list of supported settings, see `animator_defaults` and `Animator.load_animator_settings`, both in [`talkinghead/tha3/app/app.py`](talkinghead/tha3/app/app.py).
+
+Particularly for `"postprocessor_chain"`, see [`talkinghead/tha3/app/postprocessor.py`](talkinghead/tha3/app/postprocessor.py). The postprocessor applies pixel-space glitch artistry, which can, for example, make your talkinghead look like a sci-fi hologram (the example above does exactly this). The postprocessing filters are applied in the order they appear in the list.
+
+To reset all animator/postprocessor settings to their server defaults, send a blank JSON.
+#### **Output**
+"OK"
+
+### Animate the talkinghead character to start talking
 `GET /api/talkinghead/start_talking`
 #### **Example**
 'http://localhost:5100/api/talkinghead/start_talking'
 #### **Output**
-"started"
+"talking started"
-### Animates the talkinghead sprite to stop talking.
+### Animate the talkinghead character to stop talking
 `GET /api/talkinghead/stop_talking`
 #### **Example**
 'http://localhost:5100/api/talkinghead/stop_talking'
 #### **Output**
-"stopped"
+"talking stopped"
-### Outputs the animated talkinghead sprite.
+### Set the talkinghead character's emotion
+`POST /api/talkinghead/set_emotion`
+Available emotions: see `talkinghead/emotions/*.json`. An emotion must be specified; if it is not available, this operation falls back to `"neutral"`, which must always be available. This endpoint is the backend behind the `/emote` slash command in talkinghead mode.
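A minimal client-side sketch (the port 5100 is the server default assumed here; the actual request is commented out so the snippet runs without a live server):

```python
import json

# "emotion_name" should match a template in talkinghead/emotions/*.json;
# unknown names make the server fall back to "neutral".
payload = {"emotion_name": "curiosity"}
body = json.dumps(payload)
print(body)  # {"emotion_name": "curiosity"}

# With a running Extras server:
# import requests
# r = requests.post("http://localhost:5100/api/talkinghead/set_emotion", json=payload)
# print(r.text)
```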
+#### **Input** +``` +{"emotion_name": "curiosity"} +``` +#### **Example** +'http://localhost:5100/api/talkinghead/set_emotion' +#### **Output** +"emotion set to curiosity" + +### Output the animated talkinghead sprite. `GET /api/talkinghead/result_feed` #### **Output** -Animated transparent image +Animated transparent image, each frame a 512x512 PNG image in RGBA format. ### Perform web search `POST /api/websearch` diff --git a/server.py b/server.py index 227fec7..5ffa085 100644 --- a/server.py +++ b/server.py @@ -1,32 +1,39 @@ +#!/usr/bin/python +"""SillyTavern-extras server main program. See `README.md`.""" + +import argparse +import base64 from functools import wraps -from flask import ( - Flask, - jsonify, - request, - Response, - render_template_string, - abort, - send_from_directory, - send_file, -) +import gc +import hashlib +from io import BytesIO +import os +from random import randint +import secrets +import sys +import time +import unicodedata + +from colorama import Fore, Style, init as colorama_init +import markdown + +from PIL import Image + +import torch +from transformers import pipeline + +from flask import (Flask, + jsonify, + request, + Response, + render_template_string, + abort, + send_from_directory, + send_file) from flask_cors import CORS from flask_compress import Compress -import markdown -import argparse -from transformers import pipeline -import unicodedata -import torch -import time -import os -import gc -import sys -import secrets -from PIL import Image -import base64 -from io import BytesIO -from random import randint import webuiapi -import hashlib + from constants import (DEFAULT_SUMMARIZATION_MODEL, DEFAULT_CLASSIFICATION_MODEL, DEFAULT_CAPTIONING_MODEL, @@ -34,7 +41,9 @@ from constants import (DEFAULT_SUMMARIZATION_MODEL, DEFAULT_SD_MODEL, DEFAULT_REMOTE_SD_HOST, DEFAULT_REMOTE_SD_PORT, PROMPT_PREFIX, NEGATIVE_PROMPT, DEFAULT_CUDA_DEVICE, DEFAULT_CHROMA_PORT) -from colorama import Fore, Style, init as colorama_init + +# 
-------------------------------------------------------------------------------- +# Inits that must run before we proceed any further colorama_init() @@ -42,431 +51,309 @@ if sys.hexversion < 0x030b0000: print(f"{Fore.BLUE}{Style.BRIGHT}Python 3.11 or newer is recommended to run this program.{Style.RESET_ALL}") time.sleep(2) -class SplitArgs(argparse.Action): - def __call__(self, parser, namespace, values, option_string=None): - setattr( - namespace, self.dest, values.replace('"', "").replace("'", "").split(",") - ) - -#Setting Root Folders for Silero Generations so it is compatible with STSL, should not effect regular runs. - Rolyat -parent_dir = os.path.dirname(os.path.abspath(__file__)) -SILERO_SAMPLES_PATH = os.path.join(parent_dir, "tts_samples") -SILERO_SAMPLE_TEXT = os.path.join(parent_dir) - -# Create directories if they don't exist -if not os.path.exists(SILERO_SAMPLES_PATH): - os.makedirs(SILERO_SAMPLES_PATH) -if not os.path.exists(SILERO_SAMPLE_TEXT): - os.makedirs(SILERO_SAMPLE_TEXT) - -# Script arguments -parser = argparse.ArgumentParser( - prog="SillyTavern Extras", description="Web API for transformers models" -) -parser.add_argument( - "--port", type=int, help="Specify the port on which the application is hosted" -) -parser.add_argument( - "--listen", action="store_true", help="Host the app on the local network" -) -parser.add_argument( - "--share", action="store_true", help="Share the app on CloudFlare tunnel" -) -parser.add_argument("--cpu", action="store_true", help="Run the models on the CPU") -parser.add_argument("--cuda", action="store_false", dest="cpu", help="Run the models on the GPU") -parser.add_argument("--cuda-device", help="Specify the CUDA device to use") -parser.add_argument("--mps", "--apple", "--m1", "--m2", action="store_false", dest="cpu", help="Run the models on Apple Silicon") -parser.set_defaults(cpu=True) -parser.add_argument("--summarization-model", help="Load a custom summarization model") -parser.add_argument( - 
"--classification-model", help="Load a custom text classification model" -) -parser.add_argument("--captioning-model", help="Load a custom captioning model") -parser.add_argument("--embedding-model", help="Load a custom text embedding model") -parser.add_argument("--chroma-host", help="Host IP for a remote ChromaDB instance") -parser.add_argument("--chroma-port", help="HTTP port for a remote ChromaDB instance (defaults to 8000)") -parser.add_argument("--chroma-folder", help="Path for chromadb persistence folder", default='.chroma_db') -parser.add_argument('--chroma-persist', help="ChromaDB persistence", default=True, action=argparse.BooleanOptionalAction) -parser.add_argument( - "--secure", action="store_true", help="Enforces the use of an API key" -) -parser.add_argument("--talkinghead-gpu", action="store_true", help="Run the talkinghead animation on the GPU (CPU is default)") -parser.add_argument( - "--talkinghead-model", type=str, help="The THA3 model to use. 'float' models are fp32, 'half' are fp16. 'auto' (default) picks fp16 for GPU and fp32 for CPU.", - required=False, default="auto", - choices=["auto", "standard_float", "separable_float", "standard_half", "separable_half"], -) -parser.add_argument( - "--talkinghead-models", metavar="HFREPO", - type=str, help="If THA3 models are not yet installed, use the given HuggingFace repository to install them. 
Defaults to OktayAlpk/talking-head-anime-3.", - default="OktayAlpk/talking-head-anime-3" -) - -parser.add_argument("--coqui-gpu", action="store_true", help="Run the voice models on the GPU (CPU is default)") -parser.add_argument("--coqui-models", help="Install given Coqui-api TTS model at launch (comma separated list, last one will be loaded at start)") - -parser.add_argument("--max-content-length", help="Set the max") -parser.add_argument("--rvc-save-file", action="store_true", help="Save the last rvc input/output audio file into data/tmp/ folder (for research)") - -parser.add_argument("--stt-vosk-model-path", help="Load a custom vosk speech-to-text model") -parser.add_argument("--stt-whisper-model-path", help="Load a custom vosk speech-to-text model") -sd_group = parser.add_mutually_exclusive_group() - -local_sd = parser.add_argument_group("sd-local") -local_sd.add_argument("--sd-model", help="Load a custom SD image generation model") -local_sd.add_argument("--sd-cpu", help="Force the SD pipeline to run on the CPU", action="store_true") - -remote_sd = parser.add_argument_group("sd-remote") -remote_sd.add_argument( - "--sd-remote", action="store_true", help="Use a remote backend for SD" -) -remote_sd.add_argument( - "--sd-remote-host", type=str, help="Specify the host of the remote SD backend" -) -remote_sd.add_argument( - "--sd-remote-port", type=int, help="Specify the port of the remote SD backend" -) -remote_sd.add_argument( - "--sd-remote-ssl", action="store_true", help="Use SSL for the remote SD backend" -) -remote_sd.add_argument( - "--sd-remote-auth", - type=str, - help="Specify the username:password for the remote SD backend (if required)", -) - -parser.add_argument( - "--enable-modules", - action=SplitArgs, - default=[], - help="Override a list of enabled modules", -) - -args = parser.parse_args() - -port = args.port if args.port else 5100 -host = "0.0.0.0" if args.listen else "localhost" -summarization_model = args.summarization_model if 
args.summarization_model else DEFAULT_SUMMARIZATION_MODEL -classification_model = args.classification_model if args.classification_model else DEFAULT_CLASSIFICATION_MODEL -captioning_model = args.captioning_model if args.captioning_model else DEFAULT_CAPTIONING_MODEL -embedding_model = args.embedding_model if args.embedding_model else DEFAULT_EMBEDDING_MODEL - -sd_use_remote = False if args.sd_model else True -sd_model = args.sd_model if args.sd_model else DEFAULT_SD_MODEL -sd_remote_host = args.sd_remote_host if args.sd_remote_host else DEFAULT_REMOTE_SD_HOST -sd_remote_port = args.sd_remote_port if args.sd_remote_port else DEFAULT_REMOTE_SD_PORT -sd_remote_ssl = args.sd_remote_ssl -sd_remote_auth = args.sd_remote_auth - -modules = ( - args.enable_modules if args.enable_modules and len(args.enable_modules) > 0 else [] -) - -if len(modules) == 0: - print( - f"{Fore.RED}{Style.BRIGHT}You did not select any modules to run! Choose them by adding an --enable-modules option" - ) - print(f"Example: --enable-modules=caption,summarize{Style.RESET_ALL}") - -# Models init -cuda_device = DEFAULT_CUDA_DEVICE if not args.cuda_device else args.cuda_device -device_string = cuda_device if torch.cuda.is_available() and not args.cpu else 'mps' if torch.backends.mps.is_available() and not args.cpu else 'cpu' -device = torch.device(device_string) -torch_dtype = torch.float32 if device_string != cuda_device else torch.float16 - -if not torch.cuda.is_available() and not args.cpu: - print(f"{Fore.YELLOW}{Style.BRIGHT}torch-cuda is not supported on this device.{Style.RESET_ALL}") - if not torch.backends.mps.is_available() and not args.cpu: - print(f"{Fore.YELLOW}{Style.BRIGHT}torch-mps is not supported on this device.{Style.RESET_ALL}") - - -print(f"{Fore.GREEN}{Style.BRIGHT}Using torch device: {device_string}{Style.RESET_ALL}") - -if "talkinghead" in modules: - talkinghead_path = os.path.abspath(os.path.join(os.getcwd(), "talkinghead")) - sys.path.append(talkinghead_path) # Add the path 
to the 'tha3' module to the sys.path list - - import sys - import threading - mode = "cuda" if args.talkinghead_gpu else "cpu" - model = args.talkinghead_model - if model == "auto": # default - # FP16 boosts the rendering performance by ~1.5x, but is only supported on GPU. - model = "separable_half" if args.talkinghead_gpu else "separable_float" - print(f"Initializing talkinghead pipeline in {mode} mode with model {model}....") - - try: - from talkinghead.tha3.app.util import maybe_install_models as talkinghead_maybe_install_models - - # Install the THA3 models if needed - talkinghead_models_dir = os.path.join(os.getcwd(), "talkinghead", "tha3", "models") - talkinghead_maybe_install_models(hf_reponame=args.talkinghead_models, modelsdir=talkinghead_models_dir) - - import talkinghead.tha3.app.app as talkinghead - def launch_talkinghead(): - # mode: choices='The device to use for PyTorch ("cuda" for GPU, "cpu" for CPU).' - # model: choices=['standard_float', 'separable_float', 'standard_half', 'separable_half'], - talkinghead.launch(mode, model) - talkinghead_thread = threading.Thread(target=launch_talkinghead) - talkinghead_thread.daemon = True # Set the thread as a daemon thread - talkinghead_thread.start() - - except ModuleNotFoundError: - print("Error: Could not import the 'talkinghead' module.") - -if "caption" in modules: - print("Initializing an image captioning model...") - captioning_pipeline = pipeline('image-to-text', model=captioning_model, device=device_string, torch_dtype=torch_dtype) - -if "summarize" in modules: - print("Initializing a text summarization model...") - summarization_pipeline = pipeline('summarization', model=summarization_model, device=device_string, torch_dtype=torch_dtype) - -if "sd" in modules and not sd_use_remote: - from diffusers import StableDiffusionPipeline - from diffusers import EulerAncestralDiscreteScheduler - - print("Initializing Stable Diffusion pipeline...") - sd_device_string = cuda_device if torch.cuda.is_available() 
else 'mps' if torch.backends.mps.is_available() else 'cpu' - sd_device = torch.device(sd_device_string) - sd_torch_dtype = torch.float32 if sd_device_string != cuda_device else torch.float16 - sd_pipe = StableDiffusionPipeline.from_pretrained( - sd_model, custom_pipeline="lpw_stable_diffusion", torch_dtype=sd_torch_dtype - ).to(sd_device) - sd_pipe.safety_checker = lambda images, clip_input: (images, False) - sd_pipe.enable_attention_slicing() - # pipe.scheduler = KarrasVeScheduler.from_config(pipe.scheduler.config) - sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config( - sd_pipe.scheduler.config - ) -elif "sd" in modules and sd_use_remote: - print("Initializing Stable Diffusion connection") - try: - sd_remote = webuiapi.WebUIApi( - host=sd_remote_host, port=sd_remote_port, use_https=sd_remote_ssl - ) - if sd_remote_auth: - username, password = sd_remote_auth.split(":") - sd_remote.set_auth(username, password) - sd_remote.util_wait_for_ready() - except Exception: - # remote sd from modules - print( - f"{Fore.RED}{Style.BRIGHT}Could not connect to remote SD backend at http{'s' if sd_remote_ssl else ''}://{sd_remote_host}:{sd_remote_port}! Disabling SD module...{Style.RESET_ALL}" - ) - modules.remove("sd") - -if "tts" in modules: - print("tts module is deprecated. 
Please use silero-tts instead.") - modules.remove("tts") - modules.append("silero-tts") - - -if "silero-tts" in modules: - if not os.path.exists(SILERO_SAMPLES_PATH): - os.makedirs(SILERO_SAMPLES_PATH) - print("Initializing Silero TTS server") - from silero_api_server import tts - - tts_service = tts.SileroTtsService(SILERO_SAMPLES_PATH) - if len(os.listdir(SILERO_SAMPLES_PATH)) == 0: - print("Generating Silero TTS samples...") - tts_service.update_sample_text(SILERO_SAMPLE_TEXT) - tts_service.generate_samples() - -if "edge-tts" in modules: - print("Initializing Edge TTS client") - import tts_edge as edge - - -if "chromadb" in modules: - print("Initializing ChromaDB") - import chromadb - import posthog - from chromadb.config import Settings - from chromadb.utils import embedding_functions - - # Assume that the user wants in-memory unless a host is specified - # Also disable chromadb telemetry - posthog.capture = lambda *args, **kwargs: None - if args.chroma_host is None: - if args.chroma_persist: - chromadb_client = chromadb.PersistentClient(path=args.chroma_folder, settings=Settings(anonymized_telemetry=False)) - print(f"ChromaDB is running in-memory with persistence. Persistence is stored in {args.chroma_folder}. 
Can be cleared by deleting the folder or purging db.") - else: - chromadb_client = chromadb.EphemeralClient(Settings(anonymized_telemetry=False)) - print("ChromaDB is running in-memory without persistence.") - else: - chroma_port = args.chroma_port if args.chroma_port else DEFAULT_CHROMA_PORT - chromadb_client = chromadb.HttpClient(host=args.chroma_host, port=chroma_port, settings=Settings(anonymized_telemetry=False)) - print(f"ChromaDB is remotely configured at {args.chroma_host}:{chroma_port}") - - chromadb_embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(embedding_model, device=device_string) - - # Check if the db is connected and running, otherwise tell the user - try: - chromadb_client.heartbeat() - print("Successfully pinged ChromaDB! Your client is successfully connected.") - except Exception: - print("Could not ping ChromaDB! If you are running remotely, please check your host and port!") - -# Flask init app = Flask(__name__) CORS(app) # allow cross-domain requests Compress(app) # compress responses -app.config["MAX_CONTENT_LENGTH"] = 500 * 1024 * 1024 -max_content_length = ( - args.max_content_length - if args.max_content_length - else None) +# will be populated later +args = [] +modules = [] -if max_content_length is not None: - print("Setting MAX_CONTENT_LENGTH to", max_content_length, "Mb") - app.config["MAX_CONTENT_LENGTH"] = int(max_content_length) * 1024 * 1024 - -if "classify" in modules: - import modules.classify.classify_module as classify_module - classify_module.init_text_emotion_classifier(classification_model, device, torch_dtype) - -if "vosk-stt" in modules: - print("Initializing Vosk speech-recognition (from ST request file)") - vosk_model_path = ( - args.stt_vosk_model_path - if args.stt_vosk_model_path - else None) - - import modules.speech_recognition.vosk_module as vosk_module - - vosk_module.model = vosk_module.load_model(file_path=vosk_model_path) - app.add_url_rule("/api/speech-recognition/vosk/process-audio", 
view_func=vosk_module.process_audio, methods=["POST"]) - -if "whisper-stt" in modules: - print("Initializing Whisper speech-recognition (from ST request file)") - whisper_model_path = ( - args.stt_whisper_model_path - if args.stt_whisper_model_path - else None) - - import modules.speech_recognition.whisper_module as whisper_module - - whisper_module.model = whisper_module.load_model(file_path=whisper_model_path) - app.add_url_rule("/api/speech-recognition/whisper/process-audio", view_func=whisper_module.process_audio, methods=["POST"]) - -if "streaming-stt" in modules: - print("Initializing vosk/whisper speech-recognition (from extras server microphone)") - whisper_model_path = ( - args.stt_whisper_model_path - if args.stt_whisper_model_path - else None) - - import modules.speech_recognition.streaming_module as streaming_module - - streaming_module.whisper_model, streaming_module.vosk_model = streaming_module.load_model(file_path=whisper_model_path) - app.add_url_rule("/api/speech-recognition/streaming/record-and-transcript", view_func=streaming_module.record_and_transcript, methods=["POST"]) - -if "rvc" in modules: - print("Initializing RVC voice conversion (from ST request file)") - print("Increasing server upload limit") - rvc_save_file = ( - args.rvc_save_file - if args.rvc_save_file - else False) - - if rvc_save_file: - print("RVC saving file option detected, input/output audio will be savec into data/tmp/ folder") - - import sys - sys.path.insert(0, 'modules/voice_conversion') - - import modules.voice_conversion.rvc_module as rvc_module - rvc_module.save_file = rvc_save_file - - if "classify" in modules: - rvc_module.classification_mode = True - - rvc_module.fix_model_install() - app.add_url_rule("/api/voice-conversion/rvc/get-models-list", view_func=rvc_module.rvc_get_models_list, methods=["POST"]) - app.add_url_rule("/api/voice-conversion/rvc/upload-models", view_func=rvc_module.rvc_upload_models, methods=["POST"]) - 
app.add_url_rule("/api/voice-conversion/rvc/process-audio", view_func=rvc_module.rvc_process_audio, methods=["POST"]) - - -if "coqui-tts" in modules: - mode = "GPU" if args.coqui_gpu else "CPU" - print("Initializing Coqui TTS client in " + mode + " mode") - import modules.text_to_speech.coqui.coqui_module as coqui_module - - if mode == "GPU": - coqui_module.gpu_mode = True - - coqui_models = ( - args.coqui_models - if args.coqui_models - else None - ) - - if coqui_models is not None: - coqui_models = coqui_models.split(",") - for i in coqui_models: - if not coqui_module.install_model(i): - raise ValueError("Coqui model loading failed, most likely a wrong model name in --coqui-models argument, check log above to see which one") - - # Coqui-api models - app.add_url_rule("/api/text-to-speech/coqui/coqui-api/check-model-state", view_func=coqui_module.coqui_check_model_state, methods=["POST"]) - app.add_url_rule("/api/text-to-speech/coqui/coqui-api/install-model", view_func=coqui_module.coqui_install_model, methods=["POST"]) - - # Users models - app.add_url_rule("/api/text-to-speech/coqui/local/get-models", view_func=coqui_module.coqui_get_local_models, methods=["POST"]) - - # Handle both coqui-api/users models - app.add_url_rule("/api/text-to-speech/coqui/generate-tts", view_func=coqui_module.coqui_generate_tts, methods=["POST"]) +# -------------------------------------------------------------------------------- +# General utilities def require_module(name): + """Parametric decorator. 
Mark an API endpoint implementation as requiring the specified module."""
     def wrapper(fn):
         @wraps(fn)
         def decorated_view(*args, **kwargs):
             if name not in modules:
-                abort(403, "Module is disabled by config")
+                abort(403, f"Module '{name}' not enabled in config")
             return fn(*args, **kwargs)
-
         return decorated_view
-
     return wrapper
+def normalize_string(input: str) -> str:
+    return " ".join(unicodedata.normalize("NFKC", input).strip().split())
-# AI stuff
-def classify_text(text: str) -> list:
-    return classify_module.classify_text_emotion(text)
+def image_to_base64(image: Image, quality: int = 75) -> str:
+    buffer = BytesIO()
+    image = image.convert("RGB")  # bind the converted copy; JPEG does not support alpha
+    image.save(buffer, format="JPEG", quality=quality)
+    img_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
+    return img_str
+ignore_auth = []  # will be populated later
+def is_authorize_ignored(request):
+    view_func = app.view_functions.get(request.endpoint)
+    if view_func is not None:
+        if view_func in ignore_auth:
+            return True
+    return False
-def caption_image(raw_image: Image) -> str:
-    caption = captioning_pipeline(raw_image.convert("RGB"))[0]['generated_text']
-    return caption
+# --------------------------------------------------------------------------------
+# Web API and its support functions
+api_key = None  # will be populated later
-def summarize_chunks(text: str) -> str:
+@app.before_request
+def before_request():
+    # Request time measuring
+    request.start_time = time.time()
+
+    # Checks if an API key is present and valid, otherwise return unauthorized
+    # The options check is required so CORS doesn't get angry
     try:
-        return summarize(text)
-    except IndexError:
-        print(
-            "Sequence length too large for model, cutting text in half and calling again"
-        )
-        return summarize_chunks(
-            text[: (len(text) // 2)]
-        ) + summarize_chunks(text[(len(text) // 2):])
+        if request.method != 'OPTIONS' and args.secure and not is_authorize_ignored(request) and getattr(request.authorization, 'token', '') != api_key:
+
print(f"{Fore.RED}{Style.NORMAL}WARNING: Unauthorized API key access from {request.remote_addr}{Style.RESET_ALL}") + response = jsonify({'error': '401: Invalid API key'}) + response.status_code = 401 + return response + except Exception as e: + print(f"API key check error: {e}") + return "401 Unauthorized\n{}\n\n".format(e), 401 +@app.after_request +def after_request(response): + duration = time.time() - request.start_time + response.headers["X-Request-Duration"] = str(duration) + return response -def summarize(text: str) -> str: +@app.route("/", methods=["GET"]) +def index(): + with open("./README.md", "r", encoding="utf8") as f: + content = f.read() + return render_template_string(markdown.markdown(content, extensions=["tables"])) + +@app.route("/api/extensions", methods=["GET"]) +def get_extensions(): + extensions = dict( + { + "extensions": [ + { + "name": "not-supported", + "metadata": { + "display_name": """Extensions serving using Extensions API is no longer supported. Please update the mod from: https://github.com/Cohee1207/SillyTavern""", + "requires": [], + "assets": [], + }, + } + ] + } + ) + return jsonify(extensions) + +# ---------------------------------------- +# caption + +captioning_pipeline = None # populated when the module is loaded +def _caption_image(raw_image: Image) -> str: + return captioning_pipeline(raw_image.convert("RGB"))[0]['generated_text'] + +@app.route("/api/caption", methods=["POST"]) +@require_module("caption") +def api_caption(): + data = request.get_json() + + if "image" not in data or not isinstance(data["image"], str): + abort(400, '"image" is required') + + image = Image.open(BytesIO(base64.b64decode(data["image"]))) + image = image.convert("RGB") + image.thumbnail((512, 512)) + caption = _caption_image(image) + thumbnail = image_to_base64(image) + print("Caption:", caption, sep="\n") + gc.collect() + return jsonify({"caption": caption, "thumbnail": thumbnail}) + +# ---------------------------------------- +# summarize + 
+summarization_pipeline = None # populated when the module is loaded
+
+def _summarize(text: str) -> str:
+    summary = normalize_string(summarization_pipeline(text)[0]['summary_text'])
+    return summary
+
+def _summarize_chunks(text: str) -> str:
+    """Summarize `text`, chunking it if necessary."""
+    try:
+        return _summarize(text)
+    except IndexError:
+        print("Sequence length too large for model, cutting text in half and calling again")
+        return (_summarize_chunks(text[:(len(text) // 2)]) +
+                _summarize_chunks(text[(len(text) // 2):]))
 
-def normalize_string(input: str) -> str:
-    output = " ".join(unicodedata.normalize("NFKC", input).strip().split())
-    return output
+@app.route("/api/summarize", methods=["POST"])
+@require_module("summarize")
+def api_summarize():
+    """Summarize the text posted in the request. Return the summary."""
+    data = request.get_json()
+    if "text" not in data or not isinstance(data["text"], str):
+        abort(400, '"text" is required')
 
-def generate_image(data: dict) -> Image:
+    print("Summary input:", data["text"], sep="\n")
+    summary = _summarize_chunks(data["text"])
+    print("Summary output:", summary, sep="\n")
+    gc.collect()
+    return jsonify({"summary": summary})
+
+# ----------------------------------------
+# classify
+
+classify_module = None # populated when the module is loaded
+
+def _classify_text(text: str) -> list:
+    return classify_module.classify_text_emotion(text)
+
+@app.route("/api/classify", methods=["POST"])
+@require_module("classify")
+def api_classify():
+    """Perform sentiment analysis (classification) on the text posted in the request. Return the result.
+
+    Also, if `talkinghead` is enabled, automatically update its emotion based on the classification result.
+    """
+    data = request.get_json()
+
+    if "text" not in data or not isinstance(data["text"], str):
+        abort(400, '"text" is required')
+
+    print("Classification input:", data["text"], sep="\n")
+    classification = _classify_text(data["text"])
+    print("Classification output:", classification, sep="\n")
+    gc.collect()
+    # TODO: Feature orthogonality: would be better if the client called the `set_emotion` endpoint explicitly
+    # also when it uses `classify`, if it intends to update the talkinghead state.
+    if "talkinghead" in modules:  # send emotion to talkinghead
+        print("Updating talkinghead emotion from classification results")
+        talkinghead.set_emotion_from_classification(classification)
+    return jsonify({"classification": classification})
+
+@app.route("/api/classify/labels", methods=["GET"])
+@require_module("classify")
+def api_classify_labels():
+    """Return the available classifier labels for text sentiment (character emotion)."""
+    classification = _classify_text("")
+    labels = [x["label"] for x in classification]
+    if "talkinghead" in modules:
+        labels.append('talkinghead')  # Add 'talkinghead' to the labels list
+    return jsonify({"labels": labels})
+
+# ----------------------------------------
+# talkinghead
+
+talkinghead = None # populated when the module is loaded
+
+@app.route("/api/talkinghead/load", methods=["POST"])
+@require_module("talkinghead")
+def api_talkinghead_load():
+    """Load the talkinghead sprite posted in the request. Resume animation if the talkinghead module was paused."""
+    file = request.files['file']
+    # convert stream to bytes and pass to talkinghead
+    return talkinghead.talkinghead_load_file(file.stream)
+
+@app.route('/api/talkinghead/load_emotion_templates', methods=["POST"])
+@require_module("talkinghead")
+def api_talkinghead_load_emotion_templates():
+    """Load custom emotion templates for talkinghead, or reset to defaults.
+
+    Input format is JSON::
+
+        {"emotion0": {"morph0": value0,
+                      ...}
+         ...}
+
+    For details, see `Animator.load_emotion_templates` in `talkinghead/tha3/app/app.py`.
+
+    To reload server defaults, send a blank JSON.
+
+    This API endpoint becomes available after the talkinghead has been launched.
+    """
+    if talkinghead.global_animator_instance is None:
+        abort(400, 'talkinghead not launched')
+    data = request.get_json()
+    if not len(data):
+        data = None  # sending `None` to talkinghead will reset to defaults
+    talkinghead.global_animator_instance.load_emotion_templates(data)
+    return "OK"
+
+@app.route('/api/talkinghead/load_animator_settings', methods=["POST"])
+@require_module("talkinghead")
+def api_talkinghead_load_animator_settings():
+    """Load custom settings for talkinghead animator and postprocessor, or reset to defaults.
+
+    Input format is JSON::
+
+        {"name0": value0,
+         ...}
+
+    For details, see `Animator.load_animator_settings` in `talkinghead/tha3/app/app.py`.
+
+    To reload server defaults, send a blank JSON.
+
+    This API endpoint becomes available after the talkinghead has been launched.
+    """
+    if talkinghead.global_animator_instance is None:
+        abort(400, 'talkinghead not launched')
+    data = request.get_json()
+    if not len(data):
+        data = None  # sending `None` to talkinghead will reset to defaults
+    talkinghead.global_animator_instance.load_animator_settings(data)
+    return "OK"
+
+@app.route('/api/talkinghead/unload')
+@require_module("talkinghead")
+def api_talkinghead_unload():
+    """Pause the talkinghead module. To resume, load a character via '/api/talkinghead/load'."""
+    return talkinghead.unload()
+
+@app.route('/api/talkinghead/start_talking')
+@require_module("talkinghead")
+def api_talkinghead_start_talking():
+    """Start the mouth animation for talking."""
+    return talkinghead.start_talking()
+
+@app.route('/api/talkinghead/stop_talking')
+@require_module("talkinghead")
+def api_talkinghead_stop_talking():
+    """Stop the mouth animation for talking."""
+    return talkinghead.stop_talking()
+
+@app.route('/api/talkinghead/set_emotion', methods=["POST"])
+@require_module("talkinghead")
+def api_talkinghead_set_emotion():
+    """Set talkinghead character emotion to that posted in the request.
+
+    Input format is JSON::
+
+        {"emotion_name": "curiosity"}
+
+    where the key "emotion_name" is literal, and the value is the emotion to set.
+
+    There is no getter, because SillyTavern keeps its state in the frontend
+    and the plugins only act as slaves (in the technological sense of the word).
+    """
+    data = request.get_json()
+    if "emotion_name" not in data or not isinstance(data["emotion_name"], str):
+        abort(400, '"emotion_name" is required')
+    emotion_name = data["emotion_name"]
+    return talkinghead.set_emotion(emotion_name)
+
+@app.route('/api/talkinghead/result_feed')
+@require_module("talkinghead")
+def api_talkinghead_result_feed():
+    """Live character output. Stream of video frames, each as a PNG encoded image."""
+    return talkinghead.result_feed()
+
+# ----------------------------------------
+# sd
+
+sd_use_remote = None # populated when the module is loaded
+sd_pipe = None
+sd_remote = None
+sd_model = None
+
+def _generate_image(data: dict) -> Image:
     prompt = normalize_string(f'{data["prompt_prefix"]} {data["prompt"]}')
 
     if sd_use_remote:
@@ -498,206 +385,6 @@ def generate_image(data: dict) -> Image:
     image.save("./debug.png")
     return image
 
-
-def image_to_base64(image: Image, quality: int = 75) -> str:
-    buffer = BytesIO()
-    image.convert("RGB")
-    image.save(buffer, format="JPEG", quality=quality)
-    img_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
-    return img_str
-
-ignore_auth = []
-# Reads an API key from an already existing file. If that file doesn't exist, create it.
-if args.secure:
-    try:
-        with open("api_key.txt", "r") as txt:
-            api_key = txt.read().replace('\n', '')
-    except Exception:
-        api_key = secrets.token_hex(5)
-        with open("api_key.txt", "w") as txt:
-            txt.write(api_key)
-
-    print(f"{Fore.YELLOW}{Style.BRIGHT}Your API key is {api_key}{Style.RESET_ALL}")
-elif args.share and not args.secure:
-    print(f"{Fore.RED}{Style.BRIGHT}WARNING: This instance is publicly exposed without an API key! It is highly recommended to restart with the \"--secure\" argument!{Style.RESET_ALL}")
-else:
-    print(f"{Fore.YELLOW}{Style.BRIGHT}No API key given because you are running locally.{Style.RESET_ALL}")
-
-
-def is_authorize_ignored(request):
-    view_func = app.view_functions.get(request.endpoint)
-
-    if view_func is not None:
-        if view_func in ignore_auth:
-            return True
-    return False
-
-
-@app.before_request
-def before_request():
-    # Request time measuring
-    request.start_time = time.time()
-
-    # Checks if an API key is present and valid, otherwise return unauthorized
-    # The options check is required so CORS doesn't get angry
-    try:
-        if request.method != 'OPTIONS' and args.secure and not is_authorize_ignored(request) and getattr(request.authorization, 'token', '') != api_key:
-            print(f"{Fore.RED}{Style.NORMAL}WARNING: Unauthorized API key access from {request.remote_addr}{Style.RESET_ALL}")
-            response = jsonify({'error': '401: Invalid API key'})
-            response.status_code = 401
-            return response
-    except Exception as e:
-        print(f"API key check error: {e}")
-        return "401 Unauthorized\n{}\n\n".format(e), 401
-
-
-@app.after_request
-def after_request(response):
-    duration = time.time() - request.start_time
-    response.headers["X-Request-Duration"] = str(duration)
-    return response
-
-
-@app.route("/", methods=["GET"])
-def index():
-    with open("./README.md", "r", encoding="utf8") as f:
-        content = f.read()
-    return render_template_string(markdown.markdown(content, extensions=["tables"]))
-
-
-@app.route("/api/extensions", methods=["GET"])
-def get_extensions():
-    extensions = dict(
-        {
-            "extensions": [
-                {
-                    "name": "not-supported",
-                    "metadata": {
-                        "display_name": """Extensions serving using Extensions API is no longer supported. Please update the mod from: https://github.com/Cohee1207/SillyTavern""",
-                        "requires": [],
-                        "assets": [],
-                    },
-                }
-            ]
-        }
-    )
-    return jsonify(extensions)
-
-
-@app.route("/api/caption", methods=["POST"])
-@require_module("caption")
-def api_caption():
-    data = request.get_json()
-
-    if "image" not in data or not isinstance(data["image"], str):
-        abort(400, '"image" is required')
-
-    image = Image.open(BytesIO(base64.b64decode(data["image"])))
-    image = image.convert("RGB")
-    image.thumbnail((512, 512))
-    caption = caption_image(image)
-    thumbnail = image_to_base64(image)
-    print("Caption:", caption, sep="\n")
-    gc.collect()
-    return jsonify({"caption": caption, "thumbnail": thumbnail})
-
-
-@app.route("/api/summarize", methods=["POST"])
-@require_module("summarize")
-def api_summarize():
-    """Summarize the text posted in the request. Return the summary."""
-    data = request.get_json()
-
-    if "text" not in data or not isinstance(data["text"], str):
-        abort(400, '"text" is required')
-
-    print("Summary input:", data["text"], sep="\n")
-    summary = summarize_chunks(data["text"])
-    print("Summary output:", summary, sep="\n")
-    gc.collect()
-    return jsonify({"summary": summary})
-
-
-@app.route("/api/classify", methods=["POST"])
-@require_module("classify")
-def api_classify():
-    """Perform sentiment analysis (classification) on the text posted in the request. Return the result.
-
-    Also, if `talkinghead` is enabled, automatically update its emotion based on the classification result.
-    """
-    data = request.get_json()
-
-    if "text" not in data or not isinstance(data["text"], str):
-        abort(400, '"text" is required')
-
-    print("Classification input:", data["text"], sep="\n")
-    classification = classify_text(data["text"])
-    print("Classification output:", classification, sep="\n")
-    gc.collect()
-    # TODO: Feature orthogonality: would be better if the client called the `set_emotion` endpoint explicitly
-    # also when it uses `classify`, if it intends to update the talkinghead state.
-    if "talkinghead" in modules:  # send emotion to talkinghead
-        print("Updating talkinghead emotion from classification results")
-        talkinghead.set_emotion_from_classification(classification)
-    return jsonify({"classification": classification})
-
-
-@app.route("/api/classify/labels", methods=["GET"])
-@require_module("classify")
-def api_classify_labels():
-    """Return the available classifier labels for text sentiment (character emotion)."""
-    classification = classify_text("")
-    labels = [x["label"] for x in classification]
-    if "talkinghead" in modules:
-        labels.append('talkinghead')  # Add 'talkinghead' to the labels list
-    return jsonify({"labels": labels})
-
-@app.route("/api/talkinghead/load", methods=["POST"])
-@require_module("talkinghead")
-def api_talkinghead_load():
-    """Load the talkinghead sprite posted in the request. Resume animation if paused."""
-    file = request.files['file']
-    # convert stream to bytes and pass to talkinghead
-    return talkinghead.talkinghead_load_file(file.stream)
-
-@app.route('/api/talkinghead/unload')
-@require_module("talkinghead")
-def api_talkinghead_unload():
-    """Pause talkinghead animation. Can be enabled again via '/api/talkinghead/load'."""
-    return talkinghead.unload()
-
-@app.route('/api/talkinghead/start_talking')
-@require_module("talkinghead")
-def api_talkinghead_start_talking():
-    """Start the mouth animation for talking."""
-    return talkinghead.start_talking()
-
-@app.route('/api/talkinghead/stop_talking')
-@require_module("talkinghead")
-def api_talkinghead_stop_talking():
-    """Stop the mouth animation for talking."""
-    return talkinghead.stop_talking()
-
-@app.route('/api/talkinghead/set_emotion', methods=["POST"])
-@require_module("talkinghead")
-def api_talkinghead_set_emotion():
-    """Set talkinghead character emotion to that posted in the request.
-
-    There is no getter, because SillyTavern keeps its state in the frontend
-    and the plugins only act as slaves (in the technological sense of the word).
-    """
-    data = request.get_json()
-    if "emotion_name" not in data or not isinstance(data["emotion_name"], str):
-        abort(400, '"emotion_name" is required')
-    emotion_name = data["emotion_name"]
-    return talkinghead.set_emotion(emotion_name)
-
-@app.route('/api/talkinghead/result_feed')
-@require_module("talkinghead")
-def api_talkinghead_result_feed():
-    """Live character output. Stream of video frames, each as a PNG encoded image."""
-    return talkinghead.result_feed()
-
 @app.route("/api/image", methods=["POST"])
 @require_module("sd")
 def api_image():
@@ -736,13 +423,12 @@ def api_image():
     try:
         print("SD inputs:", data, sep="\n")
-        image = generate_image(data)
+        image = _generate_image(data)
         base64image = image_to_base64(image, quality=90)
         return jsonify({"image": base64image})
     except RuntimeError as e:
         abort(400, str(e))
 
-
 @app.route("/api/image/model", methods=["POST"])
 @require_module("sd")
 def api_image_model_set():
@@ -761,7 +447,6 @@ def api_image_model_set():
 
     return jsonify({"previous_model": old_model, "current_model": new_model})
 
-
 @app.route("/api/image/model", methods=["GET"])
 @require_module("sd")
 def api_image_model_get():
@@ -772,7 +457,6 @@ def api_image_model_get():
 
     return jsonify({"model": model})
 
-
 @app.route("/api/image/models", methods=["GET"])
 @require_module("sd")
 def api_image_models():
@@ -783,7 +467,6 @@ def api_image_models():
 
     return jsonify({"models": models})
 
-
 @app.route("/api/image/samplers", methods=["GET"])
 @require_module("sd")
 def api_image_samplers():
@@ -794,11 +477,14 @@ def api_image_samplers():
 
     return jsonify({"samplers": samplers})
 
-
 @app.route("/api/modules", methods=["GET"])
 def get_modules():
     return jsonify({"modules": modules})
 
+# ----------------------------------------
+# tts
+
+tts_service = None # populated when the module is loaded
 
 @app.route("/api/tts/speakers", methods=["GET"])
 @require_module("silero-tts")
@@ -838,12 +524,15 @@ def tts_generate():
         print(e)
         abort(500, voice["speaker"])
 
-
 @app.route("/api/tts/sample/<speaker>", methods=["GET"])
 @require_module("silero-tts")
 def tts_play_sample(speaker: str):
     return send_from_directory(SILERO_SAMPLES_PATH, f"{speaker}.wav")
 
+# ----------------------------------------
+# edge-tts
+
+edge = None # populated when the module is loaded
 
 @app.route("/api/edge-tts/list", methods=["GET"])
 @require_module("edge-tts")
@@ -851,7 +540,6 @@ def edge_tts_list():
     voices = edge.get_voices()
     return jsonify(voices)
 
-
 @app.route("/api/edge-tts/generate", methods=["POST"])
 @require_module("edge-tts")
 def edge_tts_generate():
@@ -873,6 +561,11 @@ def edge_tts_generate():
         print(e)
         abort(500, data["voice"])
 
+# ----------------------------------------
+# chromadb
+
+chromadb_client = None # populated when the module is loaded
+chromadb_embed_fn = None
 
 @app.route("/api/chromadb", methods=["POST"])
 @require_module("chromadb")
@@ -903,7 +596,6 @@ def chromadb_add_messages():
 
     return jsonify({"count": len(ids)})
 
-
 @app.route("/api/chromadb/purge", methods=["POST"])
 @require_module("chromadb")
 def chromadb_purge():
@@ -921,7 +613,6 @@ def chromadb_purge():
 
     return 'Ok', 200
 
-
 @app.route("/api/chromadb/query", methods=["POST"])
 @require_module("chromadb")
 def chromadb_query():
@@ -1033,7 +724,6 @@ def chromadb_multiquery():
 
     return jsonify(messages)
 
-
 @app.route("/api/chromadb/export", methods=["POST"])
 @require_module("chromadb")
 def chromadb_export():
@@ -1095,6 +785,8 @@ def chromadb_import():
 
     return jsonify({"count": len(ids)})
 
+# ----------------------------------------
+# websearch
 
 @app.route("/api/websearch", methods=["POST"])
 @require_module("websearch")
 def api_websearch():
@@ -1114,18 +806,393 @@ def api_websearch():
 
     return jsonify({"results": results[0], "links": results[1]})
 
+# --------------------------------------------------------------------------------
+# Main program
+
+# Setting root folders for Silero generations so it is compatible with STSL; should not affect regular runs. - Rolyat
+parent_dir = os.path.dirname(os.path.abspath(__file__))
+SILERO_SAMPLES_PATH = os.path.join(parent_dir, "tts_samples")
+SILERO_SAMPLE_TEXT = os.path.join(parent_dir)
+
+# Create directories if they don't exist
+if not os.path.exists(SILERO_SAMPLES_PATH):
+    os.makedirs(SILERO_SAMPLES_PATH)
+if not os.path.exists(SILERO_SAMPLE_TEXT):
+    os.makedirs(SILERO_SAMPLE_TEXT)
+
+# ----------------------------------------
+# Script arguments
+
+parser = argparse.ArgumentParser(
+    prog="SillyTavern Extras", description="Web API for transformers models"
+)
+parser.add_argument(
+    "--port", type=int, help="Specify the port on which the application is hosted"
+)
+parser.add_argument(
+    "--listen", action="store_true", help="Host the app on the local network"
+)
+parser.add_argument(
+    "--share", action="store_true", help="Share the app on CloudFlare tunnel"
+)
+parser.add_argument("--cpu", action="store_true", help="Run the models on the CPU")
+parser.add_argument("--cuda", action="store_false", dest="cpu", help="Run the models on the GPU")
+parser.add_argument("--cuda-device", help="Specify the CUDA device to use")
+parser.add_argument("--mps", "--apple", "--m1", "--m2", action="store_false", dest="cpu", help="Run the models on Apple Silicon")
+parser.set_defaults(cpu=True)
+parser.add_argument("--summarization-model", help="Load a custom summarization model")
+parser.add_argument(
+    "--classification-model", help="Load a custom text classification model"
+)
+parser.add_argument("--captioning-model", help="Load a custom captioning model")
+parser.add_argument("--embedding-model", help="Load a custom text embedding model")
+parser.add_argument("--chroma-host", help="Host IP for a remote ChromaDB instance")
+parser.add_argument("--chroma-port", help="HTTP port for a remote ChromaDB instance (defaults to 8000)")
+parser.add_argument("--chroma-folder", help="Path for chromadb persistence folder", default='.chroma_db')
+parser.add_argument('--chroma-persist', help="ChromaDB persistence", default=True, action=argparse.BooleanOptionalAction)
+parser.add_argument(
+    "--secure", action="store_true", help="Enforces the use of an API key"
+)
+parser.add_argument("--talkinghead-gpu", action="store_true", help="Run the talkinghead animation on the GPU (CPU is default)")
+parser.add_argument(
+    "--talkinghead-model", type=str, help="The THA3 model to use. 'float' models are fp32, 'half' are fp16. 'auto' (default) picks fp16 for GPU and fp32 for CPU.",
+    required=False, default="auto",
+    choices=["auto", "standard_float", "separable_float", "standard_half", "separable_half"],
+)
+parser.add_argument(
+    "--talkinghead-models", metavar="HFREPO",
+    type=str, help="If THA3 models are not yet installed, use the given HuggingFace repository to install them. Defaults to OktayAlpk/talking-head-anime-3.",
+    default="OktayAlpk/talking-head-anime-3"
+)
+
+parser.add_argument("--coqui-gpu", action="store_true", help="Run the voice models on the GPU (CPU is default)")
+parser.add_argument("--coqui-models", help="Install given Coqui-api TTS model at launch (comma separated list, last one will be loaded at start)")
+
+parser.add_argument("--max-content-length", help="Set the maximum content length the server accepts, in MB")
+parser.add_argument("--rvc-save-file", action="store_true", help="Save the last rvc input/output audio file into data/tmp/ folder (for research)")
+
+parser.add_argument("--stt-vosk-model-path", help="Load a custom vosk speech-to-text model")
+parser.add_argument("--stt-whisper-model-path", help="Load a custom whisper speech-to-text model")
+# sd_group = parser.add_mutually_exclusive_group()
+
+local_sd = parser.add_argument_group("sd-local")
+local_sd.add_argument("--sd-model", help="Load a custom SD image generation model")
+local_sd.add_argument("--sd-cpu", help="Force the SD pipeline to run on the CPU", action="store_true")
+
+remote_sd = parser.add_argument_group("sd-remote")
+remote_sd.add_argument(
+    "--sd-remote", action="store_true", help="Use a remote backend for SD"
+)
+remote_sd.add_argument(
+    "--sd-remote-host", type=str, help="Specify the host of the remote SD backend"
+)
+remote_sd.add_argument(
+    "--sd-remote-port", type=int, help="Specify the port of the remote SD backend"
+)
+remote_sd.add_argument(
+    "--sd-remote-ssl", action="store_true", help="Use SSL for the remote SD backend"
+)
+remote_sd.add_argument(
+    "--sd-remote-auth",
+    type=str,
+    help="Specify the username:password for the remote SD backend (if required)",
+)
+
+class SplitArgs(argparse.Action):
+    """Remove quotes and split a comma-delimited list into a python list."""
+    def __call__(self, parser, namespace, values, option_string=None):
+        setattr(namespace, self.dest, values.replace('"', "").replace("'", "").split(","))
+
+parser.add_argument(
+    "--enable-modules",
+    action=SplitArgs,
+    default=[],
+    help="Override a list of enabled modules",
+)
+
+args = parser.parse_args()
+
+port = args.port if args.port else 5100
+host = "0.0.0.0" if args.listen else "localhost"
+summarization_model = args.summarization_model if args.summarization_model else DEFAULT_SUMMARIZATION_MODEL
+classification_model = args.classification_model if args.classification_model else DEFAULT_CLASSIFICATION_MODEL
+captioning_model = args.captioning_model if args.captioning_model else DEFAULT_CAPTIONING_MODEL
+embedding_model = args.embedding_model if args.embedding_model else DEFAULT_EMBEDDING_MODEL
+
+sd_use_remote = False if args.sd_model else True
+sd_model = args.sd_model if args.sd_model else DEFAULT_SD_MODEL
+sd_remote_host = args.sd_remote_host if args.sd_remote_host else DEFAULT_REMOTE_SD_HOST
+sd_remote_port = args.sd_remote_port if args.sd_remote_port else DEFAULT_REMOTE_SD_PORT
+sd_remote_ssl = args.sd_remote_ssl
+sd_remote_auth = args.sd_remote_auth
+
+modules = args.enable_modules if args.enable_modules else []
+
+if not modules:
+    print(f"{Fore.RED}{Style.BRIGHT}You did not select any modules to run! Choose them by adding an --enable-modules option")
+    print(f"Example: --enable-modules=caption,summarize{Style.RESET_ALL}")
+
+# ----------------------------------------
+# Flask init
+
+app.config["MAX_CONTENT_LENGTH"] = 500 * 1024 * 1024
+max_content_length = args.max_content_length if args.max_content_length else None
+if max_content_length is not None:
+    print("Setting MAX_CONTENT_LENGTH to", max_content_length, "Mb")
+    app.config["MAX_CONTENT_LENGTH"] = int(max_content_length) * 1024 * 1024
+
+# ----------------------------------------
+# Modules init
+
+cuda_device = DEFAULT_CUDA_DEVICE if not args.cuda_device else args.cuda_device
+device_string = cuda_device if torch.cuda.is_available() and not args.cpu else 'mps' if torch.backends.mps.is_available() and not args.cpu else 'cpu'
+device = torch.device(device_string)
+torch_dtype = torch.float32 if device_string != cuda_device else torch.float16
+
+if not torch.cuda.is_available() and not args.cpu:
+    print(f"{Fore.YELLOW}{Style.BRIGHT}torch-cuda is not supported on this device.{Style.RESET_ALL}")
+    if not torch.backends.mps.is_available() and not args.cpu:
+        print(f"{Fore.YELLOW}{Style.BRIGHT}torch-mps is not supported on this device.{Style.RESET_ALL}")
+
+print(f"{Fore.GREEN}{Style.BRIGHT}Using torch device: {device_string}{Style.RESET_ALL}")
+
+if "talkinghead" in modules:
+    talkinghead_path = os.path.abspath(os.path.join(os.getcwd(), "talkinghead"))
+    sys.path.append(talkinghead_path)  # Add the path to the 'tha3' module to the sys.path list
+
+    mode = "cuda" if args.talkinghead_gpu else "cpu"
+    model = args.talkinghead_model
+    if model == "auto":  # default
+        # FP16 boosts the rendering performance by ~1.5x, but is only supported on GPU.
+        model = "separable_half" if args.talkinghead_gpu else "separable_float"
+    print(f"Initializing {Fore.GREEN}{Style.BRIGHT}talkinghead{Style.RESET_ALL} pipeline in {Fore.GREEN}{Style.BRIGHT}{mode}{Style.RESET_ALL} mode with model {Fore.GREEN}{Style.BRIGHT}{model}{Style.RESET_ALL}...")
+
+    try:
+        from talkinghead.tha3.app.util import maybe_install_models as talkinghead_maybe_install_models
+
+        # Install the THA3 models if needed
+        talkinghead_models_dir = os.path.join(os.getcwd(), "talkinghead", "tha3", "models")
+        talkinghead_maybe_install_models(hf_reponame=args.talkinghead_models, modelsdir=talkinghead_models_dir)
+
+        import talkinghead.tha3.app.app as talkinghead
+        # mode: the device to use for PyTorch ("cuda" for GPU, "cpu" for CPU).
+        # model: one of 'standard_float', 'separable_float', 'standard_half', 'separable_half'.
+        talkinghead.launch(mode, model)
+
+    except ModuleNotFoundError:
+        print("Error: Could not import the 'talkinghead' module.")
+
+if "caption" in modules:
+    print("Initializing an image captioning model...")
+    captioning_pipeline = pipeline('image-to-text', model=captioning_model, device=device_string, torch_dtype=torch_dtype)
+
+if "summarize" in modules:
+    print("Initializing a text summarization model...")
+    summarization_pipeline = pipeline('summarization', model=summarization_model, device=device_string, torch_dtype=torch_dtype)
+
+if "sd" in modules and not sd_use_remote:
+    from diffusers import StableDiffusionPipeline
+    from diffusers import EulerAncestralDiscreteScheduler
+
+    print("Initializing Stable Diffusion pipeline...")
+    sd_device_string = cuda_device if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
+    sd_device = torch.device(sd_device_string)
+    sd_torch_dtype = torch.float32 if sd_device_string != cuda_device else torch.float16
+    sd_pipe = StableDiffusionPipeline.from_pretrained(
+        sd_model, custom_pipeline="lpw_stable_diffusion", torch_dtype=sd_torch_dtype
+    ).to(sd_device)
+    sd_pipe.safety_checker = lambda images, clip_input: (images, False)
+    sd_pipe.enable_attention_slicing()
+    # pipe.scheduler = KarrasVeScheduler.from_config(pipe.scheduler.config)
+    sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
+        sd_pipe.scheduler.config
+    )
+elif "sd" in modules and sd_use_remote:
+    print("Initializing Stable Diffusion connection")
+    try:
+        sd_remote = webuiapi.WebUIApi(
+            host=sd_remote_host, port=sd_remote_port, use_https=sd_remote_ssl
+        )
+        if sd_remote_auth:
+            username, password = sd_remote_auth.split(":")
+            sd_remote.set_auth(username, password)
+        sd_remote.util_wait_for_ready()
+    except Exception:
+        # remote sd from modules
+        print(
+            f"{Fore.RED}{Style.BRIGHT}Could not connect to remote SD backend at http{'s' if sd_remote_ssl else ''}://{sd_remote_host}:{sd_remote_port}! Disabling SD module...{Style.RESET_ALL}"
+        )
+        modules.remove("sd")
+
+if "tts" in modules:
+    print("tts module is deprecated. Please use silero-tts instead.")
+    modules.remove("tts")
+    modules.append("silero-tts")
+
+if "silero-tts" in modules:
+    if not os.path.exists(SILERO_SAMPLES_PATH):
+        os.makedirs(SILERO_SAMPLES_PATH)
+    print("Initializing Silero TTS server")
+    from silero_api_server import tts
+
+    tts_service = tts.SileroTtsService(SILERO_SAMPLES_PATH)
+    if len(os.listdir(SILERO_SAMPLES_PATH)) == 0:
+        print("Generating Silero TTS samples...")
+        tts_service.update_sample_text(SILERO_SAMPLE_TEXT)
+        tts_service.generate_samples()
+
+if "edge-tts" in modules:
+    print("Initializing Edge TTS client")
+    import tts_edge as edge
+
+if "chromadb" in modules:
+    print("Initializing ChromaDB")
+    import chromadb
+    import posthog
+    from chromadb.config import Settings
+    from chromadb.utils import embedding_functions
+
+    # Assume that the user wants in-memory unless a host is specified
+    # Also disable chromadb telemetry
+    posthog.capture = lambda *args, **kwargs: None
+    if args.chroma_host is None:
+        if args.chroma_persist:
+            chromadb_client = chromadb.PersistentClient(path=args.chroma_folder, settings=Settings(anonymized_telemetry=False))
+            print(f"ChromaDB is running in-memory with persistence. Persistence is stored in {args.chroma_folder}. Can be cleared by deleting the folder or purging db.")
+        else:
+            chromadb_client = chromadb.EphemeralClient(Settings(anonymized_telemetry=False))
+            print("ChromaDB is running in-memory without persistence.")
+    else:
+        chroma_port = args.chroma_port if args.chroma_port else DEFAULT_CHROMA_PORT
+        chromadb_client = chromadb.HttpClient(host=args.chroma_host, port=chroma_port, settings=Settings(anonymized_telemetry=False))
+        print(f"ChromaDB is remotely configured at {args.chroma_host}:{chroma_port}")
+
+    chromadb_embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(embedding_model, device=device_string)
+
+    # Check if the db is connected and running, otherwise tell the user
+    try:
+        chromadb_client.heartbeat()
+        print("Successfully pinged ChromaDB! Your client is successfully connected.")
+    except Exception:
+        print("Could not ping ChromaDB! If you are running remotely, please check your host and port!")
+
+if "classify" in modules:
+    import modules.classify.classify_module as classify_module
+    classify_module.init_text_emotion_classifier(classification_model, device, torch_dtype)
+
+if "vosk-stt" in modules:
+    print("Initializing Vosk speech-recognition (from ST request file)")
+    vosk_model_path = (
+        args.stt_vosk_model_path
+        if args.stt_vosk_model_path
+        else None)
+
+    import modules.speech_recognition.vosk_module as vosk_module
+
+    vosk_module.model = vosk_module.load_model(file_path=vosk_model_path)
+    app.add_url_rule("/api/speech-recognition/vosk/process-audio", view_func=vosk_module.process_audio, methods=["POST"])
+
+if "whisper-stt" in modules:
+    print("Initializing Whisper speech-recognition (from ST request file)")
+    whisper_model_path = (
+        args.stt_whisper_model_path
+        if args.stt_whisper_model_path
+        else None)
+
+    import modules.speech_recognition.whisper_module as whisper_module
+
+    whisper_module.model = whisper_module.load_model(file_path=whisper_model_path)
+    app.add_url_rule("/api/speech-recognition/whisper/process-audio", view_func=whisper_module.process_audio, methods=["POST"])
+
+if "streaming-stt" in modules:
+    print("Initializing vosk/whisper speech-recognition (from extras server microphone)")
+    whisper_model_path = (
+        args.stt_whisper_model_path
+        if args.stt_whisper_model_path
+        else None)
+
+    import modules.speech_recognition.streaming_module as streaming_module
+
+    streaming_module.whisper_model, streaming_module.vosk_model = streaming_module.load_model(file_path=whisper_model_path)
+    app.add_url_rule("/api/speech-recognition/streaming/record-and-transcript", view_func=streaming_module.record_and_transcript, methods=["POST"])
+
+if "rvc" in modules:
+    print("Initializing RVC voice conversion (from ST request file)")
+    print("Increasing server upload limit")
+    rvc_save_file = (
+        args.rvc_save_file
+        if args.rvc_save_file
+        else False)
+
+    if rvc_save_file:
+        print("RVC saving file option detected, input/output audio will be saved into data/tmp/ folder")
+
+    sys.path.insert(0, 'modules/voice_conversion')
+
+    import modules.voice_conversion.rvc_module as rvc_module
+    rvc_module.save_file = rvc_save_file
+
+    if "classify" in modules:
+        rvc_module.classification_mode = True
+
+    rvc_module.fix_model_install()
+    app.add_url_rule("/api/voice-conversion/rvc/get-models-list", view_func=rvc_module.rvc_get_models_list, methods=["POST"])
+    app.add_url_rule("/api/voice-conversion/rvc/upload-models", view_func=rvc_module.rvc_upload_models, methods=["POST"])
+    app.add_url_rule("/api/voice-conversion/rvc/process-audio", view_func=rvc_module.rvc_process_audio, methods=["POST"])
+
+if "coqui-tts" in modules:
+    mode = "GPU" if args.coqui_gpu else "CPU"
+    print("Initializing Coqui TTS client in " + mode + " mode")
+    import modules.text_to_speech.coqui.coqui_module as coqui_module
+
+    if mode == "GPU":
+        coqui_module.gpu_mode = True
+
+    coqui_models = (
+        args.coqui_models
+        if args.coqui_models
+        else None
+    )
+
+    if coqui_models is not None:
+        coqui_models = coqui_models.split(",")
+        for i in coqui_models:
+            if not coqui_module.install_model(i):
+                raise ValueError("Coqui model loading failed, most likely a wrong model name in --coqui-models argument, check log above to see which one")
+
+    # Coqui-api models
+    app.add_url_rule("/api/text-to-speech/coqui/coqui-api/check-model-state", view_func=coqui_module.coqui_check_model_state, methods=["POST"])
+    app.add_url_rule("/api/text-to-speech/coqui/coqui-api/install-model", view_func=coqui_module.coqui_install_model, methods=["POST"])
+
+    # Users models
+    app.add_url_rule("/api/text-to-speech/coqui/local/get-models", view_func=coqui_module.coqui_get_local_models, methods=["POST"])
+
+    # Handle both coqui-api/users models
+    app.add_url_rule("/api/text-to-speech/coqui/generate-tts", view_func=coqui_module.coqui_generate_tts, methods=["POST"])
+
+# Read an API key from an already existing file. If that file doesn't exist, create it.
+if args.secure:
+    try:
+        with open("api_key.txt", "r") as txt:
+            api_key = txt.read().replace('\n', '')
+    except Exception:
+        api_key = secrets.token_hex(5)
+        with open("api_key.txt", "w") as txt:
+            txt.write(api_key)
+
+    print(f"{Fore.YELLOW}{Style.BRIGHT}Your API key is {api_key}{Style.RESET_ALL}")
+elif args.share and not args.secure:
+    print(f"{Fore.RED}{Style.BRIGHT}WARNING: This instance is publicly exposed without an API key! It is highly recommended to restart with the \"--secure\" argument!{Style.RESET_ALL}")
+else:
+    print(f"{Fore.YELLOW}{Style.BRIGHT}No API key given because you are running locally.{Style.RESET_ALL}")
 
 if args.share:
     from flask_cloudflared import _run_cloudflared
     import inspect
     sig = inspect.signature(_run_cloudflared)
-    sum = sum(
-        1
-        for param in sig.parameters.values()
-        if param.kind == param.POSITIONAL_OR_KEYWORD
-    )
-    if sum > 1:
+    nparams = sum(1 for param in sig.parameters.values() if param.kind == param.POSITIONAL_OR_KEYWORD)
+    if nparams > 1:
         metrics_port = randint(8100, 9000)
         cloudflare = _run_cloudflared(port, metrics_port)
     else:
diff --git a/talkinghead/Character Card Guide.png b/talkinghead/Character_Card_Guide.png
similarity index 100%
rename from talkinghead/Character Card Guide.png
rename to talkinghead/Character_Card_Guide.png
diff --git a/talkinghead/README.md b/talkinghead/README.md
index a435efc..9577f6a 100644
--- a/talkinghead/README.md
+++ b/talkinghead/README.md
@@ -6,11 +6,21 @@
 - [Talkinghead](#talkinghead)
   - [Introduction](#introduction)
   - [Live mode](#live-mode)
+    - [Configuration](#configuration)
+      - [Emotion templates](#emotion-templates)
+      - [Animator configuration](#animator-configuration)
+      - [Postprocessor configuration](#postprocessor-configuration)
+      - [Postprocessor example: HDR, scifi hologram](#postprocessor-example-hdr-scifi-hologram)
+      - [Postprocessor example: cheap video camera, amber monochrome computer monitor](#postprocessor-example-cheap-video-camera-amber-monochrome-computer-monitor)
+      - [Postprocessor example: HDR, cheap video camera, 1980s VHS tape](#postprocessor-example-hdr-cheap-video-camera-1980s-vhs-tape)
+      - [Complete example: animator and postprocessor settings](#complete-example-animator-and-postprocessor-settings)
   - [Manual poser](#manual-poser)
   - [Troubleshooting](#troubleshooting)
+    - [Low framerate](#low-framerate)
+    - [Low VRAM - what to do?](#low-vram---what-to-do)
     - [Missing model at startup](#missing-model-at-startup)
   - [Creating a character](#creating-a-character)
-  - [Tips for Stable Diffusion](#tips-for-stable-diffusion)
+    - [Tips for Stable Diffusion](#tips-for-stable-diffusion)
   - [Acknowledgements](#acknowledgements)
@@ -19,36 +29,251 @@
 
 This module renders a **live, AI-based custom anime avatar for your AI character**.
 
-The end result is similar to that generated by VTuber software such as *Live2D*, but this works differently. We use the THA3 AI posing engine, which takes **a single static image** of the character as input. It can vary the character's expression, and pose some joints by up to 15 degrees. Modern GPUs have enough compute to do this in realtime.
+In contrast to VTubing software, `talkinghead` is an **AI-based** character animation technology, which produces animation from just **one static 2D image**. This makes creating new characters accessible and cost-effective. All you need is Stable Diffusion and an image editor to get started! Additionally, you can experiment with your character's appearance in an agile way, animating each revision of your design.
 
-This has some implications:
+The animator is built on top of a deep learning model, so optimal performance requires a fast GPU. The model can vary the character's expression, and pose some joints by up to 15 degrees. This allows producing parametric animation on the fly, just like from a traditional 2D or 3D model - but from a small generative AI. Modern GPUs have enough compute to do this in realtime.
 
-- You can produce new characters in a fast and agile manner.
-  - One expression is enough. No need to make 28 manually.
-  - If you need to modify some details in the character's outfit, just edit the image (either manually, or by Stable Diffusion/ControlNet).
-- We can produce parametric animation on the fly, just like from a traditional 2D or 3D model - but the model is a generative AI.
+You only need to provide **one** expression for your character. The model automatically generates the rest of the 28 expressions, and seamlessly animates between them. The expressions are based on *emotion templates*, which are essentially just morph settings. To make it convenient to edit the templates, we provide a GUI editor (the manual poser), where you can see how the resulting expression looks on your character.
 
-As with any AI technology, there are limitations. The AI-generated output image may not be perfect, and in particular the model does not support characters wearing large hats or props. For details (and example outputs), refer to the original author's [tech report](https://web.archive.org/web/20220606125507/https://pkhungurn.github.io/talking-head-anime-3/full.html).
+As with any AI technology, there are limitations. The AI-generated animation frames may not look perfect, and in particular the model does not support characters wearing large hats or props. For details (and many example outputs), refer to the [tech report](https://web.archive.org/web/20220606125507/https://pkhungurn.github.io/talking-head-anime-3/full.html) by the model's original author.
 
-Still images do not do the system justice; the realtime animation is a large part of its appeal. Preferences vary here; but if you have the hardware, try it, you might like it. If you prefer still images, and don't create new characters often, you may get better results by inpainting expression sprites in Stable Diffusion.
+Still images do not do the system justice; the realtime animation is a large part of its appeal. Preferences vary here. If you have the hardware, try it, you might like it. In particular, if you like to make new characters, or tweak your character design often, this is the animator for you. On the other hand, if you prefer still images, and focus on one particular design, you may get more aesthetically pleasing results by inpainting static expression sprites in Stable Diffusion.
+
+Currently, `talkinghead` is focused on providing 1-on-1 interactions with your AI character, so support for group chats and visual novel mode is neither included nor planned. However, as a community-driven project, we appreciate any feedback, and especially code or documentation contributions, toward the growth and development of this extension.
 
 ### Live mode
 
-The live mode is activated by:
+To activate the live mode:
 
-- Loading the `talkinghead` module in *SillyTavern-extras*, and
-- In *SillyTavern* settings, checking the checkbox *Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras)*.
-- Your character must have a `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png` for this to work. You can upload one in the settings.
+- Configure your *SillyTavern-extras* installation so that it loads the `talkinghead` module. This makes the backend available.
+- Ensure that your character has a `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png`. This is the input image for the animator.
+  - You can upload one in the *SillyTavern* settings, in *Extensions ⊳ Character Expressions*.
+- To enable **talkinghead mode** in *Character Expressions*, check the checkbox *Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras)*.
 
-CUDA (*SillyTavern-extras option* `--talkinghead-gpu`) is very highly recommended. As of late 2023, a recent GPU is also recommended.
For example, on a laptop with an RTX 3070 Ti mobile GPU, and the `separable_half` THA3 model (fastest and smallest; default when running on GPU), you can expect ≈40-50 FPS render performance. VRAM usage in this case is about 520 MB. CPU mode exists, but is very slow, about ≈2 FPS on an i7-12700H. +CUDA (*SillyTavern-extras* option `--talkinghead-gpu`) is very highly recommended. As of late 2023, a recent GPU is also recommended. For example, on a laptop with an RTX 3070 Ti mobile GPU, and the `separable_half` THA3 poser model (fastest and smallest; default when running on GPU), you can expect ≈40-50 FPS render performance. VRAM usage in this case is about 520 MB. CPU mode exists, but is very slow, about ≈2 FPS on an i7-12700H. We rate-limit the output to 25 FPS (maximum) to avoid DoSing the SillyTavern GUI, and attempt to reach a constant 25 FPS. If the renderer runs faster, the average GPU usage will be lower, because the animation engine only generates as many frames as are actually consumed. If the renderer runs slower, the latest available frame will be re-sent as many times as needed, to isolate the client side from any render hiccups. -To customize which THA3 model to use, and where to install the THA3 models from, see the `--talkinghead-model=...` and `--talkinghead-models=...` options, respectively. +To customize which model variant of the THA3 poser to use, and where to install the models from, see the `--talkinghead-model=...` and `--talkinghead-models=...` options, respectively. If the directory `talkinghead/tha3/models/` (under the top level of *SillyTavern-extras*) does not exist, the model files are automatically downloaded from HuggingFace and installed there. +#### Configuration + +The live mode is configured per-character, via files **at the client end**: + +- `SillyTavern/public/characters/yourcharacternamehere/talkinghead.png`: required. The input image for the animator. 
+ - The `talkinghead` extension does not use or even see the other `.png` files. They are used by *Character Expressions* when *talkinghead mode* is disabled. +- `SillyTavern/public/characters/yourcharacternamehere/_animator.json`: optional. Animator and postprocessor settings. + - If a character does not have this file, default settings are used. +- `SillyTavern/public/characters/yourcharacternamehere/_emotions.json`: optional. Custom emotion templates. + - If a character does not have this file, default settings are used. Most of the time, there is no need to customize the emotion templates per-character. + - At the client end, only this one file is needed (or even supported) to customize the emotion templates. + +#### Emotion templates + +Emotion templates use the same format as the factory settings in `SillyTavern-extras/talkinghead/emotions/_defaults.json`. The manual poser app included with `talkinghead` is a GUI editor for these templates. + +The batch export of the manual poser produces a set of static expression images (and corresponding emotion templates), but also an `_emotions.json`, in your chosen output folder. You can use this file at the client end as `SillyTavern/public/characters/yourcharacternamehere/_emotions.json`. This is convenient if you have customized your emotion templates, and wish to share one of your characters with other users, making it automatically use your version of the templates. + +Emotion template lookup order is: + +- The set of custom templates sent by the ST client, read from `SillyTavern/public/characters/yourcharacternamehere/_emotions.json` if it exists. +- Server defaults, from the individual files `SillyTavern-extras/talkinghead/emotions/emotionnamehere.json`. + - These are customizable. You can e.g. overwrite `curiosity.json` to change the default template for the *"curiosity"* emotion. 
+ - **IMPORTANT**: *However, updating SillyTavern-extras from git may overwrite your changes to the server-side default emotion templates.* +- Factory settings, from `SillyTavern-extras/talkinghead/emotions/_defaults.json`. + - **IMPORTANT**: Never overwrite or remove this file. + +Any emotion that is missing from a particular level in the lookup order falls through to be looked up at the next level. + +If you want to edit the emotion templates manually (without using the GUI) for some reason, the following may be useful sources of information: + +- `posedict_keys` in [`talkinghead/tha3/app/util.py`](tha3/app/util.py) lists the morphs available in THA3. +- [`talkinghead/tha3/poser/modes/pose_parameters.py`](tha3/poser/modes/pose_parameters.py) contains some more detail. + - *"Arity 2"* means `posedict_keys` has separate left/right morphs. +- The GUI panel implementations in [`talkinghead/tha3/app/manual_poser.py`](tha3/app/manual_poser.py). + +Any morph that is not mentioned for a particular emotion defaults to zero. Thus only those morphs that have nonzero values need to be mentioned. + + +#### Animator configuration + +*The available settings keys and examples are kept up-to-date on a best-effort basis, but there is a risk of this documentation being out of date. When in doubt, refer to the actual source code, which comes with extensive docstrings and comments. The final authoritative source is the implementation itself.* + +The file `SillyTavern/public/characters/yourcharacternamehere/_animator.json` contains the animator and postprocessor settings. For any setting not specified in the file, the default value is used. + +The idea is that this allows giving some personality to different characters; for example, they may sway by different amounts, the breathing cycle duration may be different, and importantly, the postprocessor settings may be different - which allows e.g. making a specific character into a scifi hologram, while others render normally. 
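For example, a character could breathe more slowly and fidget less than the default. A minimal `_animator.json` for that needs to mention only the overridden settings (the values here are just an illustration; the keys are described below):

```
{"breathing_cycle_duration": 6.0,
 "sway_macro_strength": 0.3}
```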
+ +Here is a complete example of `_animator.json`, showing the default values: + +``` +{"target_fps": 25, + "pose_interpolator_step": 0.1, + "blink_interval_min": 2.0, + "blink_interval_max": 5.0, + "blink_probability": 0.03, + "blink_confusion_duration": 10.0, + "talking_fps": 12, + "talking_morph": "mouth_aaa_index", + "sway_morphs": ["head_x_index", "head_y_index", "neck_z_index", "body_y_index", "body_z_index"], + "sway_interval_min": 5.0, + "sway_interval_max": 10.0, + "sway_macro_strength": 0.6, + "sway_micro_strength": 0.02, + "breathing_cycle_duration": 4.0, + "postprocessor_chain": []} +``` + +where: + +- `target_fps`: Desired output frames per second. Note this only affects smoothness of the output (provided that the hardware is fast enough). The speed at which the animation evolves is based on wall time. Snapshots are rendered at the target FPS, or if the hardware is slower, then as often as hardware allows. *Recommendation*: For smooth animation, make the FPS lower than what your hardware could produce, so that some compute remains untapped, available to smooth over the occasional hiccup from other running programs. +- `pose_interpolator_step`: A value such that `0 < step <= 1`. Applied at each frame at a reference of 25 FPS (to standardize the meaning of the setting), with automatic internal FPS-correction to the actual output FPS. Note that the animation is nonlinear. The step controls how much of the *remaining distance* to the current target pose is covered in 1/25 seconds. +- `blink_interval_min`: seconds. After blinking, lower limit for random minimum time until next blink is allowed. +- `blink_interval_max`: seconds. After blinking, upper limit for random minimum time until next blink is allowed. +- `blink_probability`: Applied at each frame at a reference of 25 FPS, with automatic internal FPS-correction to the actual output FPS. This is the probability of initiating a blink in each 1/25 second interval. +- `blink_confusion_duration`: seconds. 
Upon entering the `"confusion"` emotion, the character may blink quickly in succession, temporarily disregarding the blink interval settings. This sets how long that state lasts. +- `talking_fps`: How often to re-randomize the mouth during the talking animation. The default value is based on the fact that early 2000s anime used ~12 FPS as the fastest actual framerate of new cels, not counting camera panning effects and such. +- `talking_morph`: Which mouth-open morph to use for talking. For available values, see `posedict_keys` in [`talkinghead/tha3/app/util.py`](tha3/app/util.py). +- `sway_morphs`: Which morphs participate in the sway (fidgeting) animation. This setting is mainly useful for disabling some or all of them, e.g. for a robot character. For available values, see `posedict_keys` in [`talkinghead/tha3/app/util.py`](tha3/app/util.py). +- `sway_interval_min`: seconds. Lower limit for random time interval until randomizing a new target pose for the sway animation. +- `sway_interval_max`: seconds. Upper limit for random time interval until randomizing a new target pose for the sway animation. +- `sway_macro_strength`: A value such that `0 < strength <= 1`. In the sway target pose, this sets the maximum absolute deviation from the target pose specified by the current emotion, but also the maximum deviation from the center position. The setting is applied to each sway morph separately. The emotion pose itself may use higher values for the morphs; in such cases, sway will only occur toward the center. For details, see `compute_sway_target_pose` in [`talkinghead/tha3/app/app.py`](tha3/app/app.py). +- `sway_micro_strength`: A value such that `0 < strength <= 1`. This is the maximum absolute value of random noise added to the sway target pose at each frame. To this, no limiting is applied, other than a clamp of the final randomized value of each sway morph to the valid range [-1, 1]. A small amount of random jitter makes the character look less robotic. 
+- `breathing_cycle_duration`: seconds. The duration of a full cycle of the breathing animation.
+- `postprocessor_chain`: Pixel-space glitch artistry settings. The default is empty (no postprocessing); see below for examples of what can be done with this. For details, see [`talkinghead/tha3/app/postprocessor.py`](tha3/app/postprocessor.py).
+
+#### Postprocessor configuration
+
+*The available settings keys and examples are kept up-to-date on a best-effort basis, but there is a risk of this documentation being out of date. When in doubt, refer to the actual source code, which comes with extensive docstrings and comments. The final authoritative source is the implementation itself.*
+
+The postprocessor configuration is part of `_animator.json`, stored under the key `"postprocessor_chain"`.
+
+Postprocessing requires some additional compute, depending on the filters used and their settings. When `talkinghead` runs on the GPU, the postprocessor filters run on the GPU as well. In gaming technology terms, they are essentially fragment shaders, implemented in PyTorch.
+
+The filters in the postprocessor chain are applied to the image in the order in which they appear in the list. The filters themselves can be applied in any order. However, for best results, it is useful to keep in mind the process a real physical signal would travel through:
+
+*Light* ⊳ *Camera* ⊳ *Transport* ⊳ *Display*
+
+and set the order for the filters based on that. That said, this does not mean that there is just one correct ordering. Some filters are *general-use*, and may make sense at several points in the chain, depending on what you wish to simulate. Feel free to improvise, but make sure to understand why your filter chain makes sense.
+
+The following postprocessing filters are available. Options for each filter are documented in the docstrings in [`talkinghead/tha3/app/postprocessor.py`](tha3/app/postprocessor.py).
+
+**Light**:
+
+- `bloom`: Bloom effect (fake HDR).
Popular in early 2000s anime. Makes bright parts of the image bleed light into their surroundings, enhancing perceived contrast. Only makes sense when the talkinghead is rendered on a relatively dark background (such as the cyberpunk bedroom in the ST default backgrounds).
+
+**Camera**:
+
+- `chromatic_aberration`: Simulates the two types of chromatic aberration in a camera lens, axial (index of refraction varying w.r.t. wavelength) and transverse (focal distance varying w.r.t. wavelength).
+- `vignetting`: Simulates vignetting, i.e. less light hitting the corners of a film frame or CCD sensor, causing the corners to be slightly darker than the center.
+
+**Transport**:
+
+Currently, we provide some filters that simulate a lo-fi analog video look.
+
+- `analog_lowres`: Simulates a low-resolution analog video signal by blurring the image.
+- `analog_badhsync`: Simulates bad horizontal synchronization (hsync) of an analog video signal, producing a wavy effect that makes the outline of the character ripple.
+- `analog_vhsglitches`: Simulates a damaged 1980s VHS tape. In each frame, causes random lines to glitch with VHS noise.
+- `analog_vhstracking`: Simulates a 1980s VHS tape with bad tracking. The image floats up and down, and a band of VHS noise appears at the bottom.
+
+**Display**:
+
+- `translucency`: Makes the character translucent, as if it were a scifi hologram.
+- `banding`: Simulates the look of a CRT display filmed on video without syncing. Brighter and darker bands travel through the image.
+- `scanlines`: Simulates CRT-TV-like scanlines. Optionally dynamic (flipping the dimmed field at each frame).
+  - From my experiments with the Phosphor deinterlacer in VLC, which implements the same effect, dynamic mode for `scanlines` would look *absolutely magical* when synchronized with display refresh, closely reproducing the look of an actual CRT TV. However, that is not possible here.
Thus, it looks best at low but reasonable FPS, and a very high display refresh rate, so that small timing variations will not make much of a difference in how long a given field is actually displayed on the physical monitor. + - If the timing is too uneven, the illusion breaks. In that case, consider using the static mode (`"dynamic": false`). + +**General use**: + +- `alphanoise`: Adds noise to the alpha channel (translucency). +- `desaturate`: A desaturation filter with bells and whistles. Beside converting the image to grayscale, can optionally pass through colors that match the hue of a given RGB color (e.g. keep red things, while desaturating the rest), and tint the final result (e.g. for an amber monochrome computer monitor look). + +The `alphanoise` filter could represent the display of a lo-fi scifi hologram, as well as noise in an analog video tape (which in this scheme belongs to "transport"). + +The `desaturate` filter could represent either a black and white video camera, or a monochrome display. + +#### Postprocessor example: HDR, scifi hologram + +The bloom works best on a dark background. We use `alphanoise` to add an imperfection to the simulated display device, causing individual pixels to dynamically vary in their alpha. The `banding` and `scanlines` filters complete the look of how holograms are often depicted in scifi video games and movies. The `"dynamic": true` makes the dimmed field (top or bottom) flip each frame, like on a CRT television. + +``` +"postprocessor_chain": [["bloom", {}], + ["translucency", {"alpha": 0.9}], + ["alphanoise", {"magnitude": 0.1, "sigma": 0.0}], + ["banding", {}], + ["scanlines", {"dynamic": true}] + ] +``` + +#### Postprocessor example: cheap video camera, amber monochrome computer monitor + +We first simulate a cheap video camera with low-quality optics via the `chromatic_aberration` and `vignetting` filters. + +We then use `desaturate` with the tint option to produce the amber monochrome look. 
+ +The `banding` and `scanlines` filters suit this look, so we apply them here, too. They could be left out to simulate a higher-quality display device. Setting `"dynamic": false` makes the scanlines stay stationary. + +``` +"postprocessor_chain": [["chromatic_aberration", {}], + ["vignetting", {}], + ["desaturate", {"tint_rgb": [1.0, 0.5, 0.2]}], + ["banding", {}], + ["scanlines", {"dynamic": false}] + ] +``` + +#### Postprocessor example: HDR, cheap video camera, 1980s VHS tape + +After capturing the light with a cheap video camera (just like in the previous example), we simulate the effects of transporting the signal on a 1980s VHS tape. First, we blur the image with `analog_lowres`. Then we apply `alphanoise` with a nonzero `sigma` to make the noise blobs larger than a single pixel, and a rather high `magnitude`. This simulates the brightness noise on a VHS tape. Then we make the image ripple horizontally with `analog_badhsync`, and finally add a bad VHS tracking effect to complete the look. + +Then we again render the output on a simulated CRT TV, as appropriate for the 1980s time period. + +``` +"postprocessor_chain": [["bloom", {}], + ["chromatic_aberration", {}], + ["vignetting", {}], + ["analog_lowres", {}], + ["alphanoise", {"magnitude": 0.2, "sigma": 2.0}], + ["analog_badhsync", {}], + ["analog_vhstracking", {}], + ["banding", {}], + ["scanlines", {"dynamic": true}] + ] +``` + +#### Complete example: animator and postprocessor settings + +This example uses the default values for the animator (to give a template that is easy to tune), but sets up the postprocessor as in the "scifi hologram" example above. + +To use this, save this **at the client end** as `SillyTavern/public/characters/yourcharacternamehere/_animator.json` (i.e. as `_animator.json`, in the same folder where the sprites for your character are), and make `talkinghead` reload your character (for example, by toggling `talkinghead` off and back on in the SillyTavern settings). 
+ +```json +{"target_fps": 25, + "pose_interpolator_step": 0.1, + "blink_interval_min": 2.0, + "blink_interval_max": 5.0, + "blink_probability": 0.03, + "blink_confusion_duration": 10.0, + "talking_fps": 12, + "talking_morph": "mouth_aaa_index", + "sway_morphs": ["head_x_index", "head_y_index", "neck_z_index", "body_y_index", "body_z_index"], + "sway_interval_min": 5.0, + "sway_interval_max": 10.0, + "sway_macro_strength": 0.6, + "sway_micro_strength": 0.02, + "breathing_cycle_duration": 4.0, + "postprocessor_chain": [["bloom", {}], + ["translucency", {"alpha": 0.9}], + ["alphanoise", {"magnitude": 0.1, "sigma": 0.0}], + ["banding", {}], + ["scanlines", {"dynamic": true}] + ] +} +``` + ### Manual poser @@ -74,6 +299,7 @@ To run the manual poser: - `conda activate extras` - `python -m tha3.app.manual_poser`. - For systems with `bash`, a convenience wrapper `./start_manual_poser.sh` is included. + Run the poser with the `--help` option for a description of its command-line options. The command-line options of the manual poser are **completely independent** from the options of *SillyTavern-extras* itself. Currently, you can choose the device to run on (GPU or CPU), and which THA3 model to use. By default, the manual poser uses GPU and the `separable_float` model. @@ -85,9 +311,23 @@ To load a PNG image or emotion JSON, you can either use the buttons, their hotke ### Troubleshooting +#### Low framerate + +The poser is a deep-learning model. Each animation frame requires an inference pass. This requires lots of compute. + +Thus, if you have a CUDA-capable GPU, enable GPU support by using the `--talkinghead-gpu` setting of *SillyTavern-extras*. + +CPU mode is very slow, and without a redesign of the AI model (or distillation, like in the newer [THA4 paper](https://arxiv.org/abs/2311.17409)), there is not much that can be done. It is already running as fast as PyTorch can go, and the performance impact of everything except the posing engine is almost negligible. 
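As a back-of-the-envelope illustration (hypothetical timings, merely in the ballpark of the live-mode figures quoted above): since the poser performs one inference pass per frame, the inference time directly bounds the achievable framerate.

```python
# One inference pass per frame => the frame time bounds the framerate.
# Timings are hypothetical ballpark figures, not measurements.
def max_fps(inference_seconds: float) -> float:
    """Upper bound on framerate if each frame costs `inference_seconds`."""
    return 1.0 / inference_seconds

print(max_fps(0.020))  # GPU-class inference: ceiling of 50 FPS
print(max_fps(0.500))  # CPU-class inference: ceiling of 2 FPS
```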
+ +#### Low VRAM - what to do? + +Observe that the `--talkinghead-gpu` setting is independent of the CUDA device setting of the rest of *SillyTavern-extras*. + +So in a low-VRAM environment such as a gaming laptop, you can run just `talkinghead` on the GPU (VRAM usage about 520 MB) to get acceptable animation performance, while running all other extras modules on the CPU. The `classify` or `summarize` AI modules do not require realtime performance, whereas `talkinghead` does. + #### Missing model at startup -The `separable_float` variant of the THA3 models was previously included in the *SillyTavern-extras* repository. However, this was recently (December 2023) changed to download these models from HuggingFace if necessary, so a local copy of the model is no longer provided. +The `separable_float` variant of the THA3 models was previously included in the *SillyTavern-extras* repository. However, `talkinghead` was recently (December 2023) changed to download these models from HuggingFace if necessary, so a local copy of the model is no longer provided in the repository. Therefore, if you updated your *SillyTavern-extras* installation from *git*, it is likely that *git* deleted your local copy of that particular model, leading to an error message like: @@ -95,9 +335,9 @@ Therefore, if you updated your *SillyTavern-extras* installation from *git*, it FileNotFoundError: Model file /home/xxx/SillyTavern-extras/talkinghead/tha3/models/separable_float/eyebrow_decomposer.pt not found, please check the path. ``` -The solution is to remove (or rename) your `SillyTavern-extras/talkinghead/tha3/models` directory, and try again. If that directory does not exist, `talkinghead` will download the models at the first run. +The solution is to remove (or rename) your `SillyTavern-extras/talkinghead/tha3/models` directory, and restart *SillyTavern-extras*. If the model directory does not exist, `talkinghead` will download the models at the first run. 
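A sketch of that cleanup in shell (renaming rather than deleting, so the old files can be restored if anything goes wrong). For illustration, this runs on a temporary mock layout; on a real install, point `EXTRAS_DIR` at your *SillyTavern-extras* directory instead.

```shell
# Retire a stale talkinghead model directory so that the models are
# re-downloaded at the next run. EXTRAS_DIR is a temporary mock layout
# here; substitute your actual SillyTavern-extras path.
EXTRAS_DIR="$(mktemp -d)"
MODELS_DIR="$EXTRAS_DIR/talkinghead/tha3/models"
mkdir -p "$MODELS_DIR"                    # stand-in for the existing model dir

if [ -d "$MODELS_DIR" ]; then
    mv "$MODELS_DIR" "$MODELS_DIR.bak"    # rename rather than delete, to be safe
fi
```

After the directory is out of the way, restart *SillyTavern-extras*; `talkinghead` should then download fresh copies of the models.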
-The models are shared between the live mode and the manual poser, so it doesn't matter which one you run first. +The models are actually shared between the live mode and the manual poser, so it doesn't matter which one you run first. ### Creating a character @@ -107,22 +347,25 @@ To create an AI avatar that `talkinghead` understands: - The image must be of size 512x512, in PNG format. - **The image must have an alpha channel**. - Any pixel with nonzero alpha is part of the character. - - If the edges of the silhouette look like a cheap photoshop job, check them manually for background bleed. + - If the edges of the silhouette look like a cheap photoshop job (especially when ST renders the character on a different background), check them manually for background bleed. - Using any method you prefer, create a front view of your character within [these specifications](Character_Card_Guide.png). - In practice, you can create an image of the character in the correct pose first, and align it as a separate step. - If you use Stable Diffusion, see separate section below. + - **IMPORTANT**: *The character's eyes and mouth must be open*, so that the model sees what they look like when open. + - See [the THA3 example character](tha3/images/example.png). + - If that's easier to produce, an open-mouth smile also works. - To add an alpha channel to an image that has the character otherwise fine, but on a background: - In Stable Diffusion, you can try the [rembg](https://github.com/AUTOMATIC1111/stable-diffusion-webui-rembg) extension for Automatic1111 to get a rough first approximation. - Also, you can try the *Fuzzy Select* (magic wand) tool in traditional image editors such as GIMP or Photoshop. - Manual pixel-per-pixel editing of edges is recommended for best results. Takes about 20 minutes per character. - If you rendered the character on a light background, use a dark background layer when editing the edges, and vice versa. 
  - This makes it much easier to see which pixels have background bleed and need to be erased.
-- Finally, align the character on the canvas.
+- Finally, align the character on the canvas to conform to the placement the THA3 posing engine expects.
   - We recommend using [the THA3 example character](tha3/images/example.png) as an alignment template.
 - **IMPORTANT**: Export the final edited image, *without any background layer*, as a PNG with an alpha channel.
 - Load up the result into *SillyTavern* as a `talkinghead.png`, and see how well it performs.
 
-### Tips for Stable Diffusion
+#### Tips for Stable Diffusion
 
 It is possible to create a suitable character render with Stable Diffusion. We assume that you already have a local installation of the [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) webui.
 
@@ -177,6 +420,6 @@ One possible solution is to ask for a `full body shot`, and *txt2img* for a good
 
 ### Acknowledgements
 
-This software incorporates the [THA3](https://github.com/pkhungurn/talking-head-anime-3-demo) AI-based anime posing engine developed by Pramook Khungurn. The THA3 code is used under the MIT license, and the THA3 AI models are used under the Creative Commons Attribution 4.0 International license. The THA3 example character is used under the Creative Commons Attribution-NonCommercial 4.0 International license.
+This software incorporates the [THA3](https://github.com/pkhungurn/talking-head-anime-3-demo) AI-based anime posing engine developed by Pramook Khungurn. The THA3 code is used under the MIT license, and the THA3 AI models are used under the Creative Commons Attribution 4.0 International license. The THA3 example character is used under the Creative Commons Attribution-NonCommercial 4.0 International license. The trained models are currently mirrored [on HuggingFace](https://huggingface.co/OktayAlpk/talking-head-anime-3).
-The manual poser code has been mostly rewritten, and the live mode code is original to this software. +In this software, the manual poser code has been mostly rewritten, and the live mode code is original to `talkinghead`. diff --git a/talkinghead/TODO.md b/talkinghead/TODO.md index e220958..2e0c980 100644 --- a/talkinghead/TODO.md +++ b/talkinghead/TODO.md @@ -2,25 +2,11 @@ ### Live mode -- Add optional per-character configuration - - At client end, JSON files in `SillyTavern/public/characters/characternamehere/` - - Pass the data all the way here (from ST client, to ST server, to ST-extras server, to talkinghead module) - - Configuration: - - Target FPS (default 25.0) - - Postprocessor effect chain (including settings) - - Animation parameters (ideally per character) - - Blink timing: `blink_interval` min/max (when randomizing the next blink timing) - - Blink probability per frame - - "confusion" emotion initial segment duration (where blinking quickly in succession is allowed) - - Sway timing: `sway_interval` min/max (when randomizing the next sway timing) - - Sway strength (`max_random`, `max_noise`) - - Breathing cycle duration - - Emotion templates - - One JSON file per emotion, like for the server default templates? This format is easily produced by the manual poser GUI tool. - - Could be collected by the client into a single JSON for sending. - - Need also global defaults - - These could live at the SillyTavern-extras server end - - Still, don't hardcode, but read from JSON file, to keep easily configurable +- Add a server-side config for animator and postprocessor settings. + - For symmetry with emotion handling; but also foreseeable that target FPS is an installation-wide thing instead of a character-wide thing. + Currently we don't have a way to set it installation-wide. +- Fix timing of microsway based on 25 FPS reference. +- Fix timing of dynamic postprocessor effects, these should also use a 25 FPS reference. 
- Add live-modifiable configuration for animation and postprocessor settings? - Add a new control panel to SillyTavern client extension settings - Send new configs to backend whenever anything changes @@ -41,7 +27,7 @@ - Investigate if some particular emotions could use a small random per-frame oscillation applied to "iris_small", for that anime "intense emotion" effect (since THA3 doesn't have a morph specifically for the specular reflections in the eyes). -### Client-side bugs / missing features: +### Client side - If `classify` is enabled, emotion state should be updated from the latest AI-generated text when switching chat files, to resume in the same emotion state where the chat left off. then the "set_emotion" endpoint. - When a new talkinghead sprite is uploaded: + - The preview thumbnail in the client doesn't update. +- Not related to talkinghead, but a client bug that came up during testing: in *Manage chat files*, when using the search feature, + clicking on a search result either does nothing, or opens the wrong chat. When not searching, clicking on a previous chat + correctly opens that specific chat. +- Are there other places in *Character Expressions* (`SillyTavern/public/scripts/extensions/expressions/index.js`) + where we need to check whether the `talkinghead` module is enabled? `(!isTalkingHeadEnabled() || !modules.includes('talkinghead'))` +- Check whether zip upload refreshes the talkinghead character (it should). ### Common -- Add pictures to the README. +- Add pictures to the talkinghead README. - Screenshot of the manual poser. Anything else the user needs to know about it? - Examples of generated poses, highlighting both success and failure cases. How the live talking head looks in the actual SillyTavern GUI. -- Document postprocessor filters and their settings in the README, with example pictures. + - Examples of postprocessor filter results. - Merge appropriate material from old user manual into the new README.
-- Update the user manual. +- Update/rewrite the user manual, based on the new README. - Far future: - - Lip-sync talking animation to TTS output (need realtime data from client) - - THA3 has morphs for A, I, U, E, O, and the "mouth delta" shape Δ. + - To save GPU resources, automatically pause animation when the web browser window with SillyTavern is not in focus. Resume when it regains focus. + - Needs a new API endpoint for pause/resume. Note the current `/api/talkinghead/unload` is actually a pause function (the client pauses, and + then just hides the live image), but there is currently no resume function (except `/api/talkinghead/load`, which requires sending an image file). - Fast, high-quality scaling mechanism. - On a 4k display, the character becomes rather small, which looks jarring on the default backgrounds. - The algorithm should be cartoon-aware, some modern-day equivalent of waifu2x. A GAN such as 4x-AnimeSharp or Remacri would be nice, but too slow. - Maybe the scaler should run at the client side to avoid the need to stream 1024x1024 PNGs. - What JavaScript anime scalers are there, or which algorithms are simple enough for a small custom implementation? + - Lip-sync talking animation to TTS output. + - THA3 has morphs for A, I, U, E, O, and the "mouth delta" shape Δ. + - This needs either: + - Realtime data from client + - Or if ST-extras generates the TTS output, then at least a start timestamp for the playback of a given TTS output audio file, + and a possibility to stop animating if the user stops the audio. + - Postprocessor for static character expression sprites. + - This would need reimplementing the static sprite system at the `talkinghead` end (so that we can apply per-frame dynamic postprocessing), + and then serving that as `result_feed`. - Group chats / visual novel mode / several talkingheads running simultaneously. 
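[Editor's sketch, not part of the patch.] The server-side animator/postprocessor config proposed in the TODO above could be a single JSON file shaped like this. This is a hypothetical example: the key names follow the `animator_defaults` dict that this patch adds to `talkinghead/tha3/app/app.py`, the values shown are just the defaults, and the file's location and name remain an open design question.

```json
{
    "target_fps": 25,
    "pose_interpolator_step": 0.1,
    "blink_interval_min": 2.0,
    "blink_interval_max": 5.0,
    "blink_probability": 0.03,
    "blink_confusion_duration": 10.0,
    "talking_fps": 12,
    "talking_morph": "mouth_aaa_index",
    "sway_morphs": ["head_x_index", "head_y_index", "neck_z_index", "body_y_index", "body_z_index"],
    "sway_interval_min": 5.0,
    "sway_interval_max": 10.0,
    "sway_macro_strength": 0.6,
    "sway_micro_strength": 0.02,
    "breathing_cycle_duration": 4.0,
    "postprocessor_chain": []
}
```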
diff --git a/talkinghead/tha3/app/app.py b/talkinghead/tha3/app/app.py index 18f4335..603c5ad 100644 --- a/talkinghead/tha3/app/app.py +++ b/talkinghead/tha3/app/app.py @@ -23,7 +23,7 @@ import sys import time import numpy as np import threading -from typing import Dict, List, NoReturn, Optional, Union +from typing import Any, Dict, List, NoReturn, Optional, Union import PIL @@ -45,6 +45,36 @@ logger = logging.getLogger(__name__) # -------------------------------------------------------------------------------- # Global variables +# Default configuration for the animator, loaded when the plugin is launched. +# Doubles as the authoritative documentation of the animator settings (besides the animation driver docstrings and the actual source code). +animator_defaults = {"target_fps": 25, # Desired output frames per second. Note this only affects smoothness of the output (if hardware allows). + # The speed at which the animation evolves is based on wall time. Snapshots are rendered at the target FPS, + # or if the hardware is too slow to reach the target FPS, then as often as hardware allows. + # For smooth animation, make the FPS lower than what your hardware could produce, so that some compute + # remains untapped, available to smooth over the occasional hiccup from other running programs. + "pose_interpolator_step": 0.1, # 0 < this <= 1; applied at each frame at a reference of 25 FPS; FPS-corrected automatically; see `interpolate_pose`. + + "blink_interval_min": 2.0, # seconds, lower limit for random minimum time until next blink is allowed. + "blink_interval_max": 5.0, # seconds, upper limit for random minimum time until next blink is allowed. + "blink_probability": 0.03, # At each frame at a reference of 25 FPS; FPS-corrected automatically. + "blink_confusion_duration": 10.0, # seconds, upon entering "confusion" emotion, during which blinking quickly in succession is allowed. + + "talking_fps": 12, # How often to re-randomize mouth during talking animation.
+ # Early 2000s anime used ~12 FPS as the fastest actual framerate of new cels (not counting camera panning effects and such). + "talking_morph": "mouth_aaa_index", # which mouth-open morph to use for talking; for available values, see `posedict_keys` + + "sway_morphs": ["head_x_index", "head_y_index", "neck_z_index", "body_y_index", "body_z_index"], # which morphs to sway; see `posedict_keys` + "sway_interval_min": 5.0, # seconds, lower limit for random time interval until randomizing new sway pose. + "sway_interval_max": 10.0, # seconds, upper limit for random time interval until randomizing new sway pose. + "sway_macro_strength": 0.6, # [0, 1], in sway pose, max abs deviation from emotion pose target morph value for each sway morph, + # but also max deviation from center. The emotion pose itself may use higher values; in such cases, + # sway will only occur toward the center. See `compute_sway_target_pose` for details. + "sway_micro_strength": 0.02, # [0, 1], max abs random noise added each frame. No limiting other than a clamp of final pose to [-1, 1]. + + "breathing_cycle_duration": 4.0, # seconds, for a full breathing cycle. + + "postprocessor_chain": []} # Pixel-space glitch artistry settings; see `postprocessor.py`. + talkinghead_basedir = "talkinghead" global_animator_instance = None @@ -61,7 +91,7 @@ current_emotion = "neutral" is_talking = False global_reload_image = None -TARGET_FPS = 25 +target_fps = 25 # value overridden by `load_animator_settings` at animator startup # -------------------------------------------------------------------------------- # API @@ -199,7 +229,7 @@ def result_feed() -> Response: # - Excessive spamming can DoS the SillyTavern GUI, so there needs to be a rate limit. # - OTOH, we must constantly send something, or the GUI will lock up waiting. # Therefore, send at a target FPS that yields a nice-looking animation. 
- frame_duration_target_sec = 1 / TARGET_FPS + frame_duration_target_sec = 1 / target_fps if last_frame_send_complete_time is not None: time_now = time.time_ns() this_frame_elapsed_sec = (time_now - last_frame_send_complete_time) / 10**9 @@ -232,7 +262,7 @@ def result_feed() -> Response: msec = round(1000 * avg_send_sec, 1) target_msec = round(1000 * frame_duration_target_sec, 1) fps = round(1 / avg_send_sec, 1) if avg_send_sec > 0.0 else 0.0 - logger.info(f"output: {msec:.1f}ms [{fps:.1f} FPS]; target {target_msec:.1f}ms [{TARGET_FPS:.1f} FPS]") + logger.info(f"output: {msec:.1f}ms [{fps:.1f} FPS]; target {target_msec:.1f}ms [{target_fps:.1f} FPS]") last_report_time = time_now else: # first frame not yet available @@ -281,6 +311,7 @@ def launch(device: str, model: str) -> Union[None, NoReturn]: global_encoder_instance.exit() global_encoder_instance = None + logger.info("launch: loading the THA3 posing engine") poser = load_poser(model, device, modelsdir=os.path.join(talkinghead_basedir, "tha3", "models")) global_animator_instance = Animator(poser, device) global_encoder_instance = Encoder() @@ -312,8 +343,6 @@ class Animator: self.poser = poser self.device = device - self.reset_animation_state() - self.postprocessor = Postprocessor(device) self.render_duration_statistics = RunningAverage() self.animator_thread = None @@ -323,7 +352,9 @@ class Animator: self.new_frame_available = False self.last_report_time = None - self.emotions, self.emotion_names = load_emotion_presets(os.path.join("talkinghead", "emotions")) + self.reset_animation_state() + self.load_emotion_templates() + self.load_animator_settings() # -------------------------------------------------------------------------------- # Management @@ -372,6 +403,92 @@ class Animator: self.breathing_epoch = time.time_ns() + def load_emotion_templates(self, emotions: Optional[Dict[str, Dict[str, float]]] = None) -> None: + """Load emotion templates. 
+ + `emotions`: `{emotion0: {morph0: value0, ...}, ...}` + Optional dict of custom emotion templates. + + If not given, this loads the templates from the emotion JSON files + in `talkinghead/emotions/`. + + If given: + - Each emotion NOT supplied is populated from the defaults. + - In each emotion that IS supplied, each morph that is NOT mentioned + is implicitly set to zero (due to how `apply_emotion_to_pose` works). + + For an example JSON file containing a suitable dictionary, see `talkinghead/emotions/_defaults.json`. + + For available morph names, see `posedict_keys` in `talkinghead/tha3/app/util.py`. + + For some more detail, see `talkinghead/tha3/poser/modes/pose_parameters.py`. + "Arity 2" means `posedict_keys` has separate left/right morphs. + + If still in doubt, see the GUI panel implementations in `talkinghead/tha3/app/manual_poser.py`. + """ + # Load defaults as a base + self.emotions, self.emotion_names = load_emotion_presets(os.path.join("talkinghead", "emotions")) + + # Then override defaults, and add any new custom emotions + if emotions is not None: + logger.info(f"load_emotion_templates: loading user-specified templates for emotions {list(sorted(emotions.keys()))}") + + self.emotions.update(emotions) + + emotion_names = set(self.emotion_names) + emotion_names.update(emotions.keys()) + self.emotion_names = list(sorted(emotion_names)) + else: + logger.info("load_emotion_templates: loaded default emotion templates") + + def load_animator_settings(self, settings: Optional[Dict[str, Any]] = None) -> None: + """Load animator settings. + + `settings`: `{setting0: value0, ...}` + Optional dict of settings. The type and semantics of each value depends on each + particular setting. + + For available settings, see `animator_defaults` in `talkinghead/tha3/app/app.py`. + + Particularly for the setting `"postprocessor_chain"` (pixel-space glitch artistry), + see `talkinghead/tha3/app/postprocessor.py`. 
+ """ + global target_fps + + if settings is None: + settings = {} + + logger.info(f"load_animator_settings: user-provided settings: {settings}") + + # Warn about unknown settings (not an error, to allow running a newer client on an older server that might support only a subset of the keys the client knows about) + if settings: + unknown_fields = [field for field in settings if field not in animator_defaults] + if unknown_fields: + logger.warning(f"load_animator_settings: unknown keys in user-provided settings; maybe client is newer than server? List follows: {unknown_fields}") + + # Set default values for any settings not provided + for field, default_value in animator_defaults.items(): + type_match = (int, float) if isinstance(default_value, (int, float)) else type(default_value) + if field in settings and not isinstance(settings[field], type_match): + logger.warning(f"Ignoring invalid setting for '{field}': got {type(settings[field])} with value '{settings[field]}', expected {type_match}") + continue + if field not in settings: + settings[field] = default_value + + logger.info(f"load_animator_settings: final settings (filled in from defaults as necessary): {settings}") + + # Some settings must be applied explicitly. + settings = dict(settings) # copy to avoid modifying the original, since we'll pop some stuff. + + logger.debug(f"load_animator_settings: Setting new target FPS = {settings['target_fps']}") + target_fps = settings.pop("target_fps") # global variable, controls the network send rate. + + logger.debug("load_animator_settings: Sending new effect chain to postprocessor") + self.postprocessor.chain = settings.pop("postprocessor_chain") # ...and that's where the postprocessor reads its filter settings from. + + # The rest of the settings we can just store in an attribute, and let the animation drivers read them from there. 
+ self._settings = settings + def load_image(self, file_path=None) -> None: """Load the image file at `file_path`, and replace the current character with it. @@ -430,20 +547,26 @@ class Animator: def animate_blinking(self, pose: List[float]) -> List[float]: """Eye blinking animation driver. + Relevant `self._settings` keys: + + `"blink_interval_min"`: float, seconds, lower limit for random minimum time until next blink is allowed. + `"blink_interval_max"`: float, seconds, upper limit for random minimum time until next blink is allowed. + `"blink_probability"`: float, at each frame at a reference of 25 FPS. FPS-corrected automatically. + `"blink_confusion_duration"`: float, seconds, upon entering "confusion" emotion, during which blinking + quickly in succession is allowed. + Return the modified pose. """ - # should_blink = (random.random() <= 0.03) - # Compute FPS-corrected blink probability CALIBRATION_FPS = 25 - p_orig = 0.03 # blink probability per frame at CALIBRATION_FPS + p_orig = self._settings["blink_probability"] # blink probability per frame at CALIBRATION_FPS avg_render_sec = self.render_duration_statistics.average() if avg_render_sec > 0: avg_render_fps = 1 / avg_render_sec # Even if render completes faster, the `talkinghead` output is rate-limited to `target_fps` at most. - avg_render_fps = min(avg_render_fps, TARGET_FPS) + avg_render_fps = min(avg_render_fps, target_fps) else: # No statistics available yet; let's assume we're running at `target_fps`. - avg_render_fps = TARGET_FPS + avg_render_fps = target_fps # Note direction: rendering faster (higher FPS) means less likely to blink per frame (to obtain the same blink density per unit of wall time) n = CALIBRATION_FPS / avg_render_fps # We give an independent trial for each of `n` (fictitious) frames elapsed at `CALIBRATION_FPS` during one actual frame at `avg_render_fps`. 
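[Editor's sketch, not part of the patch.] The FPS correction described in the blink driver's comments above can be captured in a tiny standalone helper (our own illustrative function, not code from this patch): each actual frame at `avg_render_fps` corresponds to `n = CALIBRATION_FPS / avg_render_fps` fictitious calibration-rate frames, each of which gets an independent trial at probability `p_orig`, so the per-frame probability that at least one trial fires is:

```python
CALIBRATION_FPS = 25.0  # FPS at which the default blink probability was calibrated

def fps_corrected_probability(p_orig: float, avg_render_fps: float) -> float:
    """Per-frame probability at `avg_render_fps` that preserves the same
    expected event density per second of wall time as `p_orig` at CALIBRATION_FPS."""
    n = CALIBRATION_FPS / avg_render_fps  # fictitious calibration-rate frames per actual frame
    # P(at least one success in n independent trials), generalized to fractional n.
    return 1.0 - (1.0 - p_orig) ** n
```

Note the direction matches the comment above: rendering faster (higher FPS) yields a smaller per-frame probability, so blinks per second of wall time stay constant.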
@@ -459,7 +582,7 @@ class Animator: if self.blink_interval is not None: # ...except when the "confusion" emotion has been entered recently. seconds_since_last_emotion_change = (time_now - self.last_emotion_change_timestamp) / 10**9 - if current_emotion == "confusion" and seconds_since_last_emotion_change < 10.0: + if current_emotion == "confusion" and seconds_since_last_emotion_change < self._settings["blink_confusion_duration"]: pass else: seconds_since_last_blink = (time_now - self.last_blink_timestamp) / 10**9 @@ -477,13 +600,24 @@ class Animator: # Typical for humans is 12...20 times per minute, i.e. an interval of 5...3 seconds. self.last_blink_timestamp = time_now - self.blink_interval = random.uniform(2.0, 5.0) # seconds; duration of this blink before the next one can begin + self.blink_interval = random.uniform(self._settings["blink_interval_min"], + self._settings["blink_interval_max"]) # seconds; duration of this blink before the next one can begin return new_pose def animate_talking(self, pose: List[float], target_pose: List[float]) -> List[float]: """Talking animation driver. + Relevant `self._settings` keys: + + `"talking_fps"`: float, how often to re-randomize mouth during talking animation. + Early 2000s anime used ~12 FPS as the fastest actual framerate of + new cels (not counting camera panning effects and such). + `"talking_morph"`: str, see `posedict_keys` for available values. + Which morph to use for opening and closing the mouth during talking. + Any other morphs in the mouth-open group are set to zero while + talking is in progress. + Works by randomizing the mouth-open state at regular intervals. When talking ends, the mouth immediately snaps to its position in the target pose
""" MOUTH_OPEN_MORPHS = ["mouth_aaa_index", "mouth_iii_index", "mouth_uuu_index", "mouth_eee_index", "mouth_ooo_index", "mouth_delta"] - TALKING_MORPH = "mouth_aaa_index" + talking_morph = self._settings["talking_morph"] if not is_talking: try: @@ -511,7 +645,7 @@ class Animator: # With 25 FPS (or faster) output, randomizing the mouth every frame looks too fast. # Determine whether enough wall time has passed to randomize a new mouth position. - TARGET_SEC = 1 / 12 # Early 2000s anime used ~12 FPS as the fastest actual framerate of new cels (not counting camera panning effects and such). + TARGET_SEC = 1 / self._settings["talking_fps"] # rate of "actual new cels" in talking animation time_now = time.time_ns() update_mouth = False if self.last_talking_timestamp is None: @@ -523,7 +657,7 @@ class Animator: # Apply the mouth open morph new_pose = list(pose) # copy - idx = posedict_key_to_index[TALKING_MORPH] + idx = posedict_key_to_index[talking_morph] if self.last_talking_target_value is None or update_mouth: # Randomize new mouth position x = pose[idx] @@ -538,7 +672,7 @@ class Animator: # Zero out other morphs that affect mouth open/closed state. for key in MOUTH_OPEN_MORPHS: - if key == TALKING_MORPH: + if key == talking_morph: continue idx = posedict_key_to_index[key] new_pose[idx] = 0.0 @@ -549,9 +683,25 @@ class Animator: def compute_sway_target_pose(self, original_target_pose: List[float]) -> List[float]: """History-free sway animation driver. - original_target_pose: emotion pose to modify with a randomized sway target + `original_target_pose`: emotion pose to modify with a randomized sway target - The target is randomized again when necessary; this takes care of caching internally. + Relevant `self._settings` keys: + + `"sway_morphs"`: List[str], which morphs can sway. By default, this is all geometric transformations, + but disabling some can be useful for some characters (such as robots). + For available values, see `posedict_keys`. 
+ `"sway_interval_min"`: float, seconds, lower limit for random time interval until randomizing new sway pose. + `"sway_interval_max"`: float, seconds, upper limit for random time interval until randomizing new sway pose. + Note the limits are ignored when `original_target_pose` changes (then immediately refreshing + the sway pose), because an emotion pose may affect the geometric transformations, too. + `"sway_macro_strength"`: float, [0, 1]. In sway pose, max abs deviation from emotion pose target morph value + for each sway morph, but also max deviation from center. The `original_target_pose` + itself may use higher values; in such cases, sway will only occur toward the center. + See the source code of this function for the exact details. + `"sway_micro_strength"`: float, [0, 1]. Max abs random noise to sway target pose, added each frame, to make + the animation look less robotic. No limiting other than a clamp of final pose to [-1, 1]. + + The sway target pose is randomized again when necessary; this takes care of caching internally. Return the modified pose. """ @@ -563,10 +713,9 @@ class Animator: # slowing down when we approach the target state. # As documented in the original THA tech reports, on the pose axes, zero is centered, and 1.0 = 15 degrees. - random_max = 0.6 # max sway magnitude from center position of each morph - noise_max = 0.02 # amount of dynamic noise (re-generated every frame), added on top of the sway target - - SWAYPARTS = ["head_x_index", "head_y_index", "neck_z_index", "body_y_index", "body_z_index"] + random_max = self._settings["sway_macro_strength"] # max sway magnitude from center position of each morph + noise_max = self._settings["sway_micro_strength"] # amount of dynamic noise (re-generated every frame), added on top of the sway target, no clamping except to [-1, 1] + SWAYPARTS = self._settings["sway_morphs"] # some characters might not sway on all axes (e.g. 
a robot) def macrosway() -> List[float]: # this handles caching and everything time_now = time.time_ns() @@ -600,7 +749,8 @@ class Animator: self.last_sway_target_pose = new_target_pose self.last_sway_target_timestamp = time_now - self.sway_interval = random.uniform(5.0, 10.0) # seconds; duration of this sway target before randomizing new one + self.sway_interval = random.uniform(self._settings["sway_interval_min"], + self._settings["sway_interval_max"]) # seconds; duration of this sway target before randomizing new one return new_target_pose # Add dynamic noise (re-generated every frame) to the target to make the animation look less robotic, especially once we are near the target pose. @@ -618,9 +768,13 @@ class Animator: def animate_breathing(self, pose: List[float]) -> List[float]: """Breathing animation driver. + Relevant `self._settings` keys: + + `"breathing_cycle_duration"`: seconds. Duration of one full breathing cycle. + Return the modified pose. """ - breathing_cycle_duration = 4.0 # seconds + breathing_cycle_duration = self._settings["breathing_cycle_duration"] # seconds time_now = time.time_ns() t = (time_now - self.breathing_epoch) / 10**9 # seconds since breathing-epoch @@ -634,10 +788,14 @@ class Animator: new_pose[idx] = math.sin(cycle_pos * math.pi)**2 # 0 ... 1 ... 0, smoothly, with slow start and end, fast middle return new_pose - def interpolate_pose(self, pose: List[float], target_pose: List[float], step: float = 0.1) -> List[float]: + def interpolate_pose(self, pose: List[float], target_pose: List[float]) -> List[float]: """Interpolate from current `pose` toward `target_pose`. - `step`: [0, 1]; how far toward `target_pose` to interpolate. 0 is fully `pose`, 1 is fully `target_pose`. + Relevant `self._settings` keys: + + `"pose_interpolator_step"`: [0, 1]; how far toward `target_pose` to interpolate in one frame, + assuming a reference of 25 FPS. This is FPS-corrected automatically. + 0 is fully `pose`, 1 is fully `target_pose`. 
This is a kind of history-free rate-based formulation, which needs only the current and target poses, and the step size; there is no need to keep track of e.g. the initial pose or the progress along the trajectory. @@ -786,15 +944,16 @@ class Animator: # CALIBRATION_FPS = 25 # FPS for which the default value `step` was calibrated xrel = 0.5 # just some convenient value + step = self._settings["pose_interpolator_step"] alpha_orig = 1.0 - step if 0 < alpha_orig < 1: avg_render_sec = self.render_duration_statistics.average() if avg_render_sec > 0: avg_render_fps = 1 / avg_render_sec - # Even if render completes faster, the `talkinghead` output is rate-limited to `TARGET_FPS` at most. - avg_render_fps = min(avg_render_fps, TARGET_FPS) - else: # No statistics available yet; let's assume we're running at `TARGET_FPS`. - avg_render_fps = TARGET_FPS + # Even if render completes faster, the `talkinghead` output is rate-limited to `target_fps` at most. + avg_render_fps = min(avg_render_fps, target_fps) + else: # No statistics available yet; let's assume we're running at `target_fps`. + avg_render_fps = target_fps # For a constant target pose and original `α`, compute the number of animation frames to cover `xrel` of distance from initial pose to final pose. n_orig = math.log(1.0 - xrel) / math.log(alpha_orig) diff --git a/talkinghead/tha3/app/manual_poser.py b/talkinghead/tha3/app/manual_poser.py index 3cf1897..e687c7a 100644 --- a/talkinghead/tha3/app/manual_poser.py +++ b/talkinghead/tha3/app/manual_poser.py @@ -939,6 +939,35 @@ class MainFrame(wx.Frame): logger.info(f"Saved image {image_file_name}") except Exception as exc: logger.error(f"Could not save {image_file_name}, reason: {exc}") + + # Save `_emotions.json`, for use as customized emotion templates. + # + # There are three possibilities for what we could do here: + # + # - Trim away any morphs that have a zero value, because zero is the default, + # optimizing for file size. But this is just a small amount of text anyway. + # - Add any zero morphs that are missing. Because `self.emotions` came from files, + # it might not have all keys. This yields an easily editable file that explicitly + # lists what is possible. + # - Just dump the data from `self.emotions` as-is. This way the content for each + # emotion matches the emotion templates in `talkinghead/emotions/*.json`. + # This approach is the most transparent. + # + # At least for now, we opt for transparency. It is also the simplest to implement. + # + # Note that what we produce here is not a copy of `_defaults.json`, but instead, the result + # of the loading logic with fallback. That is, the content of the individual emotion files + # overrides the factory presets as far as `self.emotions` is concerned. + # + # We just trim away the [custom] and [reset] "emotions", which have no meaning outside the manual poser. + # The result will be stored in alphabetically sorted order automatically, because `dict` preserves + # insertion order, and `self.emotions` itself is stored alphabetically. + logger.info(f"Saving {dir_name}/_emotions.json...") + trimmed_emotions = {k: v for k, v in self.emotions.items() if not (k.startswith("[") and k.endswith("]"))} + emotions_json_file_name = os.path.join(dir_name, "_emotions.json") + with open(emotions_json_file_name, "w") as file: + json.dump(trimmed_emotions, file, indent=4) + logger.info("Batch save finished.") finally: dir_dialog.Destroy() diff --git a/talkinghead/tha3/app/postprocessor.py b/talkinghead/tha3/app/postprocessor.py index 3bc07ca..6b4e131 100644 --- a/talkinghead/tha3/app/postprocessor.py +++ b/talkinghead/tha3/app/postprocessor.py @@ -31,7 +31,7 @@ import torchvision # ("banding", {}), # ("scanlines", {}) # ] -default_chain = [] # TODO: disabled temporarily to get a PR in early, since we are still missing config support in client +default_chain = [] # Overridden by the animator, which sends us the chain.
T = TypeVar("T") Atom = Union[str, bool, int, float]