HuMo

HuMo is a human-centric video generation model by ByteDance that generates video under collaborative multi-modal conditioning, combining text, image, and audio inputs.

Model Variants

HuMo (Wan2.1-T2V-1.3B based)

  • Built on the Wan2.1-T2V-1.3B video foundation model
  • Supports Text+Image (TI), Text+Audio (TA), and Text+Image+Audio (TIA) modes
  • Two-stage training: subject preservation then audio-visual sync
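The three generation modes differ only in which inputs accompany the text prompt. A minimal sketch of that contract (the names here are illustrative, not the actual HuMo API):

```python
# Illustrative sketch of HuMo's conditioning modes: every mode takes a
# text prompt, plus an image (TI), audio (TA), or both (TIA).
# `Conditioning` and `validate_mode` are hypothetical helper names.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conditioning:
    prompt: str                   # text is required in every mode
    image: Optional[str] = None   # path to a reference image
    audio: Optional[str] = None   # path to a speech audio clip


def validate_mode(mode: str, cond: Conditioning) -> None:
    """Check that the inputs match the requested mode (TI, TA, or TIA)."""
    needs = {"TI": ("image",), "TA": ("audio",), "TIA": ("image", "audio")}
    if mode not in needs:
        raise ValueError(f"unknown mode: {mode}")
    for field in needs[mode]:
        if getattr(cond, field) is None:
            raise ValueError(f"mode {mode} requires a {field} input")


validate_mode("TIA", Conditioning("a person speaking",
                                  image="ref.png", audio="speech.wav"))
```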

Key Features

  • Multi-modal conditioning: text, reference images, and audio simultaneously
  • Subject identity preservation from reference images across frames
  • Audio-driven lip synchronization with facial expression alignment
  • Focus-by-predicting strategy for facial region attention during audio sync
  • Time-adaptive guidance dynamically adjusts input weights across denoising steps
  • Minimally invasive image injection preserves the base model's prompt understanding
  • Progressive two-stage training separates identity learning from audio sync
  • Supports text-controlled appearance editing while preserving identity
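Time-adaptive guidance can be pictured as a dual classifier-free-guidance combination whose audio weight varies over the denoising trajectory. The sketch below is an assumption-laden toy (scalar inputs, a made-up linear ramp for the audio weight); HuMo's actual weighting schedule is not specified here:

```python
def guided_noise(eps_uncond: float, eps_text: float, eps_audio: float,
                 scale_t: float = 7.5, scale_a: float = 2.0,
                 t: float = 0.0) -> float:
    """Toy dual-CFG combination of unconditional, text, and audio
    noise predictions. `t` in [0, 1] is the fraction of denoising
    completed; as an illustrative time-adaptive rule, the audio
    weight ramps up linearly in later steps. The real HuMo schedule
    and tensor shapes differ."""
    w_a = scale_a * t  # hypothetical ramp, not HuMo's actual schedule
    return (eps_uncond
            + scale_t * (eps_text - eps_uncond)
            + w_a * (eps_audio - eps_text))
```

Defaults mirror the `scale_t` and `scale_a` values listed under Key Parameters below.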

Hardware Requirements

  • Minimum: 24GB VRAM (RTX 3090/4090 or similar)
  • Multi-GPU inference supported via FSDP and sequence parallelism
  • Whisper-large-v3 audio encoder required for audio modes
  • Optional audio separator for cleaner speech input
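Because Whisper-large-v3 expects 16 kHz mono audio and HuMo outputs video at 25 FPS, the audio clip must cover the generated frame span. A small helper for that alignment arithmetic (the function name is illustrative):

```python
WHISPER_SR = 16_000  # Whisper-large-v3 expects 16 kHz mono input
VIDEO_FPS = 25       # HuMo's output frame rate


def audio_samples_for_frames(num_frames: int) -> int:
    """Number of 16 kHz audio samples covering num_frames of video."""
    return round(num_frames / VIDEO_FPS * WHISPER_SR)


# 97 frames at 25 FPS is ~3.88 s, i.e. 62,080 samples at 16 kHz
audio_samples_for_frames(97)
```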

Common Use Cases

  • Digital avatar and virtual presenter creation
  • Audio-driven talking head generation
  • Character-consistent video clips from reference photos
  • Lip-synced dialogue video from audio tracks
  • Prompted reenactment with identity preservation
  • Text-controlled outfit and style changes on consistent subjects

Key Parameters

  • mode: Generation mode (TI, TA, or TIA)
  • scale_t: Text guidance strength (default: 7.5)
  • scale_a: Audio guidance strength (default: 2.0)
  • frames: Number of output frames (97 at 25 FPS = ~4 seconds)
  • height/width: Output resolution (480p or 720p supported)
  • steps: Denoising steps (30-50 recommended)
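The parameters above can be collected into a generation config like the following sketch; the dictionary keys mirror the parameter names listed here, but the exact structure of HuMo's real config file may differ:

```python
# Hypothetical generation config; key names follow the parameter list
# above, not necessarily HuMo's actual config schema.
config = {
    "mode": "TIA",      # TI, TA, or TIA
    "scale_t": 7.5,     # text guidance strength (default)
    "scale_a": 2.0,     # audio guidance strength (default)
    "frames": 97,       # 97 frames at 25 FPS is ~4 seconds
    "height": 720,      # 480p and 720p are supported
    "width": 1280,
    "steps": 50,        # 30-50 denoising steps recommended
}

duration_s = config["frames"] / 25  # clip length in seconds
```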