
Stable Diffusion 3.5

Stable Diffusion 3.5 is Stability AI's text-to-image model family based on the Multimodal Diffusion Transformer (MMDiT) architecture with rectified flow matching.

Model Variants

Stable Diffusion 3.5 Large

  • 8.1 billion parameter MMDiT model
  • Highest quality and prompt adherence in the SD family
  • 1 megapixel native resolution (1024×1024)
  • 28-50 inference steps recommended
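To illustrate "1 megapixel native resolution with flexible aspect ratios", here is a small hypothetical helper (not part of any SD tooling) that picks a width/height near a target pixel count for a given aspect ratio. SD3.5 latent dimensions must be divisible by 16; snapping to 64 is a common conservative choice:

```python
import math

def sd35_resolution(aspect_ratio: float, megapixels: float = 1.0,
                    multiple: int = 64) -> tuple[int, int]:
    """Pick (width, height) near the target megapixel count for a given
    aspect ratio, snapped to `multiple` (hypothetical helper)."""
    target_px = megapixels * 1024 * 1024
    width = math.sqrt(target_px * aspect_ratio)
    height = width / aspect_ratio

    def snap(v: float) -> int:
        return max(multiple, round(v / multiple) * multiple)

    return snap(width), snap(height)

# sd35_resolution(1.0)    -> (1024, 1024)
# sd35_resolution(16 / 9) -> (1344, 768)
```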

Stable Diffusion 3.5 Large Turbo

  • Distilled version of SD 3.5 Large
  • 4-step inference for fast generation
  • Guidance scale of 0 (classifier-free guidance disabled)
  • Comparable quality to the full model in a fraction of the time
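A guidance scale of 0 means the classifier-free guidance mix is never evaluated, so the sampler also skips the unconditional forward pass, roughly halving per-step compute. A minimal sketch of the standard CFG formula that Turbo turns off:

```python
def cfg_combine(uncond: list[float], cond: list[float],
                guidance_scale: float) -> list[float]:
    """Classifier-free guidance mix over per-element model predictions.
    Toy sketch: when CFG is disabled (as for SD 3.5 Large Turbo), samplers
    use the conditional prediction alone instead of evaluating this."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# cfg_combine([0.0], [1.0], 7.0) -> [7.0]  (prediction pushed toward cond)
```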

Stable Diffusion 3.5 Medium

  • 2.5 billion parameter MMDiT-X architecture
  • Designed for consumer hardware (9.9 GB VRAM for the transformer)
  • Dual attention blocks in first 12 transformer layers
  • Multi-resolution generation from 0.25 to 2 megapixels
  • Skip Layer Guidance recommended for better coherency

Key Features

  • Three text encoders: CLIP ViT-L, OpenCLIP ViT-bigG (77 tokens each), T5-XXL (256 tokens)
  • QK-normalization for stable training and easier fine-tuning
  • Rectified flow matching replaces traditional DDPM/DDIM sampling
  • Strong text rendering and typography in generated images
  • Diverse output styles (photography, 3D, painting, line art)
  • Highly customizable base for fine-tuning and LoRA training
  • T5-XXL encoder optional (can be removed to save memory with minimal quality loss)
  • Supports negative prompts for excluding unwanted elements
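Rectified flow matching trains a velocity field v(x, t) that transports noise (t=1) toward data (t=0) along near-straight paths, so sampling reduces to plain ODE integration. A toy Euler sampler under an idealized constant-velocity field (real samplers use the learned transformer as `velocity_fn`):

```python
def euler_rectified_flow(x: float, velocity_fn, steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)
    with fixed-size Euler steps. Toy 1-D sketch, not a production sampler."""
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        x = x - dt * velocity_fn(x, t)
        t -= dt
    return x

# With the ideal straight-line field v = x1 - x0, any step count recovers
# the data point x0 exactly from the noise point x1.
```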

Hardware Requirements

  • Large: 24GB+ VRAM recommended (fp16), quantizable to fit smaller GPUs
  • Large Turbo: 16GB+ VRAM recommended
  • Medium: 10GB VRAM minimum (excluding text encoders)
  • NF4 quantization available via bitsandbytes for low-VRAM GPUs
  • CPU offloading supported via diffusers pipeline
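The VRAM figures above follow from simple arithmetic on parameter count and bytes per parameter. A back-of-the-envelope estimator (weights only; activations, KV buffers, and text encoders push real usage higher):

```python
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight memory in GiB -- ignores activations and text encoders,
    so actual VRAM use is higher than this estimate."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# SD3.5 Large transformer (8.1B params):
#   fp16 (2 bytes/param)       -> ~15.1 GiB
#   NF4  (~0.5 byte/param)     -> ~3.8 GiB
# SD3.5 Medium transformer (2.5B params), fp16 -> ~4.7 GiB
```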

Common Use Cases

  • Photorealistic image generation
  • Artistic illustration and concept art
  • Typography and text-heavy designs
  • Product visualization
  • Fine-tuning and LoRA development
  • ControlNet-guided generation

Key Parameters

  • steps: 28-50 for Large, 4 for Large Turbo, 20-40 for Medium
  • guidance_scale: 4.5-7.5 for Large/Medium, 0 for Large Turbo
  • max_sequence_length: T5 token limit (lower values like 77 save memory; 256 allows longer prompts and better prompt understanding; the CLIP encoders are fixed at 77)
  • resolution: 1024×1024 native, flexible aspect ratios around 1MP
  • negative_prompt: Text describing elements to exclude (has no effect in Large Turbo, since classifier-free guidance is disabled)
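The recommendations above can be condensed into a small lookup table. A hypothetical helper (the mid-range values are one reasonable choice within the recommended ranges, not official defaults):

```python
# Hypothetical defaults condensing the recommended ranges above.
SD35_DEFAULTS = {
    "large":       {"steps": 40, "guidance_scale": 5.5},  # range: 28-50 / 4.5-7.5
    "large-turbo": {"steps": 4,  "guidance_scale": 0.0},  # CFG disabled
    "medium":      {"steps": 30, "guidance_scale": 5.5},  # range: 20-40 / 4.5-7.5
}

def sd35_params(variant: str) -> dict:
    """Return suggested sampler settings for an SD 3.5 variant."""
    try:
        return dict(SD35_DEFAULTS[variant])
    except KeyError:
        raise ValueError(f"unknown SD 3.5 variant: {variant!r}") from None
```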