Daniel Sandner, for article on https://sandner.art/
# Qwen Image and Edit: Open-sourcing and Local GGUF Generations with Lightning
## Research Document for Article Development
---
## 1. Model Overview & Capabilities
### Qwen-Image
**Released:** August 4, 2025
**Architecture:** 20B parameter Multimodal Diffusion Transformer (MMDiT)
**License:** Apache 2.0 (fully open-source)
#### Key Capabilities:
- **Superior Text Rendering**: Excels at complex text rendering including multi-line layouts, paragraph-level semantics, and fine-grained details
- **Multilingual Support**: Strong performance in both alphabetic (e.g., English) and logographic (e.g., Chinese) languages
- **Diverse Artistic Styles**: From photorealistic to impressionist paintings, anime aesthetics to minimalist design
- **Seven Aspect Ratios**: Supports 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3
- **32,000 Token Prompt Window**: Inherited from Qwen 2.5-VL backbone (vs. ~75 tokens in Stable Diffusion CLIP)
- **Resolution Support**: 256p to 1328p with dynamic resolution processing
### Qwen-Image-Edit
**Released:** August 18, 2025
**Updated:** September 22, 2025 (Qwen-Image-Edit-2509)
#### Key Capabilities:
- **Precise Text Editing**: Bilingual (Chinese/English) text editing with preservation of original font, size, and style
- **Dual Semantic and Appearance Editing**:
  - Low-level appearance editing (add/remove/modify elements while keeping other regions unchanged)
  - High-level semantic editing (IP creation, object rotation, style transfer with semantic consistency)
- **Multi-Image Editing** (2509 version): Supports 1-3 input images for person+person, person+product, person+scene combinations
- **Enhanced Consistency**: Improved facial identity preservation, product consistency, and text formatting
- **Novel View Synthesis**: 90° and 180° object rotation capabilities
- **ControlNet Support**: Depth maps, edge maps, keypoint maps, and more
---
## 2. Training Architecture & Methodology
### Core Components
#### 1. Multimodal Large Language Model (MLLM)
- **Model**: Qwen2.5-VL (7B parameters)
- **Function**: Extracts rich semantic features from text prompts
- **Processing**: Uses last layer hidden states for deep language understanding
- **Advantage**: An LLM-scale text encoder rather than the much smaller CLIP text encoders
#### 2. Variational AutoEncoder (VAE)
- **Architecture**: Single-encoder, dual-decoder
- **Base**: Frozen encoder from Wan-2.1-VAE
- **Fine-tuning**: Image decoder fine-tuned on text-rich datasets (PDFs, posters, synthetic paragraphs)
- **Purpose**: Minimizes artifacts, enhances reconstruction fidelity for small texts
#### 3. Multimodal Diffusion Transformer (MMDiT)
- **Parameters**: 20 billion
- **Method**: Flow matching with Ordinary Differential Equations (ODEs)
- **Architecture**: Text features treated as 2D tensors, concatenated diagonally with image latents
- **Positional Encoding**: MSRoPE (Multimodal Scalable Rotary Positional Encoding)
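
To make the flow-matching bullet concrete, here is a toy sketch of ODE sampling with fixed Euler steps. It is not Qwen-Image's sampler: the `noise` and `target` vectors, the `toy_velocity` function, and the 0-to-1 time convention are assumptions chosen for illustration. In the real model the MMDiT network predicts the velocity.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical start ("noise") and end ("data") points for the toy example.
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]

def toy_velocity(x, t):
    # For a straight-line path x_t = (1 - t) * noise + t * target,
    # the ideal velocity is constant: target - noise.
    return [b - a for a, b in zip(noise, target)]

sample_4 = euler_sample(toy_velocity, noise, 4)
sample_50 = euler_sample(toy_velocity, noise, 50)
```

Because the toy velocity is constant, 4 steps land on the target as accurately as 50. A learned velocity field is generally not that straight, which is one way to read why few-step sampling usually relies on distillation such as the Lightning LoRA discussed later.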
### Training Methodology
#### Progressive Curriculum Learning (3 Stages)
**Stage 1: Non-Text to Text Rendering Fundamentals**
- Establishes basic text rendering capabilities
- Simple captioned images and non-text content
**Stage 2: Simple to Complex Textual Inputs**
- Advances to layout-sensitive text scenarios
- Mixed-language rendering
- Builds understanding of compositional text elements
**Stage 3: Paragraph-Level Description Scaling**
- Dense paragraphs and complex multi-line layouts
- Full semantic understanding at paragraph level
- Handles intricate typography details
#### Multi-Task Training Paradigm
The model jointly optimizes three tasks:
1. **Text-to-Image (T2I)**: Prompt-conditioned synthesis
2. **Text-Image-to-Image (TI2I)**: Instruction-driven image modifications
3. **Image-to-Image (I2I)**: High-fidelity reconstruction for latent alignment
#### Dual-Encoding Architecture (Edit Model)
**Semantic Path:**
- Input image → Qwen2.5-VL → Semantic understanding
- Maintains high-level meaning and context
**Appearance Path:**
- Input image → VAE Encoder → Reconstructive embeddings
- Preserves color, layout, fine-grained structure (glyphs, spatial motifs)
**Fusion:** Representations combined via concatenation or cross-attention within MMDiT
### Data Pipeline
Comprehensive pipeline includes:
- Large-scale data collection (200M+ image-text pairs)
- Advanced filtering and quality control
- Detailed annotation
- Synthetic data generation
- Data balancing for diverse scenarios
- Focus on text-rich datasets for training
---
## 3. Performance Benchmarks
### Generation Benchmarks (SOTA Results)
- **GenEval**: State-of-the-art general image generation
- **DPG**: Top performance on dense-prompt following
- **OneIG-Bench**: Excellent cross-category results
### Editing Benchmarks (SOTA Results)
- **GEdit**: Leading performance in guided editing
- **ImgEdit**: Best-in-class image modification
- **GSO**: Superior novel view synthesis (Google Scanned Objects)
### Text Rendering Benchmarks (Outstanding Performance)
- **LongText-Bench**: Significant margin over competitors
- **ChineseWord**: Exceptional Chinese text rendering
- **TextCraft**: Superior typography and layout
- **CVTG-2K**: Complex visual text generation
---
## 4. ComfyUI Implementation
### Model Requirements
#### Main Models (Choose One):
**FP8 Quantization:**
- `qwen_image_fp8_e4m3fn.safetensors` (standard model)
- `qwen_image_edit_fp8_e4m3fn.safetensors` (edit model)
- `qwen_image_edit_2509_fp8.safetensors` (latest edit model)
**GGUF Quantizations (for lower VRAM):**
Available from city96/Qwen-Image-gguf and QuantStack:
- **Q2_K**: 7.06 GB (lowest, ~6-8GB VRAM)
- **Q3_K_S**: 8.95 GB
- **Q3_K_M**: 9.68 GB
- **Q4_K_S**: 12.1 GB (recommended for 12GB VRAM)
- **Q4_K_M**: 13.1 GB
- **Q5_K_M**: 14.9 GB (recommended for 16GB VRAM)
- **Q6_K**: 16.8 GB
- **Q8_0**: 21.8 GB (recommended for 24GB VRAM)
**Note:** Q5_K_M, Q4_K_M, and low bitrate quants use dynamic logic where first/last layers are kept in high precision for better quality.
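
The VRAM recommendations above can be captured in a small lookup helper. This is a minimal sketch: the function name `recommend_quant` and the tier boundaries are assumptions derived only from the "recommended for" notes in the list, not from any official tool.

```python
# File sizes (GB) as listed above for the Qwen-Image GGUF quantizations.
QUANT_SIZES_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_K_S": 12.1,
    "Q4_K_M": 13.1, "Q5_K_M": 14.9, "Q6_K": 16.8, "Q8_0": 21.8,
}

# (minimum VRAM in GB, quant) tiers taken from the recommendations above,
# ordered from largest to smallest. Anything under 12 GB falls back to Q2_K.
RECOMMENDED_TIERS = [(24, "Q8_0"), (16, "Q5_K_M"), (12, "Q4_K_S"), (0, "Q2_K")]

def recommend_quant(vram_gb):
    """Return (quant_name, file_size_gb) for the first tier the GPU satisfies."""
    for min_vram, quant in RECOMMENDED_TIERS:
        if vram_gb >= min_vram:
            return quant, QUANT_SIZES_GB[quant]
    raise ValueError("vram_gb must be non-negative")
```

For example, `recommend_quant(16)` returns the Q5_K_M entry. Note that some recommended files are larger than the card's VRAM, so the tiers assume the loader can offload part of the model.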
#### Text Encoder:
**Standard (FP8 Scaled):**
- `qwen_2.5_vl_7b_fp8_scaled.safetensors` (9.38 GB)
**Full Precision:**
- `qwen_2.5_vl_7b.safetensors` (16.6 GB)
**GGUF Options** (for edit models using ComfyUI-GGUF):
- `Qwen2.5-VL-7B-Instruct-Q4_0.gguf` (main text encoder)
- `Qwen2.5-VL-7B-Instruct-Q6_K.gguf`
- `Qwen2.5-VL-7B-Instruct-mmproj-BF16` (mmproj component)
**Important:** When using GGUF diffusion models, you MUST use GGUF text encoders. Cannot mix FP8 scaled with GGUF loaders.
#### VAE:
- `qwen_image_vae.safetensors` (254 MB, same for all versions)
#### Lightning LoRA (Speed Enhancement):
**V2.0 (Latest, Recommended):**
- `Qwen-Image-Lightning-4steps-V2.0.safetensors` (4-step generation)
- `Qwen-Image-Lightning-8steps-V2.0.safetensors` (8-step generation)
**V1.0:**
- `Qwen-Image-Lightning-4steps-V1.0.safetensors`
- `Qwen-Image-Lightning-8steps-V1.0.safetensors`
**Edit Model Lightning:**
- `Qwen-Image-Edit-Lightning-4steps-V1.0.safetensors`
- `Qwen-Image-Edit-Lightning-8steps-V1.0.safetensors`
- `Qwen-Image-Edit-2509-Lightning-4steps-V1.0.safetensors`
- `Qwen-Image-Edit-2509-Lightning-8steps-V1.0.safetensors`
**V2.0 Improvements:** Reduced over-saturation, improved skin texture, more natural-looking visuals
### Directory Structure
```
ComfyUI/
├── models/
│   ├── diffusion_models/ (or unet/ for GGUF)
│   │   ├── qwen_image_fp8_e4m3fn.safetensors
│   │   ├── qwen_image_edit_fp8_e4m3fn.safetensors
│   │   └── qwen-image-Q4_K_M.gguf
│   ├── text_encoders/
│   │   ├── qwen_2.5_vl_7b_fp8_scaled.safetensors
│   │   └── qwen/ (subfolder for GGUF)
│   │       ├── Qwen2.5-VL-7B-Instruct-Q4_0.gguf
│   │       └── Qwen2.5-VL-7B-Instruct-mmproj-BF16
│   ├── vae/
│   │   └── qwen_image_vae.safetensors
│   └── loras/
│       └── Qwen-Image-Lightning-4steps-V2.0.safetensors
```
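
A quick way to confirm the layout before opening ComfyUI is to check that the expected files exist. The sketch below is illustrative: `EXPECTED_FILES` lists only the FP8 variant from the tree above, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Relative paths from the layout above; adjust to the files you actually use.
EXPECTED_FILES = [
    "models/diffusion_models/qwen_image_fp8_e4m3fn.safetensors",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
    "models/loras/Qwen-Image-Lightning-4steps-V2.0.safetensors",
]

def missing_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files("/path/to/ComfyUI")` returns an empty list when everything is in place, and otherwise the paths that still need downloading.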
### Required Custom Nodes
1. **ComfyUI-GGUF** (by city96): Essential for GGUF model support
2. **ComfyUI-Manager**: For easy updates and node management
3. **Optional: ComfyUI-QwenEditUtils** (by lrzjason): Enhanced editing utilities
### Model Sources
- **Comfy-Org/Qwen-Image_ComfyUI** (Hugging Face)
- **Comfy-Org/Qwen-Image-Edit_ComfyUI** (Hugging Face)
- **city96/Qwen-Image-gguf** (Hugging Face)
- **QuantStack/Qwen-Image-GGUF** (Hugging Face)
- **QuantStack/Qwen-Image-Edit-GGUF** (Hugging Face)
- **lightx2v/Qwen-Image-Lightning** (Hugging Face)
---
## 5. Lightning Acceleration Technology
### Overview
Qwen-Image-Lightning uses LoRA-based distillation to cut inference from 50 steps to 4-8 steps, reducing generation time by roughly 50-80% while keeping quality close to the base model.
### Technical Details
**Method:** Progressive distillation using teacher-student learning
- Teacher: Original Qwen-Image model (50 steps)
- Student: Lightning LoRA weights (4-8 steps)
- Training: Uses shift=3 in distillation with dynamic shifting
**Scheduler Configuration:**
```python
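import math

# Scheduler settings matching the shift used during distillation.
scheduler_config = {
    "base_shift": math.log(3),  # shift=3 in distillation
    "max_shift": math.log(3),   # same as base_shift, so the shift is constant
    "use_dynamic_shifting": True,
    "time_shift_type": "exponential",
    "num_train_timesteps": 1000,
}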
```
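
To see what `math.log(3)` does in this config, the sketch below evaluates an exponential time shift by hand. The formula is an assumption about how flow-matching Euler schedulers commonly apply the shift (with an exponent `sigma` of 1.0), and the helper names are hypothetical. With `base_shift` equal to `max_shift`, the shift parameter `mu` stays at `log(3)` regardless of image size.

```python
import math

def time_shift_exponential(mu, t, sigma=1.0):
    """Exponential time shift: exp(mu) / (exp(mu) + (1/t - 1) ** sigma)."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)

def static_shift(shift, t):
    """Equivalent closed form for sigma = 1: shift * t / (1 + (shift - 1) * t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)

mu = math.log(3)  # base_shift == max_shift, so mu is constant
for t in (0.25, 0.5, 0.75):
    print(t, round(time_shift_exponential(mu, t), 4), round(static_shift(3.0, t), 4))
```

Both columns agree, which shows that `mu = log(3)` in the exponential form is the same as a plain shift factor of 3. The shift pushes sampling toward the high-noise end of the schedule.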
### Performance Comparison
**Standard Model (50 steps):**
- Generation time: 8-12 minutes on 8GB VRAM
- Quality: Maximum detail and precision
**8-Step Lightning (V2.0):**
- Generation time: 5-6 minutes on 8GB VRAM
- Quality: 95%+ of original with improved color balance
- Recommended CFG: 4.5
**4-Step Lightning (V2.0):**
- Generation time: 2-3 minutes on 8GB VRAM
- Quality: 90%+ of original, excellent for rapid iteration
- Recommended CFG: 1.0-2.5
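
The timing ranges above can be turned into a rough "time saved" figure. This sketch uses the midpoint of each quoted range, which is an assumption made for the arithmetic only. The helper names are hypothetical.

```python
def time_reduction(base_minutes, fast_minutes):
    """Fraction of generation time saved relative to the base run."""
    return (base_minutes - fast_minutes) / base_minutes

def midpoint(low, high):
    """Midpoint of a quoted range."""
    return (low + high) / 2.0

base = midpoint(8, 12)        # 50-step range quoted above
eight_step = midpoint(5, 6)   # 8-step Lightning range
four_step = midpoint(2, 3)    # 4-step Lightning range

print(f"8-step saves {time_reduction(base, eight_step):.0%}")
print(f"4-step saves {time_reduction(base, four_step):.0%}")
```

At the midpoints this works out to about 45% for the 8-step LoRA and 75% for the 4-step LoRA, so the 8-step figure sits near the low end of the 50-80% range quoted in the overview.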
### FP8 Base Compatibility Issue (Resolved)
**Problem:** Original Lightning LoRA + qwen_image_fp8_e4m3fn.safetensors caused grid artifacts
**Cause:** FP8 model was direct downcast, not calibrated conversion
**Solution:** New Lightning LoRA weights specifically distilled from FP8 base with BF16 guidance
### Cache Acceleration
**Cache-dit Technology:** Enables 3.5-step inference with cache acceleration for even faster generation while maintaining quality.
---
## 6. Prompting Techniques & Best Practices
### Prompt Structure Formula
#### Basic Formula:
```
[Subject] + [Action/Pose] + [Environment] + [Style] + [Mood/Lighting] + [Technical Specs]
```
#### Advanced Formula:
```
[Framing/Perspective] + [Lens Type] + [Subject Description] + [Scene Description] +
[Style Definition] + [Atmosphere Words] + [Detail Modifiers] + [Text Elements]
```
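
The basic formula can be applied mechanically with a small helper that assembles components in a fixed order. This is a sketch only: the key names in `BASIC_ORDER` and the function `build_prompt` are hypothetical labels for the formula's slots.

```python
# Component order follows the basic formula above.
BASIC_ORDER = ["subject", "action", "environment", "style", "lighting", "technical"]

def build_prompt(parts, order=BASIC_ORDER):
    """Join the non-empty components in formula order, separated by commas."""
    unknown = set(parts) - set(order)
    if unknown:
        raise KeyError(f"unknown components: {sorted(unknown)}")
    return ", ".join(parts[key].strip() for key in order if parts.get(key))

prompt = build_prompt({
    "style": "photorealistic",
    "subject": "a barista",
    "action": "pouring latte art",
    "environment": "in a cozy coffee shop",
    "lighting": "warm golden hour lighting",
})
```

The dictionary can be filled in any order. The helper always emits subject first and modifiers last, which keeps prompts consistent across a batch.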
### Specific Prompting Guidelines
#### 1. Text Rendering
**Best Practices:**
- Put exact text in double quotes: "Welcome to Qwen-Image"
- Specify font style if important: "in Arial Bold font"
- Include text position: "centered on the sign"
- Mention text color: "in bright red color"
- For bilingual: Seamlessly switch languages mid-prompt
**Example:**
```
A coffee shop entrance with chalkboard sign reading "Qwen Coffee 😊 $2 per cup",
neon light displaying "通义千问", poster with "π≈3.1415926..." beneath
```
#### 2. General Image Generation
**Optimal Prompt Length:** 50-200 characters (1-3 sentences)
- Too short: Lacks necessary information
- Too long: May cause confusion or token waste
**Prompt Order Matters:**
1. Main subject first
2. Environment/background
3. Finer details and modifiers
**Quality Enhancement Suffixes:**
- English: ", Ultra HD, 4K, cinematic composition."
- Chinese: ", 超清,4K,电影级构图."
#### 3. Style Specification
**Common Styles:**
- Photorealistic: "professional photography, DSLR quality, sharp focus"
- Artistic: "impressionist painting, watercolor style, oil painting"
- Digital: "3D render, digital art, concept art"
- Anime: "anime aesthetic, Studio Ghibli style"
- Minimalist: "minimalist design, clean composition"
#### 4. Technical Parameters
**Framing Examples:**
- Long shot, full shot, medium shot, close-up, extreme close-up
**Perspective Examples:**
- Eye level, low angle, high angle, bird's eye view, worm's eye view
**Lens Types:**
- Wide-angle (10-24mm), standard (35-70mm), telephoto (85-300mm)
- Fish-eye, macro
**Lighting:**
- Golden hour, dramatic lighting, soft diffused, rim lighting, backlit
**Composition:**
- Rule of thirds, symmetrical, diagonal, leading lines
#### 5. Advanced Techniques
**Atmosphere Words:**
- "dreamy", "lonely", "magnificent", "mysterious", "vibrant", "serene"
**Detail Modifiers:**
- "highly detailed", "intricate", "ornate", "textured", "weathered"
**Negative Prompts:**
- Keep minimal to avoid fighting positive intent
- Use for specific artifacts: "blurry, low quality, distorted text"
### Image Editing Prompts
#### Semantic Editing:
**Structure:**
```
[Action] + [Target Element] + [Desired Change] + [Context Preservation]
```
**Examples:**
- "Transform the character into anime style while keeping the background"
- "Rotate the object 90 degrees to show the side view"
- "Change the style to Studio Ghibli animation"
#### Appearance Editing:
**Structure:**
```
[Specific Instruction] + [Element to Change] + [Keep/Preserve Specifications]
```
**Examples:**
- "Replace the blue shirt with a red jacket, keep everything else unchanged"
- "Remove the background objects, preserve the main subject"
- "Add a hat to the person without changing facial features"
#### Text Editing:
**Structure:**
```
[Text Action] + [Original Text] + [New Text] + [Style Preservation]
```
**Examples:**
- "Replace 'SALE' with 'CLEARANCE' in the same font and color"
- "Change the sign text from 'OPEN' to 'CLOSED' maintaining the handwritten style"
- "Add the text 'Limited Edition' in gold lettering at the bottom"
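
When many text replacements are needed, for example when localizing a set of banners, the structure above can be filled in programmatically. The helper below is a hypothetical sketch, and its default preservation clause simply mirrors the first example.

```python
def text_edit_instruction(original, new, preserve="the same font and color"):
    """Build a 'replace text' instruction following the structure above."""
    if not original or not new:
        raise ValueError("both original and new text are required")
    return f"Replace '{original}' with '{new}' in {preserve}"

instruction = text_edit_instruction("SALE", "CLEARANCE")
```

Here `instruction` reproduces the first example in the list. Passing a different `preserve` value covers cases such as a handwritten style.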
#### Multi-Image Editing (2509):
**Structure:**
```
[Action] + [Source Image Reference] + [Target Image Reference] + [Specific Elements]
```
**Examples:**
- "Apply the outfit from image 2 to the person in image 1"
- "Transfer the hairstyle from the first person to the second person"
- "Merge the background from image 3 with the subject from image 1"
### Parameter Guidelines
**Steps:**
- Testing: 20-30 steps
- Final quality: 50-70 steps
- Lightning 8-step: 8 steps
- Lightning 4-step: 4 steps
**CFG Scale (Guidance):**
- Recommended: 4.0-5.0
- More creative: 2.5-3.5
- Strict adherence: 7.0-10.0
- Lightning 8-step: 4.5
- Lightning 4-step: 1.0-2.5
**Seed:**
- Fixed seed + same prompt and settings = reproducible output (on the same hardware and software stack)
- Useful for parameter iteration
- Random seed for variation exploration
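
The step and CFG guidelines above can be kept as presets so that switching between base and Lightning runs does not leave a stale value behind. This is a sketch with hypothetical names. The "testing" step count of 25 and the 4-step CFG of 1.5 are picks from inside the quoted ranges, not official defaults.

```python
# Values copied from the guidelines above; 25 steps and CFG 1.5 are
# illustrative picks from inside the quoted ranges.
PRESETS = {
    "testing": {"steps": 25, "cfg": 4.5},
    "final": {"steps": 50, "cfg": 4.5},
    "lightning_8": {"steps": 8, "cfg": 4.5},
    "lightning_4": {"steps": 4, "cfg": 1.5},
}

def sampler_settings(mode, seed=None):
    """Return a copy of the preset for `mode`, optionally pinning the seed."""
    if mode not in PRESETS:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(PRESETS)}")
    settings = dict(PRESETS[mode])
    if seed is not None:
        settings["seed"] = seed
    return settings
```

For instance, `sampler_settings("lightning_4", seed=12345)` yields a 4-step configuration with a fixed seed for parameter iteration.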
---
## 7. Alternative Text Encoders & Compatibility
### Primary Text Encoder
**Qwen2.5-VL-7B** is the primary and recommended text encoder:
- Specifically trained for Qwen-Image architecture
- Provides deep semantic understanding
- Supports 32K token context window
- Optimized for multilingual text rendering
### CLIP Encoder Compatibility
**Important:** Qwen-Image is NOT compatible with standard CLIP encoders (like those used in Stable Diffusion) due to architectural differences:
**Reasons:**
1. **Architecture Mismatch:** Qwen uses MLLM-based encoding vs. CLIP's contrastive learning
2. **Token Length:** Qwen supports 32K tokens vs CLIP's ~75 tokens
3. **Embedding Space:** Different latent space dimensions and structures
4. **Training Data:** Qwen encoder trained specifically on text-rich datasets
### Text Encoder Options
#### Standard Options:
1. **FP8 Scaled** (Recommended for 16GB+ VRAM):
   - `qwen_2.5_vl_7b_fp8_scaled.safetensors` (9.38 GB)
   - Best balance of quality and VRAM usage
2. **Full Precision** (For maximum quality, 24GB+ VRAM):
   - `qwen_2.5_vl_7b.safetensors` (16.6 GB)
#### GGUF Options (For low VRAM systems):
When using GGUF diffusion models, you MUST use GGUF text encoders:
1. **Q4_0** (Recommended for 8-12GB VRAM):
   - `Qwen2.5-VL-7B-Instruct-Q4_0.gguf` (~5 GB)
   - Requires: `Qwen2.5-VL-7B-Instruct-mmproj-BF16` (mmproj file)
2. **Q6_K** (For 16GB+ VRAM):
   - `Qwen2.5-VL-7B-Instruct-Q6_K.gguf` (~6 GB)
   - Better quality than Q4_0, minimal quality loss
3. **Q8_0** (Maximum GGUF quality):
   - Similar quality to FP8 but GGUF format
**Important Notes:**
- GGUF encoders have two components: main encoder + mmproj
- Both files must be in the same directory
- Cannot mix FP8 scaled with GGUF loaders (will cause errors)
- Use ComfyUI-GGUF custom node for GGUF support
### Chinese CLIP (Not Compatible)
While Qwen team developed Chinese CLIP models separately, these are NOT compatible with Qwen-Image:
- Chinese CLIP uses different architecture (contrastive learning)
- Designed for vision-language retrieval tasks, not generation
- Cannot replace Qwen2.5-VL encoder in generation workflow
---
## 8. Prompt Enhancement for Local ComfyUI
### Available Solutions for Local Prompt Enhancement
#### 1. ComfyUI-IF_AI_tools (Recommended)
**Repository:** https://github.com/if-ai/ComfyUI-IF_AI_tools
**Features:**
- Local LLM integration via Ollama
- API support (Anthropic, OpenAI, Google Gemini, Groq, xAI)
- Image-to-prompt generation
- Prompt style templates (cinematic, anime, product, etc.)
- OCR-RAG capabilities
- Character assistant creation
**Installation:**
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/if-ai/ComfyUI-IF_AI_tools.git
cd ComfyUI-IF_AI_tools
pip install -r requirements.txt
```
**Local Setup with Ollama:**
```bash
# Install Ollama (official Linux install script)
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server if the installer did not register it as a service
ollama serve &
# Pull recommended models (requires the server to be running)
ollama pull llama3.1
ollama pull gemma2
ollama pull mistral
# For Qwen-specific prompting
ollama pull qwen2.5
```
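
Once the server is running, a prompt can also be enhanced outside ComfyUI through Ollama's local HTTP API. The sketch below assumes the default endpoint on port 11434 and the non-streaming `/api/generate` route. The instruction wording and the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_payload(concept, model="qwen2.5"):
    """Build a non-streaming /api/generate request body."""
    instruction = (
        "Rewrite the following idea as one detailed image-generation prompt. "
        "Return only the prompt.\n\nIdea: " + concept
    )
    return {"model": model, "prompt": instruction, "stream": False}

def enhance_prompt(concept, model="qwen2.5", url=OLLAMA_URL, timeout=120):
    """Send the request to a running Ollama server and return its text reply."""
    data = json.dumps(build_payload(concept, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

With the server up and the model pulled, `enhance_prompt("woman in coffee shop")` returns the model's expanded prompt as a string. The payload builder is kept separate so it can be inspected without a network call.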
**Workflow Integration:**
- Node: "IF Prompt to Prompt"
- Input: Simple concept or subject
- Output: Enhanced, detailed prompt
- Supports multiple style presets
#### 2. ComfyUI-LLM-Prompt-Enhancer
**Repository:** https://github.com/pinkpixel-dev/comfyui-llm-prompt-enhancer
**Features:**
- 50+ artistic style templates
- GPT-4, Claude, Gemini, or local LLM support
- Compatible with all Flux/SD/Qwen models
- Style categories:
  - Digital: 3D render, concept art, pixel art
  - Fine Art: Oil painting, watercolor, impressionist
  - Photography: Portrait, landscape, macro
  - Contemporary: Cyberpunk, steampunk, vaporwave
**Local LLM Support:**
- OpenRouter integration
- Ollama local models
- Custom LLM endpoints
#### 3. LLM-Prompt-Enhancer-for-ComfyUI
**Repository:** https://github.com/S1ndeep/LLM-Prompt-Enhancer-for-ComfyUI
**Features:**
- Specifically designed for local GGUF models
- Uses llama-cpp-python
- Supports Mistral-7B-Instruct-v0.3.Q4_K_M
- Customizable instruction templates
- Max token control
**Model Setup:**
```
ComfyUI/models/llm_models/
└── Mistral-7B-Instruct-v0.3.Q4_K_M.gguf
```
**Configuration:**
- Input Text: Initial prompt
- Model Selection: Choose GGUF model
- Max Tokens: 128-512 recommended
- Instruction Template: Custom or preset
#### 4. ComfyUI-OpenAINode
**Repository:** https://github.com/Electrofried/ComfyUI-OpenAINode
**Features:**
- Local LM Studio integration
- OpenAI API compatibility
- Custom system prompts
- Seed control for reproducibility
**Recommended Model:**
- Electrofried/Promptmaster-Mistral-7B-GGUF
- Specialized for image prompt generation
- Warning: Can add unexpected NSFW elements (use with caution)
**LM Studio Setup:**
1. Install LM Studio
2. Download Promptmaster model
3. Enable CPU mode if using same machine as SD
4. Enable CORS for remote requests
### Prompt Enhancement Workflow Example
**Basic Workflow:**
```
User Input → LLM Enhancer Node → Enhanced Prompt → CLIP Text Encoder → Generation
```
**Advanced Workflow with Qwen:**
```
User Concept
↓
LLM Style Enhancer (Ollama/Gemini)
↓
Structured Prompt
↓
Qwen2.5-VL Text Encoder
↓
Qwen-Image Generation
↓
Optional: Image-to-Prompt Analysis (for iteration)
```
### Best Practices for Local Prompt Enhancement
#### Hardware Recommendations:
**Same Machine:**
- LLM in CPU-only mode (save GPU for image generation)
- Minimum 32GB RAM recommended
- SSD for fast model loading
**Separate Machine:**
- Run LM Studio/Ollama on second PC with GPU
- Network connection via IP address
- Enable CORS in server settings
- Significantly faster prompt generation
#### Model Selection:
**For Speed:**
- Gemma 2 (9B): Fast, efficient
- Mistral (7B): Good balance
- Qwen 2.5 (7B): Native Qwen understanding
**For Quality:**
- Llama 3.1 (70B): Excellent descriptions
- Mixtral (8x7B): Diverse, creative
- Command-R (35B): Structured outputs
#### Prompt Enhancement Tips:
1. Start with clear subject and intent
2. Let LLM add technical details
3. Review and refine LLM output
4. Save successful prompts for future reference
5. Build a library of style templates
6. Use negative prompt enhancement separately
7. Consider context length (Qwen supports 32K, use it!)
### Integration Tips for Article:
1. **Installation Steps:** Detailed for each option
2. **Workflow Examples:** Visual ComfyUI workflow images
3. **Comparison Table:** Features, speed, quality comparison
4. **Code Snippets:** Configuration examples
5. **Troubleshooting:** Common issues and solutions
6. **Performance Benchmarks:** Local LLM vs API speeds
---
## 9. Key Advantages of Qwen-Image/Edit
### Technical Advantages:
1. **Open Source & Free:** Apache 2.0 license, no usage restrictions
2. **GGUF Support:** Runs on 8GB VRAM with Q2_K quantization
3. **Lightning Acceleration:** 50-80% faster with minimal quality loss
4. **Massive Context:** 32K tokens vs 75 in SD CLIP
5. **Native Text Rendering:** No post-processing overlays needed
6. **Multilingual Excellence:** True bilingual support (not translation)
### Practical Advantages:
1. **ComfyUI Native Support:** Official workflows and templates
2. **Active Development:** Monthly updates (Edit-2509)
3. **Strong Community:** Multiple GGUF conversions, LoRA support
4. **Versatile:** T2I, I2I, editing, style transfer in one model
5. **Professional Quality:** SOTA benchmark performance
### Use Case Advantages:
1. **Content Creation:** Posters, thumbnails, social media
2. **E-commerce:** Product photos, variations, backgrounds
3. **Localization:** Multilingual materials without redesign
4. **Rapid Prototyping:** Lightning models for fast iteration
5. **Professional Editing:** Precise modifications with consistency
---
## 10. Important Notes for Article
### Critical Points to Cover:
1. **GGUF vs FP8:** Clear explanation of when to use each
2. **Text Encoder Compatibility:** Cannot mix formats
3. **VRAM Requirements:** Realistic expectations per quantization
4. **Lightning Benefits:** Speed vs quality tradeoffs
5. **Prompt Structure:** Specific to Qwen's capabilities
6. **Multi-Image Editing:** New feature in 2509
### Common Pitfalls to Address:
1. Using FP8 scaled encoder with GGUF models (causes errors)
2. Not installing ComfyUI-GGUF custom node
3. Incorrect mmproj file naming or location
4. Grid artifacts with wrong Lightning LoRA version
5. Insufficient VRAM for chosen quantization
6. Not updating ComfyUI for latest features
### Workflow Demonstrations:
1. **Basic T2I:** Simple prompt to image
2. **Text Rendering:** Complex multilingual poster
3. **Image Editing:** Semantic modification
4. **Lightning Speed:** 4-step vs 50-step comparison
5. **GGUF Low VRAM:** Running on 8GB GPU
6. **Multi-Image Merge:** 2509 feature showcase
### Resources Section:
- GitHub repositories (official and community)
- Hugging Face model pages
- ComfyUI documentation
- Video tutorials
- Discord communities
- Technical papers
---
## Research Sources:
- Qwen-Image Technical Report (arXiv:2508.02324)
- Official Qwen Blog (qwenlm.github.io)
- GitHub: QwenLM/Qwen-Image
- GitHub: ModelTC/Qwen-Image-Lightning
- ComfyUI Official Documentation
- Hugging Face Model Cards
- Community tutorials and guides
- Benchmark papers and comparisons
---
**Document Version:** 1.0 (October 22, 2025)
---
## 11. Detailed ComfyUI Workflow Examples
### Example 1: Basic Text-to-Image Workflow
#### Workflow Structure:
```
Load Checkpoint (Qwen-Image)
↓
Qwen2.5-VL Text Encoder
↓
Empty Latent Image (select aspect ratio)
↓
KSampler
↓
VAE Decode
↓
Save Image
```
#### Node Configuration:
**Load Checkpoint Node:**
- Model: `qwen_image_fp8_e4m3fn.safetensors` or GGUF variant
- If using GGUF: Use "UnetLoaderGGUF" node instead
**Qwen Text Encoder Node:**
- Text Encoder: `qwen_2.5_vl_7b_fp8_scaled.safetensors`
- Prompt: "A cozy coffee shop with a neon sign reading 'Welcome 欢迎', wooden tables, warm lighting, photorealistic"
- Max Tokens: 512 (or higher for complex prompts)
**Empty Latent Image:**
- Width: 1024
- Height: 1024
- Batch Size: 1
- Note: Qwen supports 7 aspect ratios - select from dropdown
**KSampler:**
- Steps: 50 (or 8/4 with Lightning LoRA)
- CFG: 4.5
- Sampler: euler
- Scheduler: simple
- Seed: -1 (random) or fixed for reproducibility
- Denoise: 1.0
**VAE Decode:**
- VAE: `qwen_image_vae.safetensors`
- Samples: Connected from KSampler
### Example 2: Lightning-Accelerated Workflow
#### Additional Node:
**LoRA Loader:**
- LoRA: `Qwen-Image-Lightning-8steps-V2.0.safetensors`
- Strength Model: 1.0
- Strength CLIP: 1.0
- Place AFTER Load Checkpoint node
#### Modified KSampler Settings:
- Steps: 8 (for 8-step LoRA) or 4 (for 4-step LoRA)
- CFG: 4.5 (8-step) or 1.5 (4-step)
- Everything else remains the same
**Expected Results:**
- 8-step: 50-60% faster than base model
- 4-step: 75-80% faster than base model
- Quality: 90-95% comparable to 50-step generation
### Example 3: Image Editing Workflow
#### Workflow Structure:
```
Load Image
↓
Qwen-Image-Edit Checkpoint
↓
Qwen2.5-VL Text Encoder (with image input)
↓
VAE Encode (for image input)
↓
KSampler (img2img mode)
↓
VAE Decode
↓
Save Image
```
#### Node Configuration:
**Load Image:**
- Upload source image
- Connect to both VAE Encoder and Text Encoder
**Load Checkpoint:**
- Model: `qwen_image_edit_2509_fp8.safetensors` (latest version)
**Qwen Text Encoder (with Image):**
- Text Prompt: "Change the shirt color to red while keeping everything else the same"
- Image Input: Connected from Load Image
- This leverages the dual-encoding (semantic + appearance)
**VAE Encode:**
- Image: Connected from Load Image
- VAE: `qwen_image_vae.safetensors`
**KSampler:**
- Steps: 50 (or 8 with Lightning)
- CFG: 5.0-7.0 (higher for precise editing)
- Denoise: 0.7-0.85 as a starting range (controls edit strength)
  - 0.7: Subtle changes
  - 0.85: Moderate changes
  - 0.95: Strong changes
- Latent Image: Connected from VAE Encode
### Example 4: Multi-Image Editing (Edit-2509 Feature)
#### Workflow Structure:
```
Load Image 1 (Person)
↓
Load Image 2 (Product/Scene)
↓
Load Image 3 (Optional)
↓
Qwen-Image-Edit-2509 Checkpoint
↓
Multi-Image Text Encoder
↓
Image Combiner Node
↓
KSampler
↓
VAE Decode
↓
Save Image
```
#### Use Cases:
1. **Person + Product:** "Show the person holding this product"
2. **Person + Scene:** "Place the person in this environment"
3. **Person + Person:** "Transfer the outfit from person 2 to person 1"
4. **Style Transfer:** "Apply the artistic style from image 2 to image 1"
#### Node Configuration:
**Multi-Image Text Encoder:**
- Primary Image: Main subject
- Secondary Image: Reference for transfer
- Tertiary Image: (optional) Additional context
- Prompt: "Apply the jacket from image 2 to the person in image 1, maintaining facial features and pose"
**KSampler:**
- Denoise: 0.75-0.85 (balanced for multi-image)
- CFG: 6.0-7.0 (higher for accuracy)
### Example 5: GGUF Low-VRAM Workflow (8GB GPU)
#### Critical Differences:
**UnetLoaderGGUF Node:**
- Model Path: `models/unet/qwen-image-Q4_K_M.gguf`
- This replaces standard checkpoint loader
**DualCLIPLoader Node (from ComfyUI-GGUF):**
- CLIP 1: `models/text_encoders/qwen/Qwen2.5-VL-7B-Instruct-Q4_0.gguf`
- CLIP 2: Empty (not used for Qwen)
- MMProj: `models/text_encoders/qwen/Qwen2.5-VL-7B-Instruct-mmproj-BF16`
**Performance Expectations:**
- Q4_K_M: Runs on 12GB VRAM
- Q3_K_M: Runs on 10GB VRAM
- Q2_K: Runs on 8GB VRAM (slight quality reduction)
- Generation time: 15-25 minutes for 50 steps
- Lightning 4-step: 4-6 minutes on 8GB
### Example 6: Prompt Enhancement Integration
#### Enhanced Workflow:
```
User Input Text
↓
IF Prompt to Prompt Node (Ollama)
↓
Enhanced Prompt Output
↓
Qwen2.5-VL Text Encoder
↓
[Standard generation pipeline]
```
#### IF Prompt to Prompt Configuration:
- Model: ollama/qwen2.5:7b
- Style Preset: "cinematic photography"
- Input: "woman in coffee shop"
- Output: "Professional DSLR photograph of an elegant woman in a cozy artisan coffee shop, warm ambient lighting filtering through large windows, shallow depth of field with bokeh background, rich brown tones, rustic wooden furniture, steam rising from coffee cup, golden hour lighting, cinematic composition, 50mm lens, f/1.8"
#### Benefits:
- Consistent style application
- Technical photography terms
- Optimal prompt structure for Qwen
- Reproducible results with templates
---
## 12. Advanced Techniques & Tips
### Aspect Ratio Selection Strategy
Qwen-Image supports 7 aspect ratios optimally:
**1:1 (1024x1024):**
- Social media posts
- Profile pictures
- Product shots
- Icons and logos
**16:9 (1344x768):**
- Landscape photography
- Desktop wallpapers
- YouTube thumbnails
- Cinematic scenes
**9:16 (768x1344):**
- Mobile wallpapers
- Instagram stories
- TikTok content
- Portrait photography
**4:3 (1152x896):**
- Traditional photography
- Classic compositions
- Presentations
**3:4 (896x1152):**
- Portrait orientation
- Book covers
- Posters
**3:2 (1216x832):**
- DSLR camera standard
- Professional photography
- Prints
**2:3 (832x1216):**
- Portrait standard
- Magazine covers
- Fashion photography
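
When a target size comes from elsewhere, such as a 1920x1080 thumbnail spec, it helps to map it onto the nearest listed ratio. This is a minimal sketch using the resolutions from this section. The helper name `closest_aspect` is hypothetical.

```python
import math

# Resolutions as listed in this section.
RESOLUTIONS = {
    "1:1": (1024, 1024), "16:9": (1344, 768), "9:16": (768, 1344),
    "4:3": (1152, 896), "3:4": (896, 1152), "3:2": (1216, 832), "2:3": (832, 1216),
}

def closest_aspect(width, height):
    """Return (label, (w, h)) whose ratio is nearest to width/height in log space."""
    target = math.log(width / height)

    def distance(label):
        w, h = RESOLUTIONS[label]
        return abs(math.log(w / h) - target)

    label = min(RESOLUTIONS, key=distance)
    return label, RESOLUTIONS[label]
```

Comparing ratios in log space treats landscape and portrait symmetrically, so a 1080x1920 request maps to 9:16 by the same margin that 1920x1080 maps to 16:9.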
### Seed Management for Iteration
**Strategy 1: Fixed Seed Exploration**
```
Base Prompt + Fixed Seed (12345) → Generate
Modify CFG only (3.0, 4.5, 7.0, 10.0) → Compare results
Select best CFG → Fix it
Modify Steps (20, 30, 50, 70) → Compare results
```
**Strategy 2: Prompt Iteration**
```
Prompt v1 + Random Seed → Generate 4 variations
Select best seed → Fix it
Refine prompt → Test with same seed
Compare old vs new with identical parameters
```
**Strategy 3: Batch Variation**
```
Final prompt + CFG 4.5 + Steps 50
Generate with seeds: 1, 2, 3, 4, 5, 6, 7, 8
Select best 2-3 results
Slight prompt tweaks for final generation
```
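
The three strategies amount to generating lists of run settings where only one parameter changes at a time. The sketch below builds such run plans as plain dictionaries. The function names are hypothetical, and feeding the plans to ComfyUI is left out.

```python
def cfg_sweep(seed, cfg_values, steps=50):
    """Strategy 1, first pass: vary CFG only, holding seed and steps fixed."""
    return [{"seed": seed, "cfg": cfg, "steps": steps} for cfg in cfg_values]

def steps_sweep(seed, cfg, step_values):
    """Strategy 1, second pass: vary steps with the chosen CFG locked in."""
    return [{"seed": seed, "cfg": cfg, "steps": s} for s in step_values]

def seed_batch(cfg, steps, seeds):
    """Strategy 3: same settings across several seeds."""
    return [{"seed": s, "cfg": cfg, "steps": steps} for s in seeds]

runs = cfg_sweep(12345, [3.0, 4.5, 7.0, 10.0])
runs += steps_sweep(12345, 4.5, [20, 30, 50, 70])
runs += seed_batch(4.5, 50, range(1, 9))
```

Keeping each sweep in its own function makes it obvious which parameter differs between two images when comparing results.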
### CFG Scale Guidance
**CFG 1.0-2.5:** (Lightning 4-step recommended range)
- Very loose interpretation
- More creative/artistic freedom
- Can drift from prompt
- Faster convergence
- Best for: Abstract art, creative exploration
**CFG 3.0-4.0:**
- Balanced creativity and adherence
- Natural-looking results
- Good for: General photography, scenes
**CFG 4.5-5.5:** (Sweet spot for most use cases)
- Strong prompt adherence
- Maintains realism
- Good for: Text rendering, specific compositions
**CFG 6.0-8.0:**
- Very strong adherence
- Precise control
- Good for: Image editing, product photography
**CFG 9.0-12.0:**
- Maximum adherence
- Can look "over-processed"
- Risk of artifacts
- Use only when: Extremely specific requirements
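
The ranges above make more sense next to the standard classifier-free guidance formula, which extrapolates from the unconditional prediction toward the conditional one. The sketch below shows that formula on plain lists. It is the textbook form and an assumption about what a given sampler does internally, and the numbers are made up.

```python
def apply_cfg(cond, uncond, cfg_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

# Hypothetical two-element predictions, just to show the effect of the scale.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]

for scale in (1.0, 4.5, 9.0):
    print(scale, apply_cfg(cond, uncond, scale))
```

At a scale of 1.0 the result equals the conditional prediction, so the negative prompt has no influence. Larger scales push further away from the unconditional prediction, which is consistent with the "over-processed" look reported at the top of the range.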
### Text Rendering Optimization
**Font Consistency:**
- Specify exact font if critical: "in Helvetica Bold"
- Use generic terms: "handwritten font", "serif font", "modern sans-serif"
- Multiple text elements: Describe each separately
**Text Placement:**
- Be specific: "centered on the top", "bottom-right corner", "along the curved path"
- Relative positioning: "above the doorway", "below the title"
**Text Styling:**
- Color specification: "bright red", "metallic gold", "neon blue"
- Effects: "with drop shadow", "embossed", "glowing effect"
- Size relation: "large title text", "small caption"
**Complex Layouts:**
```
Example prompt:
"Magazine cover design with large title 'FUTURE' in bold red letters at the top,
subtitle 'The Next Decade' in smaller white text below, issue number '12/2025'
in the top-right corner, barcode at bottom-left, all on a futuristic cityscape
background, professional typography, clean layout"
```
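
Layout prompts like the one above follow a repeatable pattern: each text element gets a role, the quoted text, a style, and a position. The sketch below assembles that pattern. The field names and helper functions are hypothetical, and double quotes are used around the literal text, as recommended earlier for text rendering.

```python
def text_element(text, role=None, style=None, position=None):
    """Describe one text element; the literal text goes inside quotes."""
    quote = "'" if '"' in text else '"'
    pieces = [role, f"{quote}{text}{quote}"]
    if style:
        pieces.append(f"in {style}")
    if position:
        pieces.append(position)
    return " ".join(p for p in pieces if p)

def layout_prompt(base, elements, suffix=""):
    """Combine a base description, text elements, and an optional suffix."""
    parts = [base] + [text_element(**e) for e in elements]
    if suffix:
        parts.append(suffix)
    return ", ".join(parts)

prompt = layout_prompt(
    "Magazine cover design",
    [
        {"text": "FUTURE", "role": "large title",
         "style": "bold red letters", "position": "at the top"},
        {"text": "The Next Decade", "role": "subtitle",
         "style": "smaller white text", "position": "below the title"},
    ],
    suffix="professional typography, clean layout",
)
```

Describing each element separately keeps the text, style, and placement of one element from bleeding into the next.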
### Multilingual Text Best Practices
**English + Chinese:**
```
"Bilingual storefront sign, top line reads 'Dragon Restaurant' in elegant serif font,
bottom line shows '龙餐厅' in traditional Chinese calligraphy, red and gold colors,
hanging lanterns on both sides"
```
**Seamless Integration:**
- Don't separate languages with slashes or parentheses
- Write naturally: "Cafe sign reading 'Coffee 咖啡 Tea 茶'"
- Context helps: "Japanese anime poster with title 'Adventure アドベンチャー'"
### Color and Lighting Control
**Color Harmony:**
- Complementary: "blue and orange tones"
- Analogous: "warm sunset colors - red, orange, yellow"
- Monochromatic: "various shades of blue"
- Specific palette: "pastel pink, mint green, soft lavender"
**Lighting Scenarios:**
- Time of day: "golden hour", "blue hour", "midday sun", "midnight"
- Quality: "soft diffused", "harsh direct", "dramatic side lighting"
- Direction: "backlit", "front-lit", "rim lighting", "three-point lighting"
- Mood: "moody shadows", "bright and airy", "mysterious dimness"
**Advanced Lighting:**
```
"Portrait with Rembrandt lighting, key light at 45 degrees creating triangle
highlight on shadow-side cheek, subtle fill light, rim light separating
subject from background, dramatic contrast, cinematic mood"
```
### Composition Techniques
**Rule of Thirds:**
- "subject positioned at intersection of thirds"
- "horizon line at lower third"
**Leading Lines:**
- "road leading toward distant mountains"
- "staircase drawing eye upward"
**Framing:**
- "viewed through doorway"
- "framed by tree branches"
- "architectural archway framing the scene"
**Depth Layers:**
- "foreground with flowers, middle-ground subject, background mountains"
- "shallow depth of field, sharp subject, blurred background"
**Symmetry:**
- "perfectly symmetrical composition"
- "reflective symmetry in water"
- "architectural symmetry"
### Style Mixing and Fusion
**Hybrid Styles:**
```
"Cyberpunk anime aesthetic meets traditional Japanese woodblock print,
neon colors with Ukiyo-e composition, futuristic cityscape in Hokusai wave style"
```
**Era Blending:**
```
"1920s Art Deco poster design with modern minimalist elements,
geometric patterns, gold and black color scheme, clean contemporary typography"
```
**Medium Fusion:**
```
"3D rendered scene with hand-painted watercolor textures,
digital art with traditional media feel, vibrant colors, painterly brushstrokes"
```
### Negative Prompt Strategy
**Keep Minimal:**
Qwen-Image works best with positive descriptions. Only use negative prompts for persistent issues:
**Common Negatives:**
- "blurry, out of focus, low quality, pixelated"
- "distorted text, illegible letters"
- "watermark, signature, username"
- "duplicate, multiple of same object"
**Don't Overload Negatives:**
- Avoid: "no red, no blue, no clouds, no trees..."
- Better: Describe what you DO want in positive terms
---
## 13. Troubleshooting Guide
### Issue 1: GGUF Model Won't Load
**Symptoms:**
- Error: "Unable to load GGUF file"
- ComfyUI crashes on model load
**Solutions:**
1. Install ComfyUI-GGUF custom node:
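```bash
# Run from the ComfyUI root directory, then restart ComfyUI
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install --upgrade gguf  # Python dependency used by the node
```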
2. Verify file integrity (re-download if corrupted)
3. Check file placement: `models/unet/` or `models/diffusion_models/`
4. Use UnetLoaderGGUF node, not standard checkpoint loader
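For the file-integrity step, a quick first check is whether the download is a GGUF file at all: GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below reads only those eight bytes and assumes the common little-endian layout. The function name `inspect_gguf` is hypothetical, and the check only catches gross failures such as an HTML error page saved under a `.gguf` name; it cannot detect corruption later in the file, for which comparing a checksum against the one published on the download page is the more reliable test.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def inspect_gguf(path: str) -> dict:
    """Read the first 8 bytes and report whether the file looks like GGUF."""
    p = Path(path)
    with p.open("rb") as fh:
        header = fh.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return {"path": str(p), "is_gguf": False, "version": None}
    # Assumption: little-endian uint32 version field right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return {"path": str(p), "is_gguf": True, "version": version}
```

Calling `inspect_gguf()` on the downloaded model path and seeing `is_gguf: False` is a strong hint to re-download before debugging the node setup.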
### Issue 2: Text Encoder Compatibility Error
**Symptoms:**
- Error: "Incompatible text encoder format"
- Generation fails at encoding stage
**Solutions:**
1. **Never mix formats:**
- GGUF diffusion model → Use GGUF text encoder
- FP8 diffusion model → Use FP8 scaled text encoder
2. Verify mmproj file present for GGUF encoders
3. Check both files in same directory
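4. Use the GGUF CLIP loader node (`CLIPLoader (GGUF)`) for GGUF encoders and the standard CLIP loader for FP8; Qwen-Image uses a single text encoder, so a dual loader is not needed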
### Issue 3: Grid Artifacts with Lightning
**Symptoms:**
- Visible grid pattern in generated images
- Checkerboard artifacts
**Solutions:**
1. Use Lightning V2.0 LoRA (fixes FP8 base compatibility)
2. If using V1.0, reduce LoRA strength to 0.8
3. Ensure base model is FP8, not BF16
4. Check CFG scale (reduce if artifacts persist)
### Issue 4: Out of Memory (VRAM)
**Symptoms:**
- CUDA out of memory error
- ComfyUI crashes during generation
**Solutions:**
1. **Choose appropriate quantization:**
- 8GB: Q2_K or Q3_K_S
- 12GB: Q4_K_M or Q4_K_S
- 16GB: Q5_K_M or FP8
- 24GB: Q8_0 or Full precision
2. **Enable optimizations:**
- Lower resolution temporarily
- Reduce batch size to 1
- Close other GPU applications
- Enable CPU offloading (slower but works)
3. **Use Lightning LoRA:**
- 4-step generation finishes each run much sooner; peak VRAM is set mainly by model size and resolution, not step count
- Faster iteration allows lower resolution testing
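The quantization tiers above translate directly into a lookup. A minimal sketch, where `pick_quant` is a hypothetical helper and the thresholds are copied from the list rather than measured:

```python
# Tiers copied from the list above: (minimum VRAM in GB, suggested formats).
# Ordered from largest to smallest so the first match is the best fit.
QUANT_TIERS = [
    (24, ["Q8_0", "full precision"]),
    (16, ["Q5_K_M", "FP8"]),
    (12, ["Q4_K_M", "Q4_K_S"]),
    (8, ["Q2_K", "Q3_K_S"]),
]


def pick_quant(vram_gb: float) -> list[str]:
    """Return the suggested formats for the largest tier that fits."""
    for min_gb, formats in QUANT_TIERS:
        if vram_gb >= min_gb:
            return formats
    return []  # below 8 GB: none of the listed tiers applies


if __name__ == "__main__":
    for gb in (6, 8, 12, 16, 24):
        print(f"{gb} GB -> {pick_quant(gb) or 'no listed tier'}")
```

A card that sits between tiers (for example 10 GB) falls back to the tier below it, which leaves headroom for the text encoder and VAE.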
### Issue 5: Poor Text Rendering Quality
**Symptoms:**
- Blurry or illegible text
- Wrong characters or gibberish
**Solutions:**
1. **Prompt improvements:**
- Use quotes around exact text: "Welcome"
- Specify font style explicitly
- Mention text placement clearly
- Add "sharp, legible, clear typography"
2. **Parameter adjustments:**
- Increase CFG to 5.5-7.0
- Use the full 50-70 steps; avoid Lightning when text accuracy is critical
- Higher resolution helps (1344x768 or 1024x1024)
3. **Model selection:**
- Use Q5_K_M or higher for text-heavy work
- Q2_K/Q3_K may struggle with fine details
- FP8 or full precision best for professional text
### Issue 6: Slow Generation Times
**Symptoms:**
- 20+ minutes per image
- System becomes unresponsive
**Solutions:**
1. **Use Lightning acceleration:**
- 4-step LoRA: 70-80% faster
- 8-step LoRA: 50-60% faster
- Minimal quality loss
2. **Hardware optimization:**
- Update GPU drivers
- Enable CUDA optimizations in ComfyUI
- Close background applications
- Monitor GPU temperature (thermal throttling)
3. **Workflow efficiency:**
- Generate at lower resolution first
- Test with 20-30 steps initially
- Upscale final image if needed
- Use fixed seed for parameter testing
### Issue 7: Inconsistent Editing Results
**Symptoms:**
- Edit changes too much of the image
- Original elements not preserved
**Solutions:**
1. **Denoise adjustment:**
- Too high (>0.9): Changes everything
- Too low (<0.5): Minimal changes
- Sweet spot: 0.7-0.85
2. **Prompt specificity:**
- Bad: "Make it better"
- Good: "Change only the shirt color to red, keep face and background unchanged"
3. **CFG for editing:**
- Use higher CFG (6.0-8.0)
- Provides stronger guidance
- Better preservation of specified elements
4. **Use Edit-2509:**
- Latest model has improved consistency
- Better semantic understanding
- Enhanced identity preservation
### Issue 8: mmproj File Not Found (GGUF)
**Symptoms:**
- Error: "Cannot find mmproj file"
- Text encoder fails to load
**Solutions:**
1. **Verify file naming:**
- The mmproj name should share the encoder's base name: `Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf`
- Keep the `.gguf` extension on the mmproj file; do not rename or strip it
2. **Directory structure:**
```
models/text_encoders/qwen/
├── Qwen2.5-VL-7B-Instruct-Q4_0.gguf
└── Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf
```
3. **Download both components:**
- Main encoder (.gguf)
- MMProj file (separate download)
- Both from same Hugging Face repo
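Before reloading the workflow, it can save time to confirm that both components actually sit in the same folder. The sketch below is a plain directory scan. `check_encoder_dir` is a hypothetical helper, and matching on the substring `mmproj` is an assumption about file naming rather than the loader's exact lookup rule.

```python
from pathlib import Path


def check_encoder_dir(directory: str) -> list[str]:
    """Return a list of problems; an empty list means both files were found."""
    folder = Path(directory)
    if not folder.is_dir():
        return [f"directory not found: {folder}"]
    names = [p.name for p in folder.iterdir() if p.is_file()]
    # Assumption: the mmproj file has "mmproj" somewhere in its name.
    mmproj = [n for n in names if "mmproj" in n.lower()]
    encoders = [
        n for n in names
        if n.lower().endswith(".gguf") and "mmproj" not in n.lower()
    ]
    problems = []
    if not encoders:
        problems.append("no main encoder (.gguf) found")
    if not mmproj:
        problems.append("no mmproj file found")
    return problems


if __name__ == "__main__":
    # Path taken from the directory structure shown above.
    print(check_encoder_dir("models/text_encoders/qwen") or "both files present")
```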
---
## 14. Performance Benchmarks & Comparisons
### Generation Speed Comparison (RTX 4090, 50 steps, 1024x1024)
| Configuration | Time | Quality (vs. full precision) | VRAM |
|--------------|------|---------|------|
| Full Precision | 8.5 min | 100% | 24GB |
| FP8 | 6.2 min | 98% | 16GB |
| Q8_0 GGUF | 7.8 min | 97% | 21GB |
| Q5_K_M GGUF | 9.5 min | 93% | 15GB |
| Q4_K_M GGUF | 11.2 min | 90% | 12GB |
| Q3_K_M GGUF | 14.5 min | 85% | 10GB |
| Q2_K GGUF | 18.3 min | 78% | 8GB |
### Lightning Acceleration (FP8 Base, RTX 4090)
| Steps | LoRA | Time | Quality vs 50-step |
|-------|------|------|--------------------|
| 50 | None | 6.2 min | 100% (baseline) |
| 8 | V2.0 | 2.8 min | 95% |
| 4 | V2.0 | 1.5 min | 90% |
| 8 | V1.0 | 2.9 min | 92% (over-saturated) |
| 4 | V1.0 | 1.6 min | 87% (over-saturated) |
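The percentage of time saved follows directly from the table: it is one minus the ratio of the accelerated time to the 50-step baseline. A short worked calculation using only the figures above (the helper name `time_saved_pct` is just for illustration):

```python
BASELINE_MIN = 6.2  # 50 steps, no LoRA, from the table above

# Times in minutes, copied from the table.
RUNS = {
    "8-step V2.0": 2.8,
    "4-step V2.0": 1.5,
    "8-step V1.0": 2.9,
    "4-step V1.0": 1.6,
}


def time_saved_pct(minutes: float, baseline: float = BASELINE_MIN) -> float:
    """Percentage reduction in wall-clock time relative to the baseline."""
    return round((1 - minutes / baseline) * 100, 1)


if __name__ == "__main__":
    for name, minutes in RUNS.items():
        print(f"{name}: {time_saved_pct(minutes)}% less time than 50 steps")
```

For the V2.0 LoRA this gives about 54.8% at 8 steps and 75.8% at 4 steps, in line with the 50-60% and 70-80% ranges quoted in the troubleshooting section.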
### Text Rendering Quality (ChineseWord Benchmark)
| Model | Accuracy | VRAM | Notes |
|-------|----------|------|-------|
| Full Precision | 98.5% | 24GB | Maximum quality |
| FP8 | 97.8% | 16GB | Recommended |
| Q8_0 | 96.5% | 21GB | Excellent |
| Q5_K_M | 94.2% | 15GB | Very Good |
| Q4_K_M | 91.8% | 12GB | Good for most cases |
| Q3_K_M | 87.3% | 10GB | Acceptable |
| Q2_K | 82.1% | 8GB | Usable for testing |
### Model Comparison (as of August 2025)
| Model | GenEval | Text Rendering | Editing | VRAM | Open Source |
|-------|---------|------|---------|------|-------------|
| Qwen-Image | 0.72 | SOTA | N/A | 16GB+ | ✅ Yes |
| Qwen-Image-Edit | N/A | SOTA | SOTA | 16GB+ | ✅ Yes |
| FLUX.1 [pro] | 0.68 | Good | Limited | 24GB | ❌ No |
| SD3.5 Large | 0.64 | Poor | Fair | 20GB | ✅ Yes |
| DALL-E 3 | 0.67 | Good | N/A | API | ❌ No |
| Midjourney v6 | 0.70* | Fair | Limited | API | ❌ No |
| Ideogram v2 | 0.63 | Excellent | Poor | API | ❌ No |
*Estimated based on community testing
---
## 15. Future Developments & Roadmap
### Upcoming Features (Based on Qwen Team Announcements)
**Q4 2025:**
- Qwen-Image-Edit-3.0 (enhanced consistency)
- Additional aspect ratio support
- Improved ControlNet integration
- Video generation capabilities (Qwen-Video)
**2026 Plans:**
- Real-time generation (sub-second inference)
- 3D asset generation
- Animation capabilities
- Enhanced multilingual support (50+ languages)
### Community Development
**Active Projects:**
- Additional GGUF optimizations
- Custom LoRA training guides
- Style-specific fine-tunes
- Integration with other workflows
**Feature Requests:**
- Inpainting support (selective area editing)
- Outpainting (image extension)
- Batch processing improvements
- API endpoint wrappers
---
## 16. Additional Resources
### Official Resources
- **GitHub:** https://github.com/QwenLM/Qwen-Image
- **Lightning:** https://github.com/ModelTC/Qwen-Image-Lightning
- **Blog:** https://qwen.ai/blog
- **Paper:** arXiv:2508.02324 (Qwen-Image Technical Report)
- **Models:** Hugging Face (Comfy-Org, QuantStack, city96)
### Community Resources
- **ComfyUI Forum:** Discussion threads on Qwen-Image
- **Reddit:** r/StableDiffusion, r/comfyui
- **Discord:** ComfyUI Official Server
- **YouTube:** Workflow tutorials and showcases
### Learning Materials
- Prompt engineering guides
- GGUF quantization explained
- Text rendering techniques
- Image editing workflows
- Lightning distillation deep-dive
---