Run LTX-2 on ComfyUI Locally and Generate Videos with Audio for Free
The landscape of AI video generation has fundamentally shifted in January 2026. For the first time, a production-ready, open-source model capable of generating synchronized 4K video and audio—LTX-2—is freely available to anyone with adequate hardware.
This comprehensive guide walks you through installing, configuring, and mastering LTX-2 on ComfyUI—the leading node-based AI interface—so you can create professional-grade videos with perfectly synchronized audio entirely on your own machine.
What is LTX-2? The Game-Changing Model Explained
LTX-2, developed by Lightricks and released as open-source on January 6, 2026, is fundamentally different from earlier video generation models. While predecessors like Sora and Runway Gen-3 generate silent video and require post-hoc audio synthesis (leading to sync issues), LTX-2 generates video and audio simultaneously through an asymmetric dual-stream transformer architecture.
You can fine-tune it with LoRA (Low-Rank Adaptation), customize it for specific use cases, and deploy it anywhere. There's no per-second billing, no API rate limits, and no cloud dependency.
Why ComfyUI? The Best Interface for AI Video Generation
ComfyUI is a node-based interface designed specifically for generative AI workflows. Unlike traditional command-line tools, ComfyUI's visual node-based approach makes complex AI pipelines intuitive: you build workflows by connecting nodes.
ComfyUI is particularly well-suited to LTX-2 for several reasons.
First, the custom LTXVideo nodes integrate seamlessly into ComfyUI's architecture, providing intuitive controls for resolution, frame rate, sampling steps, and guidance scales.
Second, ComfyUI's built-in support for VRAM management, model offloading, and multi-GPU inference means you can run LTX-2 even on consumer hardware with careful optimization.
Third, the community has created extensive example workflows—pre-built templates for text-to-video, image-to-video, depth-guided video generation, and more—so beginners can start creating without building workflows from scratch.
What Hardware Do You Actually Need?
This is the question everyone asks first. According to official documentation and real-world testing from January 2026, here's what you need:
Minimum Hardware Requirements:
- GPU: NVIDIA GPU with 32GB+ VRAM recommended. Testers have successfully run LTX-2 on an RTX 5070 Ti (16GB) using FP8 quantization and careful VRAM optimization, but that represents the absolute lower bound.
- System RAM: 32GB minimum
- Storage: 100GB free SSD space (50GB for models, 30GB for cache, 20GB for output)
- CUDA: Version 12.1 or higher
- Python: 3.10+ (3.12 recommended)
Recommended Configuration for Comfortable Use:
- GPU: RTX 4090 (24GB) or better, or NVIDIA A100/H100 for production
- RAM: 64GB+ system memory
- Storage: 200GB+ NVMe SSD
- Processor: Modern multi-core CPU (8+ cores)
The critical insight is that LTX-2's VRAM requirements scale with output resolution and duration. A 4-second clip at 720p on an RTX 4090 uses approximately 20-21GB VRAM, leaving headroom for the full generation pipeline. Attempting native 4K generation at longer durations pushes even the 24GB 4090 to its limits. This is where quantization becomes essential.
The Quantization Advantage
One of LTX-2's most powerful features is support for multiple quantization formats, which compress the model weights while maintaining quality. NVIDIA's integration of NVFP4 and NVFP8 formats into LTX-2—announced in early January 2026—is a game-changer for local generation.
FP8 Quantization (Recommended for Most Users):
- VRAM: ~30% less than full precision
- Speed: ~2x faster generation
- Quality impact: Minimal, imperceptible in most cases
- Best for: Users with 32GB+ VRAM wanting a balance of speed and quality
NVFP4 Quantization (Maximum Speed):
- VRAM: 60% less than full precision
- Speed: 3x faster generation
- Quality impact: Slight reduction, acceptable for many use cases
- Requirement: NVIDIA RTX 40-series or newer GPUs
- Best for: Users with limited VRAM or those prioritizing speed
For context, on an RTX 4090, NVFP4 can generate an 8-second clip at 720p in approximately 25 seconds, compared to 180+ seconds with the full precision model.
Distilled Model (8-Step Fast Generation):
- Speed: 5-6x faster than full model
- Steps: Fixed at 8 (cannot be adjusted)
- Quality: Good for testing and iteration, slightly lower than full model
- Best for: Rapid prompting iteration, A/B testing different concepts
The choice of quantization method directly impacts your usable generation resolution and duration.
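To build intuition for these trade-offs, here is a back-of-envelope, weights-only estimate in Python. It assumes the ~19B parameter count implied by the checkpoint name used later in this guide, and it ignores activations, the VAE, and the text encoder, so real VRAM usage will be noticeably higher.

```python
# Back-of-envelope, weights-only VRAM estimate for a ~19B-parameter model.
# Real usage is higher: activations, the VAE, and the Gemma text encoder
# all consume memory on top of the raw weights.
PARAMS = 19e9

bytes_per_param = {"BF16/FP16 (full precision)": 2.0, "FP8": 1.0, "NVFP4": 0.5}
for name, nbytes in bytes_per_param.items():
    print(f"{name:28s} ~{PARAMS * nbytes / 1e9:.0f} GB for weights alone")
```

Whichever format you choose, measure actual usage with nvidia-smi during a test generation before committing to long runs.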
Step-by-Step Installation: From Zero to Generating Videos
Step 1: Install ComfyUI
Begin by cloning the ComfyUI repository and setting up a Python virtual environment:
```bash
# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

# Launch ComfyUI
python main.py
```
Open your browser to http://localhost:8188 to verify installation. You should see the ComfyUI interface with a blank workflow canvas.
Step 2: Install LTX-2 Custom Nodes
The recommended method is using ComfyUI Manager, which automates the installation:
- Launch ComfyUI and press Ctrl+M (Windows/Linux) or Cmd+M (Mac)
- Click "Install Custom Nodes"
- Search for "LTXVideo"
- Find "ComfyUI-LTXVideo" by Lightricks and click Install
- Wait 2-5 minutes for installation to complete
- Restart ComfyUI
After restart, right-click in the workflow canvas and navigate to "Add Node" → "LTXVideo" to verify the nodes are available.
Step 3: Download Model Files
LTX-2 requires several model files (approximately 50GB total). Create the proper directory structure in ComfyUI:
```text
ComfyUI/
├── models/
│   ├── checkpoints/              # Main model
│   ├── text_encoders/            # Text encoder
│   └── latent_upscale_models/    # Upscalers (optional)
```
Download the FP8 quantized checkpoint (recommended):
```bash
pip install huggingface-hub
huggingface-cli download Lightricks/LTX-2 ltx-2-19b-dev-fp8 --local-dir ComfyUI/models/checkpoints/
```
Download the text encoder (Gemma 3 12B IT quantized):
```bash
huggingface-cli download Lightricks/LTX-2 gemma-3-12b-it-qat-q4_0-unquantized --local-dir ComfyUI/models/text_encoders/
```
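If you prefer scripting the downloads, huggingface_hub's snapshot_download can fetch the same files from Python. The allow_patterns values below are illustrative assumptions; check the actual filenames in the Lightricks/LTX-2 repository before running.

```python
from huggingface_hub import snapshot_download

# Main checkpoint (FP8 variant); the pattern is an assumption, match it to the repo's filenames.
snapshot_download(
    repo_id="Lightricks/LTX-2",
    allow_patterns=["*fp8*"],
    local_dir="ComfyUI/models/checkpoints",
)

# Text encoder files (Gemma 3 12B IT quantized); pattern is likewise illustrative.
snapshot_download(
    repo_id="Lightricks/LTX-2",
    allow_patterns=["*gemma-3-12b*"],
    local_dir="ComfyUI/models/text_encoders",
)
```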
Step 4: Load Example Workflows
The easiest way to start is using pre-built workflows. In ComfyUI, click "Load" → "Template Library" and select "LTX-2 Text to Video" or download example workflows from the official repository.
The main workflows include:
- Text-to-Video (Full): High-quality video from text prompts (30-50 steps)
- Text-to-Video (Distilled): Fast testing with 8-step generation
- Image-to-Video (Full): Animate still images with conditioning
- Image-to-Video (Distilled): Quick image animation tests
- Depth-Guided: Control video structure with depth maps
- Multi-Control: Advanced control with depth, pose, and canny edges
Step 5: Configure Your First Generation
Let's create a text-to-video using the loaded workflow. Key parameters:
Text Prompt (critical for audio-video quality):
```text
A serene morning in a Japanese garden during cherry blossom season.
Soft pink petals gently fall to the ground. Water in a stone fountain
creates subtle ripples. Birds sing softly in the background. Soft
natural light filters through the trees. Camera slowly pans left
to right. Ambient forest sounds with distant bird calls.
```
Notice how this prompt describes both visual AND audio elements. LTX-2 generates better synchronized audio when you explicitly describe sounds.
Key Generation Parameters:
- Frame Rate: 24 FPS (cinematic), 30 FPS (smooth), 60 FPS (ultra-smooth but requires more VRAM)
- Resolution: Start with 768x512 for testing, scale up to 1024x576 (HD) or 1280x720 (Full HD) as hardware allows
- Number of Frames: Must be divisible by 9. Start with 27 (for ~1.1 seconds at 24 FPS)
- Sampling Steps: 30-50 for full model, 8 for distilled
- CFG Scale (guidance strength): 5.0-7.0 recommended (higher = stricter prompt adherence but may reduce quality)
With these settings on an RTX 4090 using FP8, generation takes approximately 3-5 minutes for a 4-second clip at 720p.
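A tiny helper, assuming the frame-count rule stated above (frames divisible by 9), makes it easier to pick a valid frame count for a target duration; adjust the step value if your LTX-2 build enforces a different constraint.

```python
# Round a target duration to the nearest frame count allowed by the
# divisible-by-9 rule described above, and report the resulting length.
def plan_clip(target_seconds: float, fps: int = 24, step: int = 9):
    frames = max(step, round(target_seconds * fps / step) * step)
    return frames, frames / fps

for seconds in (1, 2, 4):
    frames, actual = plan_clip(seconds)
    print(f"target {seconds}s -> {frames} frames ({actual:.2f}s at 24 FPS)")
```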
Step 6: Generate and Review
Click "Queue Prompt" in the top-right corner. ComfyUI displays progress in the terminal and browser. The output video appears in the preview panel with both audio and video. Save the video by right-clicking the output node.
Advanced Features: Unlock LTX-2's Full Creative Potential
Once you've mastered basic text-to-video, explore these advanced capabilities:
Image-to-Video (I2V) Generation:
Convert still images into dynamic videos while maintaining composition and style. Load any image and provide a prompt describing desired motion:
```text
Positive Prompt: "The person in the image begins to smile,
then turns to face the camera. Subtle lighting adjustments.
Soft background music begins. Natural facial expression."
```
Using lower CFG values (3.0-5.0) preserves image consistency. This is ideal for product demos, character animation, and photo-to-motion workflows.
Depth-Guided Video Generation:
Use depth maps to control spatial structure and camera perspective. This is particularly powerful for maintaining consistent 3D geometry across generations:
- Load a reference image
- Apply "Image to Depth Map (Lotus)" preprocessing
- Connect the depth map to the guidance node
- Adjust guidance strength (0.5-1.0)
This creates cinematic camera movements while maintaining spatial coherence—useful for architectural walkthroughs and complex scene generation.
Pose-Driven Animation:
Control character movement with DWPose (DWPreprocessor). This enables frame-level control of human motion:
- Extract pose from reference video/image
- Optionally load Pose Control LoRA for enhanced accuracy
- Connect pose guidance to the generation pipeline
Dancers, action sequences, and performance captures become possible without professional motion capture equipment.
Canny Edge Control:
Use edge detection to preserve structural boundaries and architectural details:
- Apply Canny edge detection to your reference
- Adjust threshold values (low: 100, high: 200)
- Balance edge guidance with text prompt influence
Excellent for line art animation and maintaining precise object boundaries.
Spatial and Temporal Upscaling:
LTX-2 includes dedicated upscaler models to enhance quality post-generation:
- Spatial Upscaler (2x): Doubles resolution (768×512 → 1536×1024) with sharp details
- Temporal Upscaler (2x): Doubles frame rate (24 FPS → 48 FPS) for smooth motion
Chain them together for a 2-step pipeline: generate at 768×512 @ 24 FPS, then upscale to 1536×1024 @ 48 FPS. This often produces better results than attempting direct high-resolution generation.
LoRA Fine-Tuning for Consistent Style:
Train your own LoRA (Low-Rank Adaptation) weights to teach LTX-2 your specific artistic style or subject matter. Using LTX-2 Trainer with just 10-50 video clips of your target style:
- Prepare training dataset (videos or image sequences)
- Use official LTX-2 Trainer (available on GitHub)
- Training takes 1-2 hours on modern GPUs
- Load resulting LoRA weights in ComfyUI workflows
- Blend with main model (strength 0.5-1.0)
This enables consistent character appearances, branded visual styles, and subject-specific generations that would be impossible with the base model.
Audio Quality and Synchronization: The Game-Changing Feature
The synchronized audio generation sets LTX-2 apart from every other open-source model. Unlike chaining separate video and audio models (e.g., Kling for video + ElevenLabs for speech), LTX-2 generates both modalities simultaneously, ensuring perfect temporal alignment.
Audio quality depends significantly on prompt description. Explicit audio specifications yield the best results:
Excellent Prompt (with audio):
```text
A coffee shop in the morning. Espresso machine hisses and steams.
Cups clink as the barista sets them on the counter. Soft jazz
music plays in the background. Customers have hushed conversations.
The door chimes as a new customer enters. Ambient sounds of
urban morning traffic outside the window.
```
Poor Prompt (no audio description):
```text
A coffee shop scene with people.
```
The model generates synchronized:
- Dialogue and speech with proper lip synchronization
- Foley effects (mechanical sounds, impacts)
- Ambient soundscapes
- Music (though less detailed than specialized music generation models)
Audio quality is generally excellent for dialogue, good for foley and effects, and adequate for ambient sound. For music-heavy projects, you might still layer additional music post-generation, but the ambient audio typically doesn't require replacement.
Pricing and Cost Analysis
To fully appreciate LTX-2's advantages, consider the total cost of ownership for generating videos with different platforms over one year:
LTX-2 Local Setup (One-Time Investment):
- RTX 4090 GPU: $1,500-$1,800
- 64GB RAM: $300-$400
- 1TB NVMe SSD: $80-$120
- Electricity (250W GPU at $0.12/kWh under regular use): ~$15/month
- Total Year 1: ~$2,100-$2,500 (amortized over 3-5 years)
- Cost per video (1000 videos/year): $2.10-$2.50
Sora 2 Cloud API (Per-Usage Model):
- 10-second video (Standard): $1.00
- 10-second video (Pro 4K): $3.00-$5.00
- Cost for 1000 videos/year (average 10 seconds): $1,000-$5,000
- No upfront hardware investment
Runway Gen-3 (Credit-Based):
- Typical cost: $0.05-$0.10 per second of video
- 10-second video: $0.50-$1.00
- Cost for 1000 videos/year: $500-$1,000
Pika 2.1 Subscription:
- Professional Plan: $79/month = $948/year
- Includes 50 videos/month (600/year) at up to 10 minutes each
- Additional videos require higher-tier subscription
- Effective cost: Minimum $948/year for modest production
Breakeven Analysis:
- LTX-2 breaks even with Pika at approximately 400 videos/year
- LTX-2 breaks even with Sora 2 at approximately 100 videos/year
- For content creators and agencies generating 50+ videos monthly, LTX-2 ROI occurs within 3-4 months
The financial advantage is undeniable for any serious creator. Even casual hobbyists generating 100 videos annually benefit from LTX-2's zero-per-video cost structure.
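The arithmetic behind these figures is simple enough to sanity-check yourself. The sketch below uses midpoints of the estimates above, so treat the outputs as rough guides rather than exact costs.

```python
# Rough Year-1 cost check using midpoints of the figures quoted above.
gpu, ram, ssd = 1650, 350, 100      # hardware midpoints (RTX 4090, 64GB RAM, 1TB NVMe)
electricity = 15 * 12               # ~$15/month
videos_per_year = 1000

year1_total = gpu + ram + ssd + electricity
print(f"Year 1 total: ${year1_total}")
print(f"Cost per video: ${year1_total / videos_per_year:.2f}")

# Cloud comparison at the per-video prices quoted above
for name, price in {"Sora 2 (Standard)": 1.00, "Runway Gen-3": 0.75}.items():
    print(f"{name}: ${price * videos_per_year:,.0f} for {videos_per_year} videos")
```

Amortizing the hardware over several years, as the summary above does, pushes the local cost per video down further.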
Advanced Optimization Techniques
For professionals and studios, extracting maximum efficiency is essential:
Multi-GPU Parallelization:
With 2+ NVIDIA GPUs, you can distribute LTX-2's inference across devices:
```bash
python main.py --multi-gpu --gpu-ids 0,1
```
Expected improvements:
- 2 GPUs: ~1.7x speed (not perfect 2x due to synchronization overhead)
- 4 GPUs: ~3x speed
- Enables higher resolutions on systems that would otherwise be VRAM-bottlenecked on a single GPU
Workflow Optimization Patterns:
- Test with distilled model first: Confirm your prompt and parameters work before committing to 50-step full model runs
- Use progressive resolution: Generate at 768×512, then upscale, rather than attempting direct 1280×720
- Batch processing: Queue multiple prompts (see the API sketch after this list). Models stay loaded in VRAM between generations, avoiding reload overhead
- Tiled VAE decoding: For VRAM-constrained systems, add a "Tiled VAE Decode" node with a 512×512 tile size and 64px overlap. Result: ~50% VRAM reduction at a 15-20% speed cost
- Cache text encodings: For variations of the same prompt, add a "Save Text Encoding" node and reuse the encoded embeddings across generations, avoiding re-encoding with the 12B Gemma text encoder
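For the batch-processing pattern, a short loop over ComfyUI's HTTP API can queue several prompt variations in one go. This is a sketch only: the node ID "6" is a placeholder, so look up the ID of your positive-prompt text node in the exported API-format JSON before running it.

```python
import copy
import json
import urllib.request

with open("workflow_api.json", "r", encoding="utf-8") as f:
    base = json.load(f)

prompts = [
    "A serene morning in a Japanese garden during cherry blossom season...",
    "A coffee shop in the morning. Espresso machine hisses and steams...",
]

TEXT_NODE_ID = "6"  # placeholder: use the ID of your positive-prompt node

for text in prompts:
    wf = copy.deepcopy(base)
    wf[TEXT_NODE_ID]["inputs"]["text"] = text  # swap in the new prompt
    payload = json.dumps({"prompt": wf}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:8188/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(text[:40], "->", json.loads(resp.read()).get("prompt_id"))
```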
Common Issues and Solutions
"CUDA out of memory" Errors:
Solutions in priority order:
- Reduce resolution to 512×512 or 768×512
- Reduce frame count to 18 or 27 frames
- Enable NVFP4 quantization (requires RTX 40-series+)
- Launch with python main.py --reserve-vram 4 (reserves 4GB for the OS)
- Use tiled VAE decoding
- Reduce batch size or close other applications
Poor Prompt Adherence (Model Ignores Parts of Instructions):
Root causes and fixes:
- Problem: Model ignores slow-motion request
- Fix: Explicitly include camera speed in prompt (e.g., "slow, deliberate movement")
- Problem: Background details missing
- Fix: Lead with background description, use higher CFG (8-10)
- Problem: Inconsistent style across generations
- Fix: Train/use LoRA weights for style consistency
Audio Quality Issues:
- No audio: Verify muxing node is connected; regenerate (audio synthesis can be inconsistent)
- Speech misalignment: Distilled model may have lower audio quality; use full model for dialogue
- Silent scenes: Model struggles with speech-free content; describe ambient sounds explicitly
Slow Generation Times:
Diagnostics:
- Check GPU utilization with nvidia-smi (should show 95%+ utilization)
- If the GPU is underutilized: check for a CPU bottleneck and verify PyTorch is compiled with CUDA support (see the check after this list)
- If GPU maxed: Use NVFP4, reduce resolution, or accept longer times
- Update NVIDIA drivers to latest (performance improvements release monthly)
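A quick way to verify the PyTorch/CUDA point above is a few lines of Python run inside the ComfyUI virtual environment:

```python
import torch

# Confirm PyTorch was built with CUDA and can see the GPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Torch CUDA version:", torch.version.cuda)
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"VRAM free/total: {free / 1e9:.1f} / {total / 1e9:.1f} GB")
```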
Practical Examples for Common Use Cases
Example 1: Product Demonstration Video (B2B Marketing)
textPrompt: "A sleek silver smartphone sits on a black glass table.
Soft studio lighting highlights the device edges. The phone's
screen illuminates with app icons. Camera slowly zooms in on
the device. Subtle ambient electronic sounds. Professional
product photography ambiance."
Settings:
- Resolution: 1024×576
- Duration: 27 frames (1.1 sec at 24 FPS)
- CFG: 7.0
- Steps: 35 (full model)
- Time: ~4 minutes on RTX 4090 FP8
Result: A professional product clip. Demonstrating key features typically needs 30-60 seconds of footage, so plan on multiple generations.
Example 2: Social Media Short (TikTok/Reels)
textPrompt: "A trendy young woman dances in a bright modern apartment.
Natural sunlight streams through large windows. Upbeat lo-fi hip-hop
music plays. Camera captures dynamic movement with quick cuts.
Energy is fun and relatable. The woman smiles at the camera.
Urban contemporary aesthetic."
Settings:
- Resolution: 768×512 (no upscaling needed for mobile)
- Duration: 36 frames (1.5 sec at 24 FPS)
- CFG: 6.0
- Steps: 30 (full model, quality is important for audience engagement)
- Time: ~3.5 minutes on RTX 4090 FP8
Result: A 15-30 second short once combined with music and editing, ready for immediate posting.
Example 3: Long-Form Narrative Content
Create a 10-second cinematic scene by generating three 4-second clips:

Scene 1 - Establishing:
```text
"Vast desert landscape during golden hour. Sand dunes extend
to the horizon. Warm sunlight creates dramatic shadows. Wind
gently moves sand. Sparse vegetation scattered across the dunes.
Soft, contemplative instrumental music. Camera pans across the landscape."
```

Scene 2 - Character Introduction:
```text
"A lone figure walks across the desert dunes. Weathered clothing.
Determined expression. Footsteps echo in the silence. Wind whistles.
Dramatic shadows cast by the setting sun. Camera follows the character
from a distance. Tension-building music swells."
```

Scene 3 - Climactic Moment:
```text
"The character reaches the peak of a tall dune and gazes at the
vast landscape. Powerful orchestral music crescendos. Golden light
bathes the scene. Camera slowly zooms out to show the character's
smallness against the immense landscape. Emotional, awe-inspiring mood."
```

Settings: 20-25 frames per clip, 24 FPS, 1024×576
Timeline: 10+ minutes total generation time
Post-Production: Stitch the clips together, color grade, add transitional effects, potentially add voice-over narration
Result: A professional cinematic micro-film suitable for film festivals, YouTube, or short-form narrative platforms.
USP: What Makes LTX-2 Different
LTX-2's competitive advantages are not merely incremental:
- Native Audio-Video Synchronization: The model generates video and audio in parallel rather than chaining separate models in a sequential pipeline, eliminating sync drift and rendering delays.
- Production-Ready Open Source: Full access to weights, training code, and inference implementation. You can fine-tune, modify, and deploy without licensing restrictions or usage quotas.
- 20-Second Temporal Coherence: Long-form video generation (6-20 seconds) with consistent quality throughout—critical for narrative content, demonstrations, and complex scenes.
- 4K Native Output: 2160p resolution at 50 FPS from a single pass, not upscaled from lower resolutions (though upscalers improve quality further).
- Local Privacy and Control: No cloud dependency means complete data privacy, no API rate limits, and ability to run offline after initial model download.
- NVIDIA Optimization: Direct partnership with NVIDIA ensures cutting-edge optimizations (NVFP4, NVFP8) that provide 3x speedups and 60% VRAM reduction—benefits immediately accessible to RTX GPU owners.
- LoRA Fine-Tuning: Efficient parameter adaptation allowing custom style training on consumer hardware without full model retraining.
Why LTX-2 Wins for Creators
Against Sora 2:
- Audio: LTX-2 generates synchronized audio natively; Sora 2 offers limited audio support
- Cost: LTX-2 free (after hardware); Sora 2 costs $1-$5 per 10-second video
- Control: LTX-2 allows full fine-tuning and LoRA training; Sora 2 is a black box
- Advantage to Sora 2: Superior photorealism and motion coherence in many scenarios (proprietary training data advantage)
Against Runway Gen-3:
- Audio: LTX-2 synchronized audio; Runway Gen-3 limited
- Cost: LTX-2 free; Runway requires API credits or subscription
- Privacy: LTX-2 local; Runway uploads to cloud servers
- Advantage to Runway: Faster iteration, more intuitive UI, better consistency in photorealistic scenarios
Against Pika 2.1:
- Resolution: LTX-2 4K; Pika 720p
- Cost: LTX-2 free; Pika $29-$76/month
- Audio: LTX-2 synchronized; Pika none
- Advantage to Pika: Better mobile UX, faster generation times, no hardware requirement
The verdict: For creators who can invest in hardware and value cost efficiency, privacy, and technical control, LTX-2 is unquestionably superior. For those prioritizing speed, ease-of-use, and photorealism without hardware investment, proprietary cloud solutions remain competitive.
Future of Local AI Video Generation
LTX-2 was released as fully open source on January 6, 2026, a date that will likely be remembered as pivotal in AI democratization. Within days, the community began:
- Optimizing models for even lower VRAM requirements
- Training custom LoRAs for specific styles and domains
- Building inference optimizations exceeding NVIDIA's official ones
- Exploring fine-tuning on domain-specific video datasets
By mid-January 2026, users reported successfully running LTX-2 on the RTX 3090 (24GB) with careful quantization, the RTX 4080 (16GB) with heavy optimization, and even the RTX 5070 Ti (16GB). The community is systematically breaking hardware barriers, making LTX-2 accessible to those without enterprise-grade GPUs.
Conclusion: The Practical Guide to Getting Started Today
Running LTX-2 on ComfyUI is no longer a technical challenge reserved for machine learning experts. The installation process, while requiring some command-line comfort, is now straightforward and well-documented. The performance is production-grade: 4K video with synchronized audio, generated locally at costs that amount to mere pennies per video.