LTX-2.3 is the first open-source model that generates synchronized video and audio together in a single pass. Built by Lightricks, this 22B-parameter Diffusion Transformer doesn’t just create video — it produces matching audio at the same time, whether that’s dialogue with lip-synced characters, ambient soundscapes, or music-driven visuals. What makes it stand out:
  • Joint audio-video generation — no need to generate video and audio separately and try to sync them
  • Multiple generation modes — text-to-video, image-to-video, audio-conditioned video, and keyframe interpolation
  • Fast inference — the distilled pipeline generates in just 8 denoising steps
  • Open source — full model weights available on Hugging Face, trainable with LoRA in under an hour
Running LTX-2.3 requires a powerful GPU. Vast.ai gives you on-demand access to the hardware you need, and the pre-built ComfyUI template means you can go from zero to generating videos in minutes — no CLI, SSH, or manual setup required.

Find and rent your GPU

  1. Set up your Vast account and add credit: Review the quickstart guide if you do not have an account with credits loaded.
  2. Deploy the LTX-2.3 template: Go to the LTX-2.3 model page and click Deploy Now. This takes you to the Vast console with the LTX-2.3 ComfyUI template pre-selected.
  3. Select a GPU: Choose an instance from the list and click Rent.
Vast.ai console showing available RTX 5090 instances with the LTX-2.3 ComfyUI template

Wait for provisioning

After renting, the instance automatically downloads all required model weights. You’ll see a loading screen while models download. On a fast connection this takes just a few minutes. Once complete, the instance status shows a green Running indicator.
Running instance showing verified status

Open ComfyUI

Click the Open button on your instance to launch the Instance Portal. Click Launch Application under ComfyUI to open the visual workflow editor.
Instance Portal showing ComfyUI, API Wrapper, Jupyter, and other services
In the left sidebar under Workflows > Browse, you’ll see four pre-loaded workflows:
Workflow              Description
video_ltx2_3_t2v      Text-to-Video
video_ltx2_3_i2v      Image-to-Video
video_ltx2_3_ia2v     Image + Audio-to-Video
video_ltx2_3_flf2v    First & Last Frame Interpolation
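These workflows can also be queued without clicking through the editor: ComfyUI exposes an HTTP API on its listening port. A minimal sketch, assuming your instance's ComfyUI address and a workflow graph exported via "Save (API Format)" in the editor (the URL, filename, and client_id below are placeholders):

```python
import json
import urllib.request

COMFYUI_URL = "http://localhost:8188"  # assumption: replace with your instance's ComfyUI address


def build_payload(workflow: dict, client_id: str = "vast-docs") -> bytes:
    """Wrap an API-format workflow graph in the JSON body that /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")


def queue_workflow(base_url: str, workflow: dict) -> dict:
    """POST a workflow to ComfyUI's /prompt endpoint; the response includes a
    prompt_id that can later be polled at /history/<prompt_id>."""
    req = urllib.request.Request(
        f"{base_url}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires a running instance):
#   graph = json.load(open("video_ltx2_3_t2v_api.json"))  # exported via "Save (API Format)"
#   print(queue_workflow(COMFYUI_URL, graph))
```

This is the same mechanism the Instance Portal's API Wrapper builds on; for one-off generations the visual editor is simpler.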

Text-to-Video

Select video_ltx2_3_t2v from the sidebar. Enter a descriptive prompt in the Video Generation node, describing camera angles, lighting, and motion cinematically. Adjust width, height, and frame count if desired (defaults: 1280x720, 121 frames at 25 fps, about 4.8 seconds). Click Run.
Text-to-Video workflow in ComfyUI
The workflow includes automatic prompt enhancement powered by Gemma 3, which expands short prompts into detailed cinematic descriptions.
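If you change the frame count, note that 121 frames at 25 fps is roughly 4.8 seconds, and that LTX-family pipelines have typically expected frame counts of the form 8k + 1 (121 = 8 × 15 + 1). Treating that rule as an assumption carried over from earlier LTX-Video releases, a small helper can pick a valid frame count for a target duration:

```python
def valid_frame_count(duration_s: float, fps: int = 25) -> int:
    """Round duration_s * fps to the nearest frame count of the form 8k + 1,
    which LTX-family models typically expect (assumed minimum: 9 frames)."""
    raw = round(duration_s * fps)
    k = max(1, round((raw - 1) / 8))
    return 8 * k + 1


print(valid_frame_count(4.8))  # -> 121, the workflow default
print(valid_frame_count(2.0))  # -> 49
```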

Image-to-Video

Select video_ltx2_3_i2v. Upload a reference image in the Load Image node (a sample Egyptian queen image is included). Enter a prompt describing how the image should come to life. Click Run. The model uses your image as the first frame and generates consistent motion.
Image-to-Video workflow in ComfyUI

Image + Audio-to-Video

Select video_ltx2_3_ia2v. Upload a reference image and an audio file (a sample MP3 is included). Enter a prompt describing the scene. Click Run. The model generates video synchronized to the audio — lip movements match dialogue, and scene energy follows the audio’s rhythm.
Image+Audio-to-Video workflow in ComfyUI
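Because the video is driven by the audio, it usually makes sense to match the frame count to the clip's length. A sketch using Python's standard `wave` module (which reads only uncompressed WAV, so a compressed sample would first need decoding, e.g. with ffmpeg; the 8k + 1 frame-count rule is the same assumption as above):

```python
import wave


def frames_for_audio(wav_path: str, fps: int = 25) -> int:
    """Read a WAV file's duration and return the nearest 8k + 1 frame count
    (assumed LTX-friendly) covering it at the given fps."""
    with wave.open(wav_path, "rb") as w:
        duration_s = w.getnframes() / w.getframerate()
    raw = round(duration_s * fps)
    k = max(1, round((raw - 1) / 8))
    return 8 * k + 1
```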

First & Last Frame Interpolation

Select video_ltx2_3_flf2v. Load two images — a first frame and a last frame (sample car images are included). Enter a prompt describing the transition. Click Run. The model generates a smooth video interpolation between your two keyframes.
First-Last-Frame interpolation workflow in ComfyUI

Cleanup

When finished, go to the Vast.ai console and click Delete on your instance to stop charges.
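The same cleanup can be scripted with the Vast CLI (`pip install vastai`, then `vastai set api-key <KEY>` once). A minimal wrapper, assuming the CLI is installed and configured:

```python
import subprocess


def destroy_command(instance_id: int) -> list:
    """Build the CLI invocation for deleting one instance."""
    return ["vastai", "destroy", "instance", str(instance_id)]


def destroy_instance(instance_id: int) -> None:
    """Delete a Vast.ai instance via the vastai CLI. Deleting (not merely
    stopping) the instance is what ends all charges."""
    subprocess.run(destroy_command(instance_id), check=True)
```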

Resources