Dubbing Journal

Independent reporting on AI dubbing, localization, and voice technology.

technology

How AI Lip Sync Actually Works

A technical breakdown of the AI lip sync pipeline — from face detection to video synthesis — with real accuracy numbers and honest limitations.

Dubbing Journal

April 8, 2026 · 7 min read

Table of Contents

  1. The four-stage pipeline
  2. Face detection and landmark tracking
  3. From audio to mouth shapes
  4. Video synthesis and rendering
  5. Where it breaks down

The four-stage pipeline

AI lip sync is a four-stage pipeline that transforms dubbed audio into matching mouth movements on existing video. The stages — face detection, phoneme extraction, mouth shape prediction, and video synthesis — run sequentially, and each one introduces its own error margin. Understanding these stages explains both why the technology works as well as it does and why it fails when it does.

Most people encounter lip sync as a single button click in a dubbing tool. But underneath that button sits a chain of neural networks, each specialized and each fragile in its own way.
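The four stages can be sketched as a sequential chain. This is an illustrative outline, not any specific tool's API; the function names, signatures, and placeholder return values are hypothetical.

```python
# Illustrative sketch of the four-stage lip sync pipeline.
# All names and data shapes here are hypothetical placeholders.

def detect_landmarks(frames):
    """Stage 1: locate the face and track lip landmarks in every frame."""
    return [{"frame": i, "landmarks": None} for i, _ in enumerate(frames)]

def extract_phonemes(audio):
    """Stage 2: run ASR on the dubbed audio to get timed phonemes."""
    return [("HH", 0.00, 0.08), ("AH", 0.08, 0.20)]  # (phoneme, start_s, end_s)

def predict_visemes(phonemes):
    """Stage 3: map each timed phoneme to a mouth shape (viseme)."""
    return [(p, start, end) for p, start, end in phonemes]

def synthesize(frames, landmarks, visemes):
    """Stage 4: regenerate the lower-face region to match the visemes."""
    return frames  # placeholder: real systems run a GAN/diffusion generator

def lip_sync(frames, audio):
    # The stages run sequentially, so errors in early stages propagate.
    landmarks = detect_landmarks(frames)
    phonemes = extract_phonemes(audio)
    visemes = predict_visemes(phonemes)
    return synthesize(frames, landmarks, visemes)
```

The key structural point is the strict ordering: a tracking failure in stage 1 degrades everything downstream, which is why each stage's error margin compounds.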

Face detection and landmark tracking

The pipeline starts with finding and tracking the face across every frame. Modern detectors like MediaPipe or RetinaFace locate the face bounding box, then extract 68-478 facial landmarks depending on the model. These landmarks map the contours of the jaw, lips, nose, and eyes with sub-pixel precision.

At 24 fps, a 10-minute video means 14,400 frames to process. Each frame requires landmark detection in under 40 milliseconds to keep processing practical. According to Slator's 2025 AI Dubbing Market Report, commercial tools achieve consistent landmark tracking on 94% of frames for frontal shots.
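The frame count and the per-frame budget come straight from that arithmetic:

```python
# Back-of-envelope frame budget for the figures above.
fps = 24
minutes = 10
frames = fps * 60 * minutes          # 14,400 frames in a 10-minute video

budget_ms = 40                       # per-frame landmark detection budget
total_detection_s = frames * budget_ms / 1000
print(frames, total_detection_s)     # 14400 frames, 576.0 s (~9.6 min) worst case
```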

That 94% number hides important nuance. Tracking confidence drops sharply when the subject turns past 30 degrees from center. Profile views — common in dialogue scenes shot with over-the-shoulder framing — push accuracy below 70%. Some tools handle this by freezing the last confident landmark position and interpolating, which creates a subtle but noticeable stiffness.
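The freeze-and-interpolate fallback can be sketched in a few lines. This is a minimal version with one (x, y) landmark per frame and an assumed confidence threshold; real trackers do this per landmark across hundreds of points.

```python
def interpolate_gap(landmarks, confidences, threshold=0.5):
    """Replace low-confidence landmark positions by linearly interpolating
    between the nearest confident neighbors. `landmarks` is a list of
    (x, y) tuples, one per frame; `threshold` is an assumed cutoff."""
    out = list(landmarks)
    n = len(out)
    i = 0
    while i < n:
        if confidences[i] >= threshold:
            i += 1
            continue
        lo = i - 1                        # last confident frame before the gap
        hi = i
        while hi < n and confidences[hi] < threshold:
            hi += 1                       # first confident frame after the gap
        for j in range(i, hi):
            if lo < 0 and hi < n:         # gap at start: hold the next value
                out[j] = out[hi]
            elif hi >= n and lo >= 0:     # gap at end: hold the last value
                out[j] = out[lo]
            elif lo >= 0 and hi < n:      # interior gap: linear blend
                t = (j - lo) / (hi - lo)
                out[j] = tuple(a + t * (b - a) for a, b in zip(out[lo], out[hi]))
        i = hi
    return out
```

The stiffness the article describes comes from exactly this: interpolated positions move in straight lines, while real mouths do not.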

Fast head motion causes a different problem. Motion blur smears facial features across the frame, and landmarks either jitter or disappear entirely. The Wav2Lip paper (Prajwal et al., 2020) documented a 12% drop in sync accuracy on frames with significant motion blur, a number that hasn't improved much since.

From audio to mouth shapes

The second and third stages happen almost simultaneously. The system extracts phonemes from the dubbed audio — the individual speech sounds — and maps each phoneme to a corresponding mouth shape, called a viseme.

English has roughly 44 phonemes that map to about 14-22 distinct visemes, depending on the model's granularity. The mapping isn't one-to-one. The phonemes /b/, /p/, and /m/ all produce the same closed-lips viseme. This is actually helpful — it means the system doesn't need perfect phoneme recognition to produce reasonable lip movement.
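The many-to-one mapping is just a lookup table. The grouping below is a tiny illustrative fragment using ARPAbet-style symbols, not a complete or authoritative viseme inventory:

```python
# Minimal phoneme-to-viseme table (ARPAbet-style symbols).
# Illustrative fragment only; real tables cover all ~44 English
# phonemes across 14-22 viseme classes.
PHONEME_TO_VISEME = {
    "B": "closed_lips", "P": "closed_lips", "M": "closed_lips",
    "F": "lip_teeth",   "V": "lip_teeth",
    "AA": "open_wide",  "AE": "open_wide",
    "UW": "rounded",    "OW": "rounded",
}

def to_visemes(phonemes):
    # Unknown phonemes fall back to a neutral shape rather than failing,
    # which is one reason imperfect ASR still yields usable lip movement.
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]
```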

Phoneme extraction runs through an automatic speech recognition (ASR) frontend, typically a transformer-based model. Processing speed here is fast: 50-100x real time on a modern GPU, according to benchmarks from the VideoReTalking pipeline (Cheng et al., 2022). This stage rarely bottlenecks the pipeline.

But language matters enormously. Models trained predominantly on English struggle with phoneme sets they haven't seen enough of. Tonal languages like Mandarin present a particular challenge because pitch variation can change which viseme the model predicts — a problem that doesn't exist in English. Industry benchmarks suggest accuracy for non-English languages runs 7-12 percentage points lower than English, according to Slator (2025).

The viseme prediction stage also has to handle coarticulation — the way adjacent sounds influence mouth shape. Your mouth starts forming the next sound before finishing the current one. Good models predict this overlap. Cheap ones don't, and the result looks robotic: each mouth shape snaps into place independently, like a ventriloquist dummy.
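A crude way to model coarticulation is to start easing toward the next mouth shape before the current one ends. This sketch blends scalar mouth-openness targets with an assumed overlap fraction; production models learn this from data rather than using a fixed mix.

```python
def blend_visemes(targets, overlap=0.3):
    """Given per-viseme mouth-openness targets in [0, 1], mix each value
    with the upcoming one so transitions overlap. `overlap` is an assumed
    fraction of each step spent anticipating the next sound."""
    blended = []
    for i, cur in enumerate(targets):
        nxt = targets[i + 1] if i + 1 < len(targets) else cur
        blended.append((1 - overlap) * cur + overlap * nxt)
    return blended
```

With `overlap=0`, each shape snaps into place independently, which is the ventriloquist-dummy look; with a nonzero overlap, the mouth visibly anticipates the next sound.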

Video synthesis and rendering

This is where the magic happens. And where most of the compute budget goes.

The synthesis network takes two inputs: the original video frame, and the predicted viseme sequence with timing, typically derived from dubbed audio produced by a voice clone of the original speaker. From these it generates a new lower-face region that matches the target audio; the rest of the face and the entire background stay untouched. Modern approaches use a GAN (generative adversarial network) or a diffusion-based generator trained on millions of talking-head videos.

Processing cost is substantial. On an NVIDIA A100 GPU, most pipelines render at 2-5x real time, so a 10-minute clip takes 20-50 minutes of GPU time. At cloud pricing of roughly $1.50-3.00 per GPU hour, the compute cost for lip sync alone runs $0.50-2.50 per 10-minute clip, or about $0.05-0.25 per minute of video. That cost is separate from the voice synthesis and translation steps.
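Those figures reduce to one formula: GPU minutes per video minute times the hourly rate.

```python
def lip_sync_cost_per_video_minute(speed_x, gpu_price_per_hour):
    """GPU cost per minute of output video. `speed_x` is the render speed
    as a multiple of real time (2 means 1 min of video needs 2 min of GPU)."""
    gpu_minutes = speed_x                     # GPU minutes per video minute
    return gpu_price_per_hour * gpu_minutes / 60

low = lip_sync_cost_per_video_minute(2, 1.50)   # fastest render, cheapest GPU
high = lip_sync_cost_per_video_minute(5, 3.00)  # slowest render, priciest GPU
print(round(low, 2), round(high, 2))            # 0.05 0.25
```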

Resolution matters more than you'd expect. At 720p, the mouth region is small enough that minor artifacts disappear. At 4K, every imperfection is visible — slight color mismatches at the blend boundary, texture inconsistencies on the chin, teeth that look slightly different from one frame to the next. Most tools quietly downscale the face region, process it, then upscale back. The result works, but it introduces a subtle softness around the mouth that trained eyes spot immediately.

The best current systems, like those documented in the VideoReTalking paper, separate the pipeline into face parsing, lip sync generation, and face enhancement — three distinct networks. This modularity lets each component improve independently, but it also means three potential failure points.

Where it breaks down

AI lip sync has five reliable failure modes, and anyone evaluating these tools should test for all of them.

Occlusion. Hands touching the face, microphones, or other objects crossing the mouth area confuse the generator. The model hallucinates mouth shapes behind the obstruction, often producing uncanny distortions. No current commercial tool handles this well.

Profile and three-quarter views. As mentioned, landmark tracking degrades past 30 degrees. But the synthesis network has a separate problem — it has far fewer training examples of side-view mouths, so generated shapes look less natural. Some tools switch to audio-only timing adjustment (keeping original mouth movement, just shifting timing) for non-frontal angles.

Emotional extremes. Shouting, crying, laughing — high-intensity expressions deform the face in ways that don't follow normal viseme patterns. The model defaults to neutral-to-moderate expressions because that's what dominates training data.

Fast speech. Above roughly 180 words per minute, the phoneme-to-viseme mapping can't keep up with the natural coarticulation speed. Mouth shapes start lagging or blurring together.

Teeth and tongue. These are the hardest elements to synthesize convincingly. Teeth have specular reflections that shift with lighting and angle. The tongue is rarely visible in training data but critical for sounds like /l/, /th/, and /n/. Most systems avoid rendering the tongue entirely, which looks fine for most phonemes but wrong for close-up shots.

The honest assessment: AI lip sync in 2026 works well enough for corporate video, e-learning, social media content, and mid-shot dialogue scenes. It does not yet match the quality bar for theatrical release close-ups or high-emotion dramatic scenes. That gap is closing — accuracy benchmarks improve roughly 3-5 percentage points annually — but it isn't closed yet.

Tools differ significantly in how they handle these edge cases. Some fail silently, producing bad output. Others flag low-confidence frames for human review. The latter approach is almost always worth the extra integration effort.
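A review-flagging pass over per-frame statistics might look like the sketch below. The stat keys and thresholds are hypothetical (the 30-degree and 180 wpm figures come from the failure modes above; the confidence cutoff is an assumption), but the shape of the check is what matters: fail loudly, per frame, with a reason.

```python
def frames_for_review(frame_stats, yaw_limit=30.0, min_confidence=0.8,
                      max_wpm=180):
    """Flag frames that match a known failure mode. `frame_stats` is a list
    of dicts; the keys and thresholds here are illustrative assumptions."""
    flagged = []
    for s in frame_stats:
        reasons = []
        if abs(s.get("head_yaw_deg", 0.0)) > yaw_limit:
            reasons.append("profile_view")          # tracking degrades past 30°
        if s.get("landmark_confidence", 1.0) < min_confidence:
            reasons.append("low_tracking_confidence")
        if s.get("mouth_occluded", False):
            reasons.append("occlusion")             # hands, mics over the mouth
        if s.get("speech_wpm", 0.0) > max_wpm:
            reasons.append("fast_speech")           # visemes lag above ~180 wpm
        if reasons:
            flagged.append((s["frame"], reasons))
    return flagged
```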
