Realistic Visual Speech Synthesis Based on AAM Features and an Articulatory DBN Model with Constrained Asynchrony

Peng Wu, Dongmei Jiang, He Zhang, Hichem Sahli

Research output: Contribution to journal / Conference paper


This paper presents a novel photo-realistic visual speech synthesis method based on an audio-visual articulatory dynamic Bayesian network model with constrained asynchrony (AF_AVDBN). Conditional probability distributions are defined to control the asynchronies between the articulatory features, such as lips, tongue and glottis/velum. Perceptual linear prediction (PLP) features from the audio speech and active appearance model (AAM) features from mouth images of the visual speech are used to train the AF_AVDBN model for continuous speech. An EM-based optimal visual feature learning algorithm is derived given the input auditory speech and the trained AF_AVDBN parameters. Finally, photo-realistic mouth images are synthesized from the learned AAM features. Objective evaluations show that the visual features learned using AF_AVDBN track the real parameters much more closely than those from the SA_DBN and SS_DBN models. Subjective evaluation results show that very high quality mouth animations can be obtained with the AF_AVDBN model. By considering the asynchronies between articulatory features in AF_AVDBN (as well as between audio and visual states in SA_DBN), good synchronization between the audio speech and the mouth animations is obtained. Moreover, the mouth animations from AF_AVDBN are much more accurate than those from SA_DBN and SS_DBN, because AF_AVDBN captures the dynamic movements of the articulatory features and thus models the pronunciation process more precisely.
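
The final synthesis step, turning learned AAM features back into mouth images, can be illustrated with a minimal sketch. It assumes a purely linear appearance model in which an image is the mean image plus a weighted sum of orthonormal basis vectors; a full AAM also models shape and warps texture, which is omitted here. The function names, the 4-pixel toy data, and the basis are hypothetical and are not taken from the paper.

```python
def project(image, mean, basis):
    """Return the coefficients of (image - mean) on each orthonormal basis vector."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(b * c for b, c in zip(vec, centered)) for vec in basis]


def synthesize(params, mean, basis):
    """Rebuild a pixel vector as mean + sum_k params[k] * basis[k]."""
    image = list(mean)
    for weight, vec in zip(params, basis):
        for i, b in enumerate(vec):
            image[i] += weight * b
    return image


# Hypothetical toy model: 4-pixel "mouth images" with two orthonormal
# appearance modes (assumed for illustration, not from the paper).
mean = [0.5, 0.5, 0.5, 0.5]
basis = [
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
]

original = [0.9, 0.5, 0.5, 0.1]
features = project(original, mean, basis)           # appearance features
reconstructed = synthesize(features, mean, basis)   # image from features
```

In this sketch the features are obtained by projecting a known image; in the method described above they would instead come from the EM-based learning step driven by the audio input.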
Original language: English
Pages (from-to): 59-64
Number of pages: 6
Journal: Proceedings of the International Conference on Audio-Visual Speech Processing
Publication status: Published - 2011
Event: Unknown
Duration: 1 Jan 2011 → …


  • Visual speech synthesis
  • Constrained asynchrony
  • AAM features
