High-quality Flemish Text-to-Speech Synthesis

Lukas Latacz, Wesley Mattheyses, Werner Verhelst

Research output: Chapter in Book/Report/Conference proceeding › Meeting abstract (Book)

Abstract

Even though speech synthesis is the most frequently needed language technology for people with communicative disabilities (e.g. see [1]), the number of commercially available synthetic voices is still rather small, especially for medium-sized languages such as Dutch and for specific language variants such as Flemish (i.e. Southern Dutch). The lack of high-quality Flemish synthetic voices available for research purposes inspired us to construct a new high-quality speech synthesizer, the DSSP synthesizer, able to synthesize Flemish speech. New voices for the synthesizer are constructed from a speech database containing recordings of a single speaker and the corresponding orthographic transcription. Our aim has been to automate the voice-building process as much as possible by 1) automatically annotating these recordings, 2) automatically spotting potential errors, and 3) deriving optimal settings without human intervention. The annotated speech recordings form the basis for constructing a speaker-dependent front-end that captures the speaking style of the original speaker by modelling speaker-specific pronunciations, prosodic phrase breaks, silences, accented words and prominent syllables. The front-end performs the language-dependent processing part of the synthesis, for which it uses multiple linguistic levels (segment, syllable, word and phrase) and shallow linguistic features. The DSSP synthesizer uses the two dominant speech synthesis techniques: unit selection synthesis [2] and hidden Markov model (HMM)-based synthesis [3]. HMM-based synthesis uses a flexible statistical parametric model of speech, but sounds less natural than unit selection synthesis, which selects small units from the speech database and concatenates the corresponding waveforms. Various demonstrations of our Flemish voices will be given during the talk.
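
As a rough illustration of the unit selection step mentioned above, the sketch below implements a Hunt-and-Black-style search [2]: each target phone has a set of candidate database units, and a Viterbi search picks the sequence that minimises the sum of target costs (mismatch with the predicted prosodic context) and join costs (discontinuity at each concatenation point). The features, weights and names used here are illustrative assumptions only and do not reflect the DSSP synthesizer's actual cost functions.

# Minimal sketch of Hunt & Black-style unit selection [2]. All features and
# weights are illustrative assumptions, not the DSSP synthesizer's own.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str       # phone label of the database unit
    pitch: float     # mean F0 of the unit (Hz)
    duration: float  # unit duration (s)

@dataclass
class Target:
    phone: str
    pitch: float
    duration: float

def target_cost(t: Target, u: Unit) -> float:
    """Penalise mismatch between the predicted target and a candidate unit."""
    return abs(t.pitch - u.pitch) / 100.0 + abs(t.duration - u.duration) * 10.0

def join_cost(left: Unit, right: Unit) -> float:
    """Penalise discontinuity at the concatenation point (here: a pitch jump)."""
    return abs(left.pitch - right.pitch) / 100.0

def select_units(targets: list[Target],
                 candidates: list[list[Unit]]) -> list[Unit]:
    """Viterbi search for the cheapest unit sequence."""
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1]))
            row.append((cost, back))
        best.append(row)
    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

if __name__ == "__main__":
    targets = [Target("h", 120.0, 0.06), Target("A", 118.0, 0.10)]
    candidates = [
        [Unit("h", 130.0, 0.05), Unit("h", 121.0, 0.07)],
        [Unit("A", 119.0, 0.09), Unit("A", 150.0, 0.12)],
    ]
    print(select_units(targets, candidates))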

References
[1] M. Ruiter, L. Beijer, C. Cucchiarini, E. Krahmer, T. M. Rietveld, H. Strik, and H. Van hamme, “Human language technology and communicative disabilities: requirements and possibilities for the future,” Lang. Resour. Eval., vol. 46, no. 1, pp. 143–151, Mar. 2012.
[2] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), Atlanta, GA, USA, 1996, vol. 1, pp. 373–376.
[3] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech Synthesis Based on Hidden Markov Models,” Proc. IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.

Original language: English
Title of host publication: Computational Linguistics in the Netherlands (CLIN25)
Place of Publication: Antwerp, Belgium
Publication status: Published - 6 Feb 2015
Event: Computational Linguistics in the Netherlands 25 (CLIN25) - Antwerp, Belgium
Duration: 5 Feb 2015 – 6 Feb 2015

Conference

Conference: Computational Linguistics in the Netherlands 25 (CLIN25)
Country: Belgium
City: Antwerp
Period: 5/02/15 – 6/02/15
