Deep learning speech synthesis

Lua error in package.lua at line 80: module 'strict' not found. Deep learning speech synthesis uses Deep Neural Networks (DNN) to produce artificial speech from text (text-to-speech) or spectrum (vocoder). The deep neural networks are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Some DNN-based speech synthesizers are approaching the naturalness of the human voice.

Formulation

Given an input text or some sequence of linguistic unit $Y$ , the target speech $X$ can be derived by

Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): {\displaystyle X=\arg\max P(X|Y, \theta)}

where $\theta$ is the model parameter.

Typically, the input text will first be passed to an acoustic feature generator, then the acoustic features are passed to the neural vocoder. For the acoustic feature generator, the Loss function is typically L1 or L2 loss. These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function will be designed to have more penalty on this range:

Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): {\displaystyle loss=\alpha \text{loss}_{\text{human}} + (1 - \alpha) \text{loss}_{\text{other}}}

where Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): \text{loss}_{\text{human}}

is the loss from human voice band and  $\alpha$  is a scalar typically around 0.5. The acoustic feature is typically Spectrogram or spectrogram in Mel scale. These features capture the time-frequency relation of speech signal and thus, it is sufficient to generate intelligent outputs with these acoustic features. The Mel-frequency cepstrum feature used in the speech recognition task is not suitable for speech synthesis because it reduces too much information.

Brief history

File:WaveNet animation.gif

A stack of dilated casual convolutional layers used in WaveNet^[1]

In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered to be computationally expensive and slow to be used in consumer products at the time, a year after its release, DeepMind unveiled a modified version of WaveNet known as "Parallel WaveNet," a production model 1,000 faster than the original.^[1]

In early 2017, Mila proposed char2wav, a model to produce raw waveform in an end-to-end method. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text; months later, Google proposed Tacotron2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron2 can generate high-quality speech approaching the human voice. Since then, end-to-end methods have become the hottest research topic because many researchers around the world have started to notice the power of end-to-end speech synthesizers.^[2]^[3]

Semi-supervised learning

Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research^[4]^[5] has shown that, with the aid of self-supervised loss, the need for paired data decreases.

Zero-shot speaker adaptation

Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristic. In June 2018, Google proposed to use pre-trained speaker verification models as speaker encoders to extract speaker embeddings.^[6] The speaker encoders then become part of the neural text-to-speech models, so that it can determine the style and characteristics of the output speech. This procedure has shown the community that it is possible to use only a single model to generate speech with multiple styles.

Neural vocoder

File:Larynx-HiFi-GAN speech sample.wav

Speech synthesis example using the HiFi-GAN neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent performance on speech quality. Wavenet factorised the joint probability of a waveform $\mathbf{x}=\{x_1,...,x_T\}$ as a product of conditional probabilities as follows

$p_{\theta}(\mathbf{x})=\prod_{t=1}^{T}p(x_t|x_1,...,x_{t-1})$

where $\theta$ is the model parameter including many dilated convolution layers. Thus, each audio sample $x_t$ is conditioned on the samples at all previous timesteps. However, the auto-regressive nature of WaveNet makes the inference process dramatically slow. To solve this problem, Parallel WaveNet^[7] was proposed. Parallel WaveNet is an inverse autoregressive flow-based model which is trained by knowledge distillation with a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-auto-regressive when performing inference, the inference speed is faster than real-time. Meanwhile, Nvidia proposed a flow-based WaveGlow^[8] model, which can also generate speech faster than real-time. However, despite the high inference speed, parallel WaveNet has the limitation of needing a pre-trained WaveNet model, so that WaveGlow takes many weeks to converge with limited computing devices. This issue has been solved by Parallel WaveGAN,^[9] which learns to produce speech through multi-resolution spectral loss and GAN learning strategies.

Synthesis example

220px

The Chaos (short version) synthesized by VITS, a research deep-learning-based end-to-end text-to-speech method, using the LJ Speech dataset.

Problems playing this file? See media help.

References

↑ ^1.0 ^1.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.

[deepmind-1] 1.0 ^1.1 Lua error in package.lua at line 80: module 'strict' not found.

[2] Lua error in package.lua at line 80: module 'strict' not found.

[3] Lua error in package.lua at line 80: module 'strict' not found.

[4] Lua error in package.lua at line 80: module 'strict' not found.

[5] Lua error in package.lua at line 80: module 'strict' not found.

[6] Lua error in package.lua at line 80: module 'strict' not found.

[7] Lua error in package.lua at line 80: module 'strict' not found.

[8] Lua error in package.lua at line 80: module 'strict' not found.

[9] Lua error in package.lua at line 80: module 'strict' not found.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

v t e Speech synthesis
Free software	eSpeak Gnopernicus Gnuspeech Orca Festival Speech Synthesis System FreeTTS Sinsy Automatik Text Reader
Proprietary software	DECtalk Software Automatic Mouth Talk It! Microsoft Agent Microsoft Speech API Microsoft text-to-speech voices Readspeaker Voice browser CoolSpeech VoiceWeb BrowseAloud LaLaVoice Vocaloid Cantor Symphonic Choirs IVONA CereProc Utau Voiceroid NIAONiao Virtual Singer Vocalina Realivox CeVIO Creative Studio Chipspeech Alter/Ego PPG Phonem
Machine	Echo 2 Pattern playback Phasor RIAS Texas Instruments LPC Speech Chips TuVox
Applications	AOLbyPhone DialogOS Dr. Sbaitso MBROLA Microsoft Narrator Microsoft Speech Server PlainTalk Voice font
Protocols	Speech Synthesis Markup Language SABLE VoiceXML
Developers/ Researchers	Alan W. Black Catherine Browman Franklin Seaney Cooper Gunnar Fant Haskins Laboratories Wolfgang von Kempelen Ignatius Mattingly Philip Rubin Yamaha
Process	Articulatory synthesis Concatenative synthesis Currah Inverse filter PSOLA Phase vocoder Self-voicing

Deep learning speech synthesis

Contents

Formulation

Brief history

Semi-supervised learning

Zero-shot speaker adaptation

Neural vocoder

References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools