Postnet Layer

In some end-to-end TTS models today, after the hidden representations pass through the decoder, we get a mel-spectrogram that contains the predicted acoustic features. The decoder predictions are then passed through the Postnet layer, which predicts residual information to improve the reconstruction performance of the model. The section below collects some notes I made about the Postnet layer while learning TTS.
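To make the residual idea concrete, here is a minimal sketch in plain Python, not taken from any particular model. It assumes the Postnet is a small stack of 1-D convolutions over time with tanh on all but the last layer, and that its output is added back to the decoder's mel-spectrogram. The layer count, kernel size, channel sizes, random weights, and all function names (`conv1d_same`, `init_conv`, `postnet_residual`, `apply_postnet`) are illustrative assumptions; a real system would use a deep learning framework and trained weights.

```python
import math
import random

def conv1d_same(frames, weights, bias):
    """1-D convolution over time with zero padding, so the output keeps T frames.

    frames:  list of T frames, each a list of C_in floats
    weights: weights[k][i][o] for kernel tap k, input channel i, output channel o
    bias:    list of C_out floats
    """
    kernel, c_in, c_out = len(weights), len(weights[0]), len(bias)
    pad = kernel // 2
    out = []
    for t in range(len(frames)):
        row = list(bias)
        for k in range(kernel):
            src = t + k - pad
            if 0 <= src < len(frames):  # positions outside the sequence count as zeros
                for i in range(c_in):
                    x = frames[src][i]
                    for o in range(c_out):
                        row[o] += x * weights[k][i][o]
        out.append(row)
    return out

def init_conv(kernel, c_in, c_out, rng, scale=0.1):
    """Random (untrained) weights and zero bias, for illustration only."""
    weights = [[[rng.uniform(-scale, scale) for _ in range(c_out)]
                for _ in range(c_in)] for _ in range(kernel)]
    return weights, [0.0] * c_out

def postnet_residual(mel, layers):
    """Run the conv stack; tanh on every layer except the last (an assumption)."""
    h = mel
    for idx, (w, b) in enumerate(layers):
        h = conv1d_same(h, w, b)
        if idx < len(layers) - 1:
            h = [[math.tanh(v) for v in frame] for frame in h]
    return h

def apply_postnet(mel, layers):
    """Refined mel = decoder mel + residual predicted by the Postnet."""
    residual = postnet_residual(mel, layers)
    return [[m + r for m, r in zip(mf, rf)] for mf, rf in zip(mel, residual)]

rng = random.Random(0)
n_mels, hidden, kernel = 4, 8, 5  # toy sizes, chosen only for illustration
layers = [init_conv(kernel, n_mels, hidden, rng),
          init_conv(kernel, hidden, hidden, rng),
          init_conv(kernel, hidden, n_mels, rng)]
decoder_mel = [[rng.uniform(-1, 0) for _ in range(n_mels)] for _ in range(6)]
refined_mel = apply_postnet(decoder_mel, layers)
print(len(refined_mel), len(refined_mel[0]))  # same shape as the decoder output
```

Because the Postnet only predicts a correction, a Postnet that outputs zeros leaves the decoder prediction unchanged, which is one reason the residual form is convenient.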
1. https://arxiv.org/pdf/1908.11535.pdf - 30 Aug 2019

In addition to the decoder, some systems have a post-net, an additional network that predicts acoustic features. A post-net was originally introduced to convert acoustic features to different acoustic features that were suitable for an adopted waveform synthesis method, for example, from mel spectrograms to linear spectrograms [2] or mel spectrograms to vocoder parameters [4]. In recent studies the role of the post-net was to improve the acoustic features predicted by the decoder to improve quality further [5, 6]. The post-net introduces an additional loss term in the objective function.
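As a rough illustration of the "additional loss term" mentioned in the quote, the sketch below sums one loss on the decoder output and one on the post-net output. It assumes both outputs live in the same feature domain as the target (for example, a refined mel-spectrogram) and that mean squared error is used; the quoted paper does not specify either choice here, and the names `mse` and `tts_loss` and the toy values are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over all frames and feature bins."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def tts_loss(decoder_out, postnet_out, target):
    """Decoder loss plus the extra term contributed by the post-net."""
    return mse(decoder_out, target) + mse(postnet_out, target)

target = [[0.0, 1.0], [2.0, 3.0]]
before = [[0.5, 1.0], [2.0, 2.0]]  # hypothetical decoder output
after = [[0.0, 1.0], [2.0, 2.5]]   # hypothetical post-net output
print(tts_loss(before, after, target))
```

Keeping the decoder term in the objective means the decoder is still trained to produce a usable prediction on its own, while the second term trains the post-net to refine it.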

2. https://arxiv.org/pdf/2008.03388.pdf - 11 Aug 2020

Relative to DAR, C-DAR has three additional changes that do not significantly impact naturalness or controllability, but provide additional insights into F0 generation. First, a 5-layer postnet [3] follows the autoregressive RNN. We find that this postnet has the effect of reducing autoregressive sampling errors and tightening the posterior distribution around the argmax (Figure 2)

Le Minh Nguyen (nguyenlm)
Research Engineer

A Software Engineer who loves NLP & Speech Technology.
