Journal Home Online First Current Issue Archive For Authors Journal Information 中文版

Frontiers of Information Technology & Electronic Engineering >> 2022, Volume 23, Issue 7 doi: 10.1631/FITEE.2100501

Efficient decoding self-attention for end-to-end speech synthesis

Affiliation(s): College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China; Institute of Robotics, Zhejiang University, Yuyao 315400, China; less

Received: 2021-10-21 Accepted: 2022-07-21 Available online: 2022-07-21

Next Previous

Abstract

has been innovatively applied to text-to-speech (TTS) because of its parallel structure and superior strength in modeling sequential data. However, when used in with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem becomes particularly severe on devices without graphics processing units (GPUs). To alleviate the dilemma, we propose an (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to have a linear computation complexity. We conduct studies on Mandarin and English datasets and find that our proposed model with EDSA can achieve 720% and 50% higher inference speed on the central processing unit (CPU) and GPU respectively, with almost the same performance. Thus, this method may make the deployment of such models easier when there are limited GPU resources. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.

Related Research