Non-Parallel Voice Conversion Using Weighted Generative Adversarial Networks

Block Diagram


In this paper, we propose a novel way to train a generative adversarial network (GAN) for non-parallel, many-to-many voice conversion (VC). The goal of VC is to transform speech from a source speaker so that it sounds like a target speaker without changing the phonetic content. Based on ideas from game theory, we propose multiplying the gradient of the Generator by suitable weights. The weights are calculated so that they increase the influence of fake samples that fool the Discriminator, resulting in a stronger Generator. Motivated by StarGAN-VC, a recently presented GAN-based approach to VC, we propose a variation of StarGAN referred to as Weighted StarGAN (WeStarGAN). The experiments are conducted on the standard CMU ARCTIC database. The WeStarGAN-VC approach achieves significantly better relative performance and is clearly preferred over the recently proposed StarGAN-VC method in terms of subjective speech quality and speaker similarity, with 75% and 65% preference scores, respectively.
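
The weighting idea can be illustrated with a minimal sketch. The abstract does not give the weight formula, so the exponential (softmax) weighting, the `eta` parameter, and the function names below are assumptions chosen for illustration, not the paper's actual rule. The sketch only shows the general mechanism: fake samples that the Discriminator scores as more realistic receive larger weights in the Generator loss, so their gradients contribute more to the Generator update.

```python
import math

def fooling_weights(d_probs, eta=1.0):
    """Hypothetical weighting: softmax over Discriminator outputs.

    d_probs[i] is D(G(z_i)), the Discriminator's probability that fake
    sample i is real. A higher value means the sample fools D more, so it
    gets a larger weight. eta controls how sharply weights concentrate.
    """
    m = max(d_probs)  # subtract the max for numerical stability
    exps = [math.exp(eta * (p - m)) for p in d_probs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_generator_loss(d_probs, weights):
    """Weighted non-saturating Generator loss: -sum_i w_i * log D(G(z_i)).

    If the weights are treated as constants during backpropagation, the
    gradient of this loss is the weighted sum of the per-sample gradients.
    """
    return -sum(w * math.log(p) for w, p in zip(weights, d_probs))

# Three fake samples: the first fools the Discriminator most.
d_probs = [0.9, 0.5, 0.1]
weights = fooling_weights(d_probs, eta=2.0)
loss = weighted_generator_loss(d_probs, weights)
```

Setting `eta=0` in this sketch gives uniform weights, which recovers the ordinary unweighted batch average of the Generator loss.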

In Interspeech 2019