BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Takashi Shibuya

Yuhta Takida

Yuki Mitsufuji*

* External authors

ICASSP-2024

2024

Abstract

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. Slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, has been shown in the literature to be effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify the least-squares GAN, which most GAN-based vocoders adopt, so that its loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications.
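
For readers unfamiliar with the baseline that the abstract refers to, the standard least-squares GAN objectives can be sketched as follows. This is a minimal illustration of the well-known least-squares formulation only, not the modified loss proposed in the paper; the function names and the plain-list inputs are assumptions made for this example, and the SAN-specific changes are deliberately not shown.

```python
from statistics import fmean


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Standard least-squares discriminator loss (illustrative names).

    Pushes discriminator outputs on real samples toward 1
    and outputs on generated samples toward 0.
    """
    real_term = fmean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = fmean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Standard least-squares generator loss.

    Pushes discriminator outputs on generated samples toward 1.
    """
    return fmean([(s - 1.0) ** 2 for s in fake_scores])


if __name__ == "__main__":
    # Hypothetical discriminator outputs for a batch of two samples each.
    real = [0.5, 1.5]
    fake = [0.5, -0.5]
    print(lsgan_discriminator_loss(real, fake))
    print(lsgan_generator_loss(fake))
```

In a vocoder, the scores would come from a discriminator network applied to real and synthesized waveforms; plain Python lists are used here only to keep the sketch self-contained.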

Related Publications

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

TMLR, 2024
Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji*

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity recon…

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

ICASSP, 2024
Eleonora Grassucci*, Yuki Mitsufuji*, Ping Zhang*, Danilo Comminiello*

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

ICASSP, 2024
Hao Shi*, Kazuki Shimada, Masato Hirano*, Takashi Shibuya, Yuichiro Koyama*, Zhi Zhong*, Shusuke Takahashi*, Tatsuya Kawahara*, Yuki Mitsufuji*

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the compl…
