Authors

* External authors

Venue

Date

Share

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada

Archontis Politis*

Parthasaarathy Sudarsanam*

Daniel Krause*

Kengo Uchida

Sharath Adavann*

Aapo Hakala*

Yuichiro Koyama*

Naoya Takahashi

Shusuke Takahashi*

Tuomas Virtanen*

Yuki Mitsufuji*

* External authors

NeurIPS 2023

2023

Abstract

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.

Related Publications

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

ICASSP, 2024
Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji*

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between re…

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

TMLR, 2024
Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji*

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity recon…

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

ICASSP, 2024
Eleonora Grassucci*, Yuki Mitsufuji*, Ping Zhang*, Danilo Comminiello*

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between re…

  • HOME
  • Publications
  • STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.