VALOR


Tri-Modality Pretraining Data


VALOR-1M

Characteristics

AudioVisual Captions

VALOR-1M is a large-scale tri-modality pretraining dataset consisting of one million audible videos paired with audiovisual captions. Current video-language pretraining datasets such as HowTo100M and HD_VILA_100M use ASR transcriptions as the language modality, which convey subjective opinions that are usually only weakly correlated with the videos. By contrast, the captions in VALOR-1M describe visual and audio content simultaneously, from an objective perspective, in a single fluent English sentence.

Rich Audio Concepts

All videos in VALOR-1M originate from AudioSet, currently the largest sound classification dataset, which ensures that most videos in VALOR-1M contain rich audio concepts and semantics, including but not limited to sounding objects and sound characteristics. In addition, current audio-language datasets are limited in scale (e.g., ~50K examples in AudioCaps and ~5K in ClothoV1), while VALOR-1M is ten to a hundred times larger.

High Quality Labeling

Audiovisual captions in VALOR-1M are manually annotated by a professional data labeling company, and we employed strict checking procedures to ensure annotation quality. Compared with popular vision-language datasets such as Conceptual Captions or WebVid, which use noisy alt-texts as the language modality, we believe that strongly correlated pairs of audible videos and audiovisual captions can be a solid starting point for tri-modality pretraining research.

Examples

"In the living room, a black dog lies on the sofa barking, with the sound of a police car in the background."

"The virtual character in the game is playing a horse race, he crosses obstacles, the hooves clatter and the bells rattle."

" In the music and a man's shout, the glow sticks flickered and the crowd cheered."

" In the dim crowd, a group of people were driven away in the sound of passionate music."

" Several people stretched their arms in the light music, and suddenly there was a scream."

" A person moves his mouse in a moving music and clicks on the English option to send an email. "

" The silver machine was working, and the yellow tape was rolling and buzzing."

" Amid fierce applause, a group of black men stood on the golden stage."

" A green car pulled up on the side of the road with its door open and its engine roared."

VALOR-32K

Given that currently established audiovisual-language benchmarks only target audiovisual question answering (AVQA), we establish the VALOR-32K benchmark to broaden the range of evaluation tasks. It consists of two tasks: audiovisual retrieval (AVR) and audiovisual captioning (AVC). AVC requires models to generate audiovisual captions for audible videos. In the AVR task, models are required to retrieve the best-matching video candidate for a given audiovisual caption query. Both AVR and AVC are more challenging than their vision-language counterparts (i.e., text-to-video retrieval and video captioning), due to the additional audio modality. VALOR-32K is split into 25K/3.5K/3.5K videos for training, validation, and testing, respectively. The standard evaluation metrics for video retrieval (R@1, R@5, R@10) and video captioning (BLEU4, METEOR, ROUGE, and CIDEr) are used for the AVR and AVC tasks, respectively.
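As a reference for the AVR metrics, the sketch below computes R@1/R@5/R@10 from a text-to-video similarity matrix, assuming query i's ground-truth video is video i. It is a generic recall@K computation for illustration, not the official VALOR evaluation script.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute text-to-video Recall@K from a [num_queries, num_videos]
    similarity matrix, assuming query i's ground-truth video is video i."""
    num_queries = sim.shape[0]
    # Similarity of each caption query to its ground-truth video (the diagonal).
    gt_scores = sim[np.arange(num_queries), np.arange(num_queries)]
    # Rank = number of videos scored strictly higher than the correct one.
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

# Example with random similarities for a 3.5K-video test set.
sim = np.random.randn(3500, 3500)
print(recall_at_k(sim))
```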