VALOR-1M is a large-scale tri-modality pretraining dataset consisting of one million audible videos paired with audiovisual captions. Current video-language pretraining datasets such as HowTo100M and HD-VILA-100M take ASR transcriptions as the language modality, which convey subjective opinions that are usually only weakly correlated with the videos. By contrast, captions in the VALOR-1M dataset objectively describe both visual and audio contents in a single fluent English sentence.
All videos in VALOR-1M originate from the AudioSet dataset, currently the largest sound classification dataset, which ensures that most videos in VALOR-1M contain rich audio concepts and semantics, including but not limited to sounding objects and sound characteristics. In addition, current audio-language datasets are limited to the thousands scale (e.g., ~50K examples in AudioCaps and ~5K examples in ClothoV1), while VALOR-1M is tens to hundreds of times larger.
Audiovisual captions in VALOR-1M are manually annotated by a professional data labeling company, and we have employed strict quality-check procedures to ensure annotation quality. Compared with popular vision-language datasets such as Conceptual Captions or WebVid, which take noisy alt-texts as the language modality, we believe strongly correlated pairs of audible videos and audiovisual captions can be a solid starting point for tri-modality pretraining research.
Given that currently established audiovisual-language benchmarks only target audiovisual question answering (AVQA), we establish the VALOR-32K benchmark to broaden the range of evaluation tasks; it consists of two tasks: audiovisual retrieval (AVR) and audiovisual captioning (AVC). AVC requires models to generate audiovisual captions for audible videos. In the AVR task, models are required to retrieve the most matching video candidate given audiovisual caption queries. Both AVR and AVC are more challenging than their vision-language counterparts (i.e., text-to-video retrieval and video captioning), due to the additional introduction of the audio modality. VALOR-32K is split into 25K/3.5K/3.5K videos for training, validation, and testing, respectively. The same evaluation metrics as video retrieval (R@1, R@5, R@10) and video captioning (BLEU-4, METEOR, ROUGE, and CIDEr) are used to evaluate the AVR and AVC tasks, respectively.
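As a minimal sketch (not the official evaluation code), the Recall@K metrics reported for AVR can be computed from a text-video similarity matrix as follows; the matrix shape, the assumption that query i's ground-truth video is candidate i, and the variable names are illustrative assumptions rather than details from the benchmark.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)):
    """Compute text-to-video Recall@K from a [num_queries, num_videos]
    similarity matrix, assuming query i's ground-truth video has index i."""
    # Sort candidates by descending similarity for each caption query.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth video for each query (0 = retrieved first).
    gt_rank = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    # R@K = percentage of queries whose ground truth is within the top K.
    return {f"R@{k}": float(np.mean(gt_rank < k)) * 100 for k in ks}

# Hypothetical usage: scores[i, j] is a model's similarity score between
# caption query i and video candidate j on the 3.5K-video test split.
scores = np.random.rand(3500, 3500)
print(recall_at_k(scores))  # e.g. {'R@1': ..., 'R@5': ..., 'R@10': ...}
```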