VALOR


Vision-Audio-Language


Tri-Modality Pretraining

What is Tri-Modality Pretraining?

Pretraining techniques (single-modality or cross-modality) aim to teach neural networks basic data distributions, semantic concepts, and modality relationships in an efficient, unsupervised way. After pretraining, the model can be applied directly to practical applications (i.e., zero-shot inference) or undergo a second, field-specific training stage (i.e., finetuning). We initiate tri-modality pretraining, which helps a model understand three modalities, vision, audio, and language, as well as the transformations among them, simultaneously. We expect tri-modality pretraining to improve related single-/dual-/tri-modality downstream tasks, bringing in more possibilities while keeping all the benefits of dual-modality (vision-language) pretraining. The two most important components of tri-modality pretraining are a reasonable tri-modality neural network design and tri-modality training data collection.
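Below is a minimal, illustrative sketch of what a tri-modality network with cross-modality alignment could look like. The linear projections, feature dimensions, and the contrastive (InfoNCE-style) objective are assumptions chosen for clarity; they are not the actual VALOR architecture or training objective.

```python
# Illustrative sketch only: three modality encoders project vision, audio, and
# text features into one shared embedding space, where cross-modality pairs
# are aligned with a contrastive loss. Not the actual VALOR design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalityEncoder(nn.Module):
    def __init__(self, vision_dim=768, audio_dim=512, text_dim=768, shared_dim=256):
        super().__init__()
        # Hypothetical per-modality projections; a real system would use
        # pretrained vision/audio/text backbones instead of linear layers.
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, vision_feat, audio_feat, text_feat):
        # Project each modality into the shared space and L2-normalize,
        # so dot products measure cross-modality similarity.
        v = F.normalize(self.vision_proj(vision_feat), dim=-1)
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, a, t


def contrastive_loss(x, y, temperature=0.07):
    # Symmetric InfoNCE-style loss: matched pairs in a batch are positives,
    # all other pairs are negatives.
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = TriModalityEncoder()
    # Dummy batch of pre-extracted features (batch size 4).
    v, a, t = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 768))
    # Align vision-text, audio-text, and fused audiovisual-text pairs.
    loss = (contrastive_loss(v, t) +
            contrastive_loss(a, t) +
            contrastive_loss(F.normalize(v + a, dim=-1), t))
    print(loss.item())
```

The point of the sketch is the shared embedding space: once all three modalities live there, vision-language, audio-language, and audiovisual-language relationships can be learned by one model.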

Why Tri-Modality Pretraining?

  • Most videos in daily-life scenarios record an audio track containing rich semantic information that is correlated with or complementary to the visual track; utilizing both vision and audio signals can greatly enhance video understanding.
  • Cross-modality pretraining techniques can stimulate neural networks to the greatest extent with the help of massive paired multimodal data. However, current research pays more attention to vision-language pretraining (VLP). By contrast, audio-language pretraining (ALP) and audiovisual-language pretraining (AVLP) are relatively less explored, yet they are also valuable and essential for improving downstream audio-language and audiovisual-language applications.
  • Compared to separately training vision-language, audio-language, and audiovisual-language pretraining models, tri-modality pretraining is designed to integrate VLP, ALP, and AVLP into a single unified framework, resulting in one powerful unified intelligent system instead of three unlinked domain-specific pretraining models, with both high efficiency and effectiveness (see the sketch after this list).
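The sketch below illustrates how one unified model could cover VLP, ALP, and AVLP by sampling a modality group at each training step, instead of training three separate models. The group names, sampling scheme, and loss dispatch are assumptions for illustration, not the actual VALOR training recipe.

```python
# Illustrative sketch only: a single model is optimized while randomly
# sampling which modality group (vision-text, audio-text, audiovisual-text)
# each training step uses. Not the actual VALOR training recipe.
import random
import torch
import torch.nn.functional as F


def step_loss(v, a, t, group, temperature=0.07):
    """Alignment loss for the sampled modality group."""
    def nce(x, y):
        logits = x @ y.t() / temperature
        targets = torch.arange(x.size(0), device=x.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    if group == "vision-text":    # VLP-style step
        return nce(v, t)
    if group == "audio-text":     # ALP-style step
        return nce(a, t)
    # AVLP-style step: align a fused audiovisual representation with text.
    return nce(F.normalize(v + a, dim=-1), t)


if __name__ == "__main__":
    for step in range(3):
        group = random.choice(["vision-text", "audio-text", "audiovisual-text"])
        # Dummy normalized embeddings standing in for encoder outputs.
        v, a, t = (F.normalize(torch.randn(4, 256), dim=-1) for _ in range(3))
        print(step, group, step_loss(v, a, t, group).item())
```

Because all three objectives update the same shared parameters, the dual-modality and tri-modality tasks reinforce one another within one system rather than being split across three unlinked models.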