Pretraining techniques (single-modality or cross-modality) aim to teach neural networks basic data distributions, semantic concepts, and inter-modality relationships in an efficient unsupervised way. After pretraining, the model can be applied directly to practical applications (i.e., zero-shot inference) or undergo domain-specific second-stage training (i.e., fine-tuning). We initiate tri-modality pretraining, which helps the model simultaneously understand three modalities, namely vision, audio, and language, as well as the transformations among them. We expect tri-modality pretraining to improve related single-, dual-, and tri-modality downstream tasks, i.e., to bring in more possibilities while keeping all the benefits of dual-modality (vision-language) pretraining. The two most important components of tri-modality pretraining are a reasonable tri-modality neural network design and tri-modality training data collection.
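To make the network-design component concrete, the sketch below shows one common pattern for tri-modality models: three modality-specific encoders projecting into a single shared embedding space, whose pairwise similarities can drive a contrastive pretraining objective. This is a minimal illustrative sketch, not the paper's actual architecture; the linear "encoders", dimensions, and NumPy implementation are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, embed_dim):
    """Return a toy linear 'encoder' projecting one modality into the shared space.
    (Hypothetical stand-in for a real vision/audio/text backbone.)"""
    W = rng.normal(scale=0.02, size=(in_dim, embed_dim))
    def encode(x):
        z = x @ W
        return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize embeddings
    return encode

# Three modality-specific encoders mapping into one shared 128-d embedding space.
encode_vision = make_encoder(in_dim=512, embed_dim=128)
encode_audio = make_encoder(in_dim=256, embed_dim=128)
encode_text = make_encoder(in_dim=300, embed_dim=128)

batch = 4
v = encode_vision(rng.normal(size=(batch, 512)))
a = encode_audio(rng.normal(size=(batch, 256)))
t = encode_text(rng.normal(size=(batch, 300)))

# Pairwise similarity matrices a contrastive objective would use: matching
# (vision, audio, text) triples should score higher than mismatched ones.
sim_va = v @ a.T
sim_vt = v @ t.T
sim_at = a @ t.T
print(sim_va.shape, sim_vt.shape, sim_at.shape)  # each (4, 4)
```

The shared-space design lets any pair (or all three) of modalities be aligned with one set of encoders, which is what allows a single pretrained model to serve single-, dual-, and tri-modality downstream tasks.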