Humans intuitively understand and reason across multiple types of data, such as images, video, and text, which are often represented by vectors that follow very different distributions. Integrating multimodal data to understand the world is critical to advancing general AI.

Zhiqian Chen is a Ph.D. candidate in the Department of Computer Science at Virginia Tech, focusing on AI and interdisciplinary research.