New AI model shows how machines can learn from vision, language and sound together


Most of us have watched TV with the sound off at one time or another. Although it is usually possible to follow the story at least to some extent, the lack of an audio track limits our ability to fully appreciate the events.

Similarly, we miss a lot of information when we can only hear the sounds coming from another room. Multimodality, the combination of images, sound and other details, greatly improves our understanding of what is happening, whether on television or in the real world.

This seems to apply to artificial intelligence as well. A new question-answering model called MERLOT RESERVE demonstrates strong out-of-the-box prediction, revealing a robust multimodal understanding of the world. It was recently developed by a team from the Allen Institute for Artificial Intelligence (AI2), the University of Washington and the University of Edinburgh.

Part of a new generation of AI applications that enable semantic search, analysis and Q&A, the system was trained by “watching” 20 million YouTube videos. The capabilities it demonstrates are already being commercialized by startups such as Twelve Labs and Clipr.

MERLOT RESERVE (shortened to RESERVE) stands for Multimodal Event Representation Learning Over Time, with Re-entrant Supervision of Events, and builds on the team’s earlier MERLOT model. It was pretrained on millions of videos, using the combined input of image frames, audio and their transcripts. Individual frames let the system learn spatial information, while video-level training provides temporal information and teaches it about the relationships between elements that change over time.
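
The released system is considerably more elaborate, but a rough sketch of this idea, encoding individual frames for spatial information and then running a joint transformer over frame, audio and transcript features for temporal context, might look like the following. This is a hypothetical illustration in PyTorch; the module names, feature sizes and architecture details are placeholders, not the actual MERLOT RESERVE code.

    # Hypothetical sketch (PyTorch): per-frame encoding for spatial information,
    # then a joint transformer over frame, audio and transcript features for
    # temporal context. Names and sizes are illustrative placeholders, not the
    # released MERLOT RESERVE code.
    import torch
    import torch.nn as nn

    class JointVideoEncoder(nn.Module):
        def __init__(self, dim=256, vocab=30000):
            super().__init__()
            # Per-frame encoder: spatial information within each frame.
            self.frame_encoder = nn.Sequential(
                nn.Linear(3 * 64 * 64, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.audio_encoder = nn.Linear(128, dim)      # pooled audio features per segment
            self.text_embed = nn.Embedding(vocab, dim)    # transcript tokens
            self.pos_embed = nn.Parameter(torch.zeros(1, 512, dim))
            # Transformer over the whole sequence: temporal relationships
            # between elements that change over time.
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.joint_transformer = nn.TransformerEncoder(layer, num_layers=4)

        def forward(self, frames, audio, tokens):
            # frames: (B, T, 3, 64, 64), audio: (B, T, 128), tokens: (B, L)
            B, T = frames.shape[:2]
            f = self.frame_encoder(frames.reshape(B * T, -1)).view(B, T, -1)
            a = self.audio_encoder(audio)
            t = self.text_embed(tokens)
            seq = torch.cat([f, a, t], dim=1)             # one joint multimodal sequence
            seq = seq + self.pos_embed[:, : seq.size(1)]
            return self.joint_transformer(seq)

    # Toy usage: 2 clips, 8 frames each, 16 transcript tokens.
    enc = JointVideoEncoder()
    out = enc(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 128),
              torch.randint(0, 30000, (2, 16)))
    print(out.shape)   # torch.Size([2, 32, 256])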

“AI processing is different from human processing,” said computer scientist and project lead Rowan Zellers. “But there are some general principles that are going to be hard to avoid if we want to build robust AI systems. I think multimodality is definitely in this bucket.”

Rowan Zellers, a researcher at the University of Washington and the Allen Institute for Artificial Intelligence.

Because we live in a dynamic world, the team wanted to explore how to build machines that combine vision, language and sound. In one example from the paper, someone is shown making popcorn. From the images and dialogue alone we can imagine the sounds that accompany them: the rattle of raw kernels against the metal surface of the pot eventually giving way to energetic “pops” as they burst into tasty white popcorn.

Such prediction is known as “learning from reentry,” in which time-locked correlations allow one modality to educate the others. Some developmental psychologists have hypothesized that this is how we acquire visual and world knowledge, often without a teacher. It is also the basis of the RESERVE name: Re-entrant Supervision of Events.

The model is trained on 40-second video segments in which snippets of text and audio are “masked” from the system. RESERVE then learns by selecting the correct masked-out snippet from four multiple-choice options. It is then asked to select from four possible rationales to justify its answer.
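
Conceptually, this objective can be framed as contrastive multiple-choice selection: the model’s representation of the masked position is scored against encodings of the candidate snippets, and training pushes it toward the correct one. The sketch below illustrates that scoring step under simplifying assumptions; the inputs and the temperature value are hypothetical stand-ins, not the paper’s actual implementation.

    # Sketch of the masked-snippet selection objective described above: score
    # four candidate snippets against the representation of the masked position
    # and train with cross-entropy on the index of the correct one. The inputs
    # here are random stand-ins, not features from the actual model.
    import torch
    import torch.nn.functional as F

    def selection_loss(masked_repr, candidate_reprs, correct_idx, temperature=0.05):
        # masked_repr:     (B, D)    representation of the masked text/audio span
        # candidate_reprs: (B, 4, D) encodings of the four multiple-choice snippets
        # correct_idx:     (B,)      index of the true snippet
        masked = F.normalize(masked_repr, dim=-1).unsqueeze(1)   # (B, 1, D)
        cands = F.normalize(candidate_reprs, dim=-1)             # (B, 4, D)
        logits = (masked * cands).sum(-1) / temperature          # (B, 4) similarity scores
        return F.cross_entropy(logits, correct_idx)

    # Toy usage with random features (batch of 2, hidden size 256).
    loss = selection_loss(torch.randn(2, 256), torch.randn(2, 4, 256),
                          torch.tensor([1, 3]))
    print(float(loss))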

This approach allowed RESERVE not only to achieve state-of-the-art results from its self-supervised training, but also to make strong zero-shot predictions. In this setting, a zero-shot task could be a question such as “What is the person doing?” This can be rephrased, manually or automatically, as a statement like “The person [MASK].” The model then predicts the best completion from a set of provided options, such as “cooks popcorn” or “eats popcorn.”
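
To make the zero-shot setup concrete: the question is rewritten as a masked statement, and each provided option is scored as a possible fill for the mask. The following sketch shows that flow with placeholder encoders that return random vectors; in a real system they would be the pretrained multimodal model run over the video and statement.

    # Sketch of the zero-shot setup: rewrite the question as a masked statement
    # and rank the answer options by how well they fit the mask. The encoders
    # below are hypothetical placeholders returning random vectors; a real
    # system would use the pretrained multimodal model over the video as well.
    import torch
    import torch.nn.functional as F

    def encode_statement(statement):
        torch.manual_seed(hash(statement) % (2 ** 31))           # placeholder encoder
        return torch.randn(256)

    def encode_option(option):
        torch.manual_seed(hash(option) % (2 ** 31))              # placeholder encoder
        return torch.randn(256)

    def zero_shot_answer(statement, options):
        mask_vec = F.normalize(encode_statement(statement), dim=-1)
        scores = [float(F.normalize(encode_option(o), dim=-1) @ mask_vec) for o in options]
        return options[scores.index(max(scores))]

    # "What is the person doing?" becomes a masked statement plus candidate fills.
    print(zero_shot_answer("The person [MASK].",
                           ["cooks popcorn", "eats popcorn"]))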

RESERVE was fine-tuned on several large datasets used for cognition-level visual understanding: VCR, TVQA and Kinetics-600. It demonstrated state-of-the-art performance, improving on previous results by 5%, 7% and 1.5%, respectively. By incorporating audio, the model achieves 91.1% accuracy on Kinetics-600.

VCR (Visual Commonsense Reasoning) is a large-scale dataset, without audio, used for cognition-level visual understanding. TVQA is a large-scale video question-answering dataset based on six popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey’s Anatomy and Castle). Finally, Kinetics-600 is a collection of 650,000 video clips covering hundreds of human action classes.

According to the research paper, presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition in June, RESERVE shows significant improvements over competing models. For example, it requires one-fifth of the floating-point operations used by the VisualBERT multimodal model.

The project team expects pretrained video models to one day assist users who are deaf or hard of hearing, or to help extract insights about video-watching trends. However, they also acknowledge that the datasets used to train RESERVE introduce unavoidable biases that need to be addressed.

Beyond spoken words, audio can provide a great deal of additional contextual information. This should come as no surprise given our own experience, but interestingly, it can also significantly improve AI performance. This may be because aligning the additional modalities exposes new statistical correlations.

“Audio is a lot of things. It’s not just speech, but also sound effects, and listening to these sound effects improves your understanding of the world,” Zellers noted.

“Another thing is tone of voice, the dynamics of human communication. If you just look at the words, without the audio context, you lose a lot. But if someone says those words with a specific emotion, the model can do much better. And in fact, we see that it works.”

MERLOT and RESERVE are part of AI2’s Mosaic team, which focuses on building systems with machine common sense. Machine common sense has been an area of interest in artificial intelligence for decades: the ability to reason about and predict real-world relationships between different objects and processes would make our AI tools far more useful to us.

However, simply loading a system with a set of facts and rules about how the world works and expecting it to reason is not enough; the world is far too complex for that. Instead, from the moment we are born, we interact with our environment through our various senses, gradually building an understanding of what is happening in the world and why. Some machine commonsense projects take this approach. For MERLOT and RESERVE, each additional modality provides additional information, just as our senses do.

“I think in the medium and long term, what really excites me is AI that communicates with us through different modalities, like audio and gestures, so it can be connected to what we’re doing,” Zellers said.

The authors of the paper “MERLOT RESERVE: Neural Script Knowledge through Vision, Language and Sound” are Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati and Jack Hessel. A demo of RESERVE is available from AI2.
