The researchers developed predictive-vision software that uses machine learning to anticipate which actions are likely to follow a given set of video frames. They fed the program 600 hours of video from YouTube and popular TV shows such as The Big Bang Theory, The Office and Desperate Housewives, then tested whether it could predict if two people would shake hands, high-five, kiss or hug. In a second task, the algorithm anticipates which objects are likely to appear in a video five seconds later. It searches for patterns and recognizable objects such as human faces and hands.
While human greetings may seem like arbitrary actions to predict, the task served as a more easily controllable test case for the researchers to study. “Humans automatically learn to anticipate actions through experience, which is what made us interested in trying to imbue computers with the same sort of common sense,” said Carl Vondrick, PhD student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). “We just wanted to use random videos from YouTube,” Vondrick said. “The reason for television is that it’s easy for us to get access to that data, and it’s somewhat realistic in terms of describing everyday situations.”
The researchers showed the computer videos of people who were one second away from performing one of four actions: hugging, kissing, high-fiving or shaking hands. The algorithm guessed correctly 43 per cent of the time, compared with 36 per cent for existing algorithms. Even humans make mistakes on these tasks: human subjects correctly predicted the action only 71 per cent of the time, the researchers said.
Although it will be a long time before the algorithm is put to practical use, the researchers say that future, more sophisticated versions could be applied in a range of fields.
Computer systems that predict actions would open up new possibilities ranging from robots that can better navigate human environments, to emergency response systems that predict falls, to virtual reality headsets that feed you suggestions for what to do in different situations.
