Scanning several videos on the same how-to topic, a computer finds the
instructions they have in common and combines them into one step-by-step series.
(December 20, 2015) When you hire new workers, you might sit them down to
watch an instructional video on how to do the job. What happens when you buy
a new robot?
Cornell researchers are teaching robots to watch
instructional videos and derive a series of step-by-step instructions to
perform a task. You won’t even have to turn on the DVD player; the robot can
look up what it needs on YouTube. The work is aimed at a future when we may
have “personal robots” to perform everyday housework – cooking, washing dishes,
doing the laundry, feeding the cat – as well as to assist the elderly and
people with disabilities.
The researchers call their project “RoboWatch.” Part of what
makes it possible is that there is a common underlying structure to most how-to
videos. And there’s plenty of source material available. YouTube offers
180,000 videos on “How to make an omelet” and 281,000 on “How to tie a bowtie.”
By scanning multiple videos on the same task, a computer can find what they all
have in common and reduce that to simple step-by-step instructions in natural
language.
Why do people post all these videos? “Maybe to help people
or maybe just to show off,” said graduate student Ozan Sener, lead author of a
paper on the video parsing method presented Dec. 16 at the International
Conference on Computer Vision in Santiago, Chile. Sener collaborated with colleagues at Stanford University, where he
is currently a visiting researcher.