I use OCR on a regular basis: Optical Character Recognition. I
scan in articles from dead-tree magazines and the computer turns
the scanned image back into words that I can later search [1]. OCR
is mature enough that it works well.
I've been thinking about the great weight of information (aka
data) that is bearing down upon us. I'm definitely feeling it.
The tension is that it feels like you can't ignore it without
throwing the baby out with the bathwater. I have some techniques
that work, but they are, at best, just a part of the solution:
they mitigate the problem rather than solve it.
There are a few reasons that this problem seems compelling to
me:
- I think that we have yet to see the worst of this problem -
it's going to continue to get worse (as we have access to more data
in more places, more easily)
- It's a tough problem to solve
Particularly if you want to solve it in a way that works for a
majority of people.
- I think that it is going to be a big problem as it gets solved.
This might mean that there is some good opportunity here, I'm not
sure.
This isn't a problem that can be solved by technology alone.
It's going to require a mix of:
- Processes
These may be "lifehacks"-style things, such as what I do (I'll
describe it further in upcoming blog posts, stay tuned).
- Practices
To some extent, we'll just have to adjust our expectations about
how quickly we can review things.
- Technology
We do need some smart technology to help. I think that we have
some of the tools available (by tools I mean thoughts, ideas,
and algorithms), but I haven't seen anything that implements them
well in a comprehensive fashion yet.
In a recent video teaser I saw, Jeffrey Veen (formerly of
Google) shared a great statistic (video below):
Every minute of the day, ten hours of video gets uploaded to
YouTube.
If you wonder, even for a second, whether there is a problem of
too much data for people to keep up with, this should help you see
the truth.
How much data?
Let's extrapolate out from that statistic. Most of that 10 hours
/ minute is not interesting or useful to you (or me).
But how much is?
Maybe 1%. But let's use 1/100 of 1% (0.01%) to calculate from. (I
prefer being pessimistic in my back-of-napkin calculations.)
Also, we'll let that 10 hours of video stand in for "10 hours of
content" - anything that might be interesting, video or not -
across the web. In that case, what percentage of all content does
YouTube account for? For me, maybe 1%. Again, let's be pessimistic
and call it 10%.
So, if we have:
- 10 hours every minute, that is 14,400 hours of video
daily
(feel free to follow along at home on your calculators)
- Saying that 0.01% is interesting, that is 1.44 hours, or about
86.4 interesting minutes daily (14,400 x 0.0001)
- If this is just 10% of all the potentially interesting
information online (across all forms, not just video), then we have
864 minutes, or 14.4 hours of new content daily that might
be useful to us.
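For anyone following along without a calculator, here is the same back-of-napkin arithmetic as a small Python sketch. Every constant is one of the guesses made above (the 0.0001 factor and the 10% share are assumptions, not measurements), and the function and key names are made up for illustration.

```python
# Back-of-napkin estimate. Every number here is a guess from the post,
# not a measurement.
UPLOAD_HOURS_PER_MINUTE = 10      # the YouTube statistic quoted above
INTERESTING_FRACTION = 0.0001     # the factor used in the calculation above
YOUTUBE_SHARE_OF_CONTENT = 0.10   # YouTube's assumed share of all content


def daily_estimate(upload_hours_per_minute, interesting_fraction, youtube_share):
    """Return the daily figures (in hours/minutes) for the given guesses."""
    uploaded_hours = upload_hours_per_minute * 60 * 24
    interesting_youtube_hours = uploaded_hours * interesting_fraction
    return {
        "uploaded_hours": uploaded_hours,
        "interesting_youtube_minutes": interesting_youtube_hours * 60,
        # Scale up from YouTube alone to "all content" on the web.
        "interesting_total_hours": interesting_youtube_hours / youtube_share,
        "total_created_hours": uploaded_hours / youtube_share,
    }


estimate = daily_estimate(
    UPLOAD_HOURS_PER_MINUTE, INTERESTING_FRACTION, YOUTUBE_SHARE_OF_CONTENT
)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Changing any one guess by a factor of ten moves the final answer by the same factor, so the result is only as good as the guesses.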
14 hours poses a problem for me, for 2 reasons:
- I don't spend 14 hours daily finding information
- Even if I did, that's just the time it takes to read/watch the
information, not to find it (it would take some time to find the
14.4 interesting hours amidst the 144,000 created)
Let's also ignore the problem of falling behind (by say, going
on vacation and finding 0 hours of content for days on end).
Arguably, we don't need to find all the useful information.
That's true. Google is finding it and we can later search it. And
that is a great start on the problem. Google has done an excellent
job of making search just plain work.
But there are a number of areas in which Google currently does
poorly. One of those is video. We need a new VCR.
VCR
OCR stands for Optical Character Recognition. Similarly, let us
repurpose the VCR acronym to mean Video Content Recognition.
Have you ever tried to do a search for videos with Google? Let's
try this video as a sample. Google "veen data overload." Answer
right on top. Great! How about "veen youtube stastic." Also near
the top (#3 for me). That's pretty good.
"So, what's the problem?" you ask.
Well, how did Google find that? Did you scroll down the page on
those links? There is a transcript. Google read the transcript.
Google is great at reading text - that's what it does. But Google
did not read the video - it has next to zero information on what is
in the video [2]. That's the problem: it was helped in this case,
but this case is unusual.
Does most video uploaded on YouTube have a transcript with it?
Nope. So Google knows very little about the content of that
video.
- On YouTube: What about Randy Pausch's last lecture?
If you search for "childhood dream nfl" you do see that Google
knows that information is in the lecture (currently, spot #1 is
someone blogging about it and spot #7 is a Wikipedia page
referencing it).
Google has that video, yet it has no idea that the video
contains that information.
The following searches (also parts of the lecture) don't turn up
that YouTube video on the first page of results:
- "disney imagineer"
(spot 10 is a link to thedisneyblog.com referencing the
video)
- "winning stuffed animals"
(the entire content of one of the slides in the lecture)
- Another lecture on YouTube: Merlin Mann's excellent "Inbox Zero"
- "time attention finite"
No results - though this is a major topic in the video (6:10 -
9:15) as well as a title in one of the slides
- "do email less"
Again, a major point & a complete slide.
I've picked examples where I think the failure should be easier to
see. It gets harder from here.
- They are on YouTube - right in Google's backyard.
- There is a lot of activity on the web where people will make
the link from the content & text to the video for you - but the
results aren't there.
- I've done searches for slides - if they ran OCR on even one frame
every few seconds of video, they'd have this.
We have the technology to do this, and the computing power to do it
is cheap. This seems like a simple start.
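To make the "OCR every few seconds" idea concrete, here is a minimal Python sketch. It assumes a hypothetical `ocr_frame` function that returns whatever text is visible in the frame at a given second. A real version would need a video decoder and an OCR engine, neither of which is shown here, and the slide text and the second time range in `fake_ocr` are invented stand-ins. I have no idea how Google would actually build this.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_text_index(
    duration_s: int, every_s: int, ocr_frame: Callable[[int], str]
) -> Dict[str, List[int]]:
    """OCR one frame every `every_s` seconds; map each word to its timestamps."""
    index = defaultdict(list)
    for t in range(0, duration_s, every_s):
        for word in set(tokenize(ocr_frame(t))):
            index[word].append(t)
    return dict(index)


def search(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return the timestamps whose frame text contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


def fake_ocr(t: int) -> str:
    """Hypothetical stand-in for real OCR: invented slide text by time range."""
    if 370 <= t < 555:   # roughly 6:10 - 9:15
        return "Time and attention are finite"
    if 600 <= t < 660:   # invented range for a second slide
        return "Do email less"
    return ""


index = build_text_index(duration_s=900, every_s=5, ocr_frame=fake_ocr)
print(search(index, "time attention finite")[:3])
print(search(index, "do email less")[:3])
```

The index is tiny compared with the video itself: a few words per sampled frame, which is what would let a text search engine treat a slide like any other page.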
I have no doubt that Google is working on this particular
"issue." But that's only one of the problems with the data overload
that we are just beginning to recognize.
Update Oct 22: I discovered that on Sept 15 (11 days after this
post), they dropped a public beta of "Audio Indexing" in Google
Labs. You can search based on the words spoken in a video. Check it
out here: http://labs.google.com/gaudi.
Video
Jeff Veen on data overload from Jeffrey Zeldman on Vimeo.
Links
- [1] I actually use Microsoft Office's
OneNote for this and it works excellently. Evernote promises to do
this as well, but sadly, can't deliver.
- [2] Google, I believe, does read meta
information about images & video. In this case, there is none.
Neither is the video filename useful (which it also could "read").
In this case it is
"moogaloop.swf?clip_id=1206306&server=www.vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1",
which is not particularly helpful.
I say "next to" zero because Google has read the text
below, and it is smart enough to surmise that the video may have
something to do with the text, although the particular relationship
is likely foggy, at best, to Google.