Intro to Data Overwhelmness: The next VCR

I use OCR on a regular basis: Optical Character Recognition. I scan in articles from dead-tree magazines and the computer turns the scanned image back into words that I can later search[1]. OCR is mature enough that it works well.
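
For the curious, that OCR step looks roughly like this if you wire it up yourself with the open-source Tesseract engine (via the pytesseract and Pillow Python packages) rather than letting OneNote handle it - a sketch, not my actual setup:

    # Rough sketch of "scan in a page, get searchable text back."
    # Assumes Tesseract is installed, plus the pytesseract and Pillow packages;
    # "scan.png" stands in for whatever your scanner produces.
    from PIL import Image
    import pytesseract

    page = Image.open("scan.png")
    text = pytesseract.image_to_string(page)

    # Save the recognized words alongside the image so they can be searched later.
    with open("scan.txt", "w") as out:
        out.write(text)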

I've been thinking about the great weight of information (aka data) that is bearing down upon us. I'm definitely feeling it. And the tension is that it feels like you can't ignore it without throwing the baby out with the bathwater. I think I have some techniques that work, but they are, at best, just a part of the solution: they don't really solve the problem, they just mitigate it.

There are a few reasons that this problem seems compelling to me:

  1. I think that we have yet to see the worst of this problem - it's going to continue to get worse (as we have access to more data in more places, more easily)
  2. It's a tough problem to solve
    Particularly if you want to solve it in a way that works for a majority of people.
  3. I think that it is going to be a big problem as it gets solved. This might mean that there is some good opportunity here; I'm not sure.

This isn't a problem that can be solved by technology alone. It's going to require a mix of:

  1. Processes
    These may be "lifehacks"-style things, such as what I do (I'll describe mine further in upcoming blog posts - stay tuned)
  2. Practices
    To some extent, we'll just have to adjust our expectations about how quickly we can review things.
  3. Technology
    We do need some smart technology to help. I think that we have some of the tools available (and by tools I mean thoughts, ideas, and algorithms), but I haven't seen anything that implements them well in a comprehensive fashion yet.

In a recent video teaser I saw, Jeffrey Veen (formerly of Google) shared a great statistic (video below):

Every minute of the day, ten hours of video gets uploaded to YouTube.

If you wonder, even for a second, whether there is a problem of too much data for people to keep up with, this should help you see the truth.

How much data?

Let's extrapolate out from that statistic. Most (or more) of that 10 hours / minute is not interesting or useful to you.

But how much is?

Maybe 1%. But let's use 1/100 of 1% (0.01%) to calculate from. (I prefer being pessimistic in my back-of-napkin calculations.)

Also, we'll let that 10 hours of video stand in for "10 hours of content" - anything that might be interesting, video or not - across the web. In that case, what percentage of all content does YouTube account for? For me, maybe 1%. Again, let's be pessimistic and call it 10%.

So, if we have:

  • 10 hours every minute, that is 14,400 hours of video daily
    (feel free to follow along at home on your calculators, or with the snippet after this list)
  • Saying that 0.01% of it is interesting gives us about 86.4 interesting minutes daily (14,400 hours x 0.0001 = 1.44 hours)
  • If this is just 10% of all the potentially interesting information online (across all forms, not just video), then we have 864 minutes, or 14.4 hours of new content daily that might be useful to us.
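
If you'd rather let a computer do the multiplying, here's the same back-of-napkin math as a few lines of Python (the percentages are my pessimistic guesses from above, not measured data):

    # Back-of-napkin math with my pessimistic guesses baked in:
    # 0.01% of video is interesting, and YouTube is 10% of all content.
    HOURS_UPLOADED_PER_MINUTE = 10            # Veen's YouTube statistic
    MINUTES_PER_DAY = 24 * 60

    video_hours_daily = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY  # 14,400 hours
    interesting_fraction = 0.0001                                    # 1/100 of 1%
    youtube_share = 0.10                                             # YouTube as 10% of all content

    interesting_youtube_minutes = video_hours_daily * interesting_fraction * 60
    interesting_total_hours = (interesting_youtube_minutes / youtube_share) / 60

    print(round(interesting_youtube_minutes, 1))  # 86.4 minutes/day from YouTube alone
    print(round(interesting_total_hours, 1))      # 14.4 hours/day across the whole web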

14.4 hours a day poses a problem for me, for two reasons:

  1. I don't spend 14 hours daily finding information
  2. Even if I did, that's just the time it takes to read/watch the information, not to find it (it would take some time to find the 14.4 interesting hours amidst the 144,000 hours created)

Let's also ignore the problem of falling behind (by, say, going on vacation and getting through 0 hours of content for days on end).

Arguably, we don't need to find all the useful information. That's true. Google is finding it and we can search it later. And that is a great start on the problem. Google has done an excellent job of making search just plain work.

But there are a number of areas in which Google currently does poorly. One of those is video. We need a new VCR.

VCR

OCR stands for Optical Character Recognition; similarly, let's repurpose the VCR acronym to mean Video Content Recognition.

Have you ever tried to do a search for videos with Google? Let's try this video as a sample. Google "veen data overload." The answer is right on top. Great! How about "veen youtube statistic"? Also near the top (#3 for me). That's pretty good.

"So, what the problem?" you ask.

Well, how did Google find that? Did you scroll down the page on those links? There is a transcript. Google read the transcript. Google is great at reading text - that's what it does. But Google did not read the video - it has next to zero information on what is in the video[2]. That's the problem: it was helped in this case, but this case is unusual.

Does most of the video uploaded to YouTube have a transcript with it? Nope. So Google knows very little about what's in most of it.

  • On YouTube: What about Randy Pausch's last lecture?
    If you search for "childhood dream nfl" you do see that Google knows that information is in the lecture (currently, spot #1 is someone blogging about it and spot #7 is a Wikipedia page referencing it).
    But Google has that video, and yet it has no idea that the video itself contains that information.
    The following searches (also parts of the lecture) don't register that YouTube video in the first page of results:
    • "disney imagineer"
      (spot 10 is a link to thedisneyblog.com referencing the video)
    • "winning stuffed animals"
      (the entire content of one of the slides in the lecture)
  • Another lecture on YouTube: Merlin Mann's excellent "Inbox Zero"
    • "time attention finite"
      No results - though this is a major topic in the video (6:10 - 9:15) as well as a title in one of the slides
    • "do email less"
      Again, a major point & a complete slide.

I've picked examples where I think the failure should be easy to see - it only gets harder from here:

  1. They are on YouTube - right in Google's backyard.
  2. There is a lot of activity on the web where people make the link from the content & text to the video for you - but the video still doesn't show up in the results.
  3. I've done searches for the text of slides - if they just ran OCR over a frame every few seconds of video, they'd have this.
    We have the technology to do this, and the computing power to do it is cheap. This seems like a simple start (a toy sketch of what I mean follows this list).
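
To make that last point concrete, here's a toy sketch of "run OCR over a frame every few seconds" - using the open-source OpenCV and Tesseract (pytesseract) libraries and a made-up filename as stand-ins for whatever Google would actually build at scale:

    # Toy sketch: sample a frame every few seconds and OCR it, so that
    # slide text becomes searchable. Assumes OpenCV (cv2), Pillow, and
    # pytesseract are installed; "lecture.mp4" is a made-up filename.
    import cv2
    import pytesseract
    from PIL import Image

    SAMPLE_EVERY_SECONDS = 5

    video = cv2.VideoCapture("lecture.mp4")
    fps = video.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * SAMPLE_EVERY_SECONDS)

    index = {}          # maps seconds into the video -> recognized text
    frame_number = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if frame_number % step == 0:
            # OpenCV returns BGR arrays; convert to RGB before handing to OCR.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            text = pytesseract.image_to_string(Image.fromarray(rgb)).strip()
            if text:
                index[int(frame_number / fps)] = text
        frame_number += 1

    video.release()
    for seconds, text in index.items():
        print(seconds, text[:60])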

I have no doubt that Google is working on this particular "issue." But that's only one of the problems with the data overload that we are just beginning to recognize.

Update Oct 22: I discovered that on Sept 15 (11 days after this post), they dropped a public beta of "Audio Indexing" in Google Labs. You can search based on the words spoken in a video. Check it out here: http://labs.google.com/gaudi.

Video

Jeff Veen on data overload from Jeffrey Zeldman on Vimeo.

Links

  1. I actually use Microsoft Office's OneNote for this and it works excellently. Evernote promises to do this as well but, sadly, can't deliver.
  2. Google, I believe, does read meta information about images & video. In this case, there is none. Nor is the video filename (which it could also "read") useful. In this case it is: "moogaloop.swf?clip_id=1206306&server=www.vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" - not particularly helpful.
    I say "next to" zero because Google has read the text below the video, and it is smart enough to surmise that the video may have something to do with that text - although the particular relationship is likely foggy, at best, to Google.

Thursday, September 04, 2008, 12:00 AM

tagged: videos, google, informationfiltering, informationoverload, searching, videotechnology