Fighting Spam: Use Wheelbarrows

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart." That's not good enough. We need a "Completely Automated Public Test to tell Humans and other Humans Apart" or, CAPTHHA.

Why? Read this ZDNet article: http://blogs.zdnet.com/security/?p=1835 (note: ZDNet menu completely busted in Firefox).

The article talks about the availability of broken CAPTCHAs for sale. Companies in India (possibly elsewhere, but this article highlights India) will, for a small fee ($2 for 1000), provide the means for a computer to bypass the CAPTCHA - thus disabling the CAPTCHA's sole purpose: keeping automated systems out.

Often CAPTCHAs are used to stop spammers. Hotmail, GMail, etc. use CAPTCHAs to ensure that new email accounts are requested by humans, not computers (since computers can and do get 1000s of email addresses from which to send spam). Spammers can automate the sign-up; the CAPTCHA is what stops them. And enterprising companies are making it possible to break them.

I'm going to skip the potential discussions on the ethics / morality of this. It's happening, let's just look at why and what can be done about it. I'm also going to pretend that spam is the main problem being stopped. It's not.

I think the problem, the reason this happens, is that a cultural understanding of labor costs has caused a fundamental flaw in how the problem is understood. That's not to say CAPTCHAs aren't useful, even needed. CAPTCHAs make abuse harder, or at least add a cost to it, and that difficulty or cost will stop some would-be spammers.

But the thinking underlying CAPTCHAs is this:

Tools are cheap - we'll buy them whenever we need them. Labor is expensive, so we'll buy tools to save us labor costs.

The computer is the tool in this case. The problem is that this assumption doesn't hold true globally.

Cement Truck or Wheelbarrows?

I spent many of my early years in Ecuador. I can remember one of my classrooms was on the top floor that overlooked the walls of the school. The next-door hospital was expanding and I watched as people went to and fro pouring cement floors in a new hospital building.

They did this manually with wheelbarrows.

In North America and Europe we might pour it directly from a truck. 3 people are involved (2 to hold the hose, 1 to drive the truck) and 1 large, expensive tool.

However, when labor is cheap, it's cost-effective to get people using wheelbarrows to move cement around and skip the truck. You might have 18 people, with 18 cheap tools (wheelbarrows). In North America, the cost of the additional 15 laborers is way higher than the cost of the truck - so you buy a truck and save 15 people and some money.
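To make the trade-off concrete, here is a rough sketch of the arithmetic. The crew sizes (3 and 18) come from the story above; every price and wage in it is a made-up number for illustration, not a real quote.

```python
HOURS = 8                  # length of one working day (assumed)
TRUCK_RENTAL = 800.0       # hypothetical daily cost of the truck
WHEELBARROW_COST = 5.0     # hypothetical daily cost of one wheelbarrow


def truck_cost(hourly_wage):
    """Daily cost of 3 workers plus 1 expensive tool."""
    return 3 * hourly_wage * HOURS + TRUCK_RENTAL


def wheelbarrow_cost(hourly_wage):
    """Daily cost of 18 workers plus 18 cheap tools."""
    return 18 * hourly_wage * HOURS + 18 * WHEELBARROW_COST


def cheaper_method(hourly_wage):
    """Return which approach costs less at the given wage."""
    if truck_cost(hourly_wage) < wheelbarrow_cost(hourly_wage):
        return "truck"
    return "wheelbarrows"


if __name__ == "__main__":
    for wage in (25.0, 1.0):   # hypothetical high-wage and low-wage economies
        print(wage, truck_cost(wage), wheelbarrow_cost(wage), cheaper_method(wage))
```

With these invented numbers the truck wins at $25 an hour and the wheelbarrows win at $1 an hour. The exact figures don't matter; the point is that the answer flips when the wage changes.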

I think the CAPTCHA approach stems from the same mindset. CAPTCHAs differentiate between computers and humans, and the premise is that spammers won't have a fleet of people to break them. They'll try to use a tool (a computer) to break the CAPTCHA. The premise is based on the same equation: labor = expensive, tools = cheaper.

So the solution (the CAPTCHA) is specifically aimed at being unbreakable to computers. But the premise doesn't hold everywhere, and so the solution doesn't work.

What to do?

Aside: Leverage work with reCAPTCHA

I've talked about reCAPTCHA before: /2007/8/15/recaptcha. I love its premise.

But, while it doesn't solve the problem, it has an interesting side effect in this situation.

A reCAPTCHA has 2 words: one known and the other unknown, and the unknown word is from a scanned book. They gather the unknown words together and, as people identify a word, they then know what that word is. They are effectively crowdsourcing the transcription of books.

With reCAPTCHA, spammers who are paying for the breaking of the CAPTCHAs would actually help in the effort, effectively paying people to transcribe these books.

I'm sure that there would be ways of rendering this good work ineffective, but they may not bother, since those providing the reading service have no motivation to stop it: they are happy to break the CAPTCHA, whether it is useful to society or not.
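As a rough illustration of the two-word idea, here is a minimal sketch. It is not reCAPTCHA's actual code: the function names, the in-memory vote store, and the agreement threshold of 3 are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

VOTES_NEEDED = 3  # hypothetical: agreeing answers needed to accept a word

# unknown-word id -> tally of answers typed by people who passed the control word
votes = defaultdict(Counter)


def submit(control_truth, control_answer, unknown_id, unknown_answer):
    """Check the known word; if it is right, count the answer for the unknown word.

    Returns True when the user passes the CAPTCHA.
    """
    if control_answer.strip().lower() != control_truth.strip().lower():
        return False  # failed the known word, so the other answer is ignored
    votes[unknown_id][unknown_answer.strip().lower()] += 1
    return True


def transcription(unknown_id):
    """Return the agreed-upon word, or None if there is no agreement yet."""
    tally = votes.get(unknown_id)
    if not tally:
        return None
    word, count = tally.most_common(1)[0]
    return word if count >= VOTES_NEEDED else None
```

Whoever types the answers, paid CAPTCHA breaker or ordinary user, only counts toward the book if they got the known word right, which is why the paid breakers end up contributing too.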

Flip It

But to actually solve the problem (allow only "legitimate" users into the system), you have to re-evaluate the problem statement. CAPTCHA, as a test to "tell Computers and Humans Apart," works, but it doesn't solve the problem.

To stop this, we need to tell legitimate users apart from illegitimate users. And this is really "by their actions."

You might build a system that would monitor the activity in some big-brother-esque manner. But I'd suggest a much simpler approach:

  • Skip it: let them spam and just have good spam tools

To have good spam tools you need:

  1. The spam tool itself (to detect & stop the unwanted behavior)
  2. Computing power to run the tool

The cost of the needed computing power is approaching 0. Amazon Web Services charges $0.15 for a GB of storage: my first computer cost well over $0.15 but stored just 40 MB. You can get a computer for $0.10 per hour with 1.7 GB of RAM: my first computer, again, cost more than that but had 128 MB of RAM. The cost of storage, processing, and memory required to stop spam is continually decreasing.

And we have tools to stop spam. I am regularly surprised that many people just aren't using good tools. GMail works fairly well - I regularly have hundreds of emails in my spam folder, but I've received only 6 spam emails in my inbox over the lifetime of my GMail usage. At work, I moved our email system to the open-source SpamAssassin. Open-source means that we can use it freely. And it works very well (which is what matters).

HAM: Everything Else

The article lists more than just email (which is what the existing spam tools are targeted at). What about Craigslist, MySpace, YouTube, and Facebook?

The only trouble spot I see is YouTube. The others are text based - and spam as text comments or spam as email can be handled in a similar fashion.

YouTube? That's trickier. If they are doing comment spam, then that's still text and can be handled by the same anti-spam tools. To a computer: text is text.

But if they want to post their pre-built video to YouTube, a video that is, essentially, just spam, that's a harder proposition. You could go after the text of the title (since spam subject lines are usually give-aways), but that could be spoofed by real humans (we're working under the proposition that we can get real human labor for cheap) writing new subject lines just as soon as one is flagged as spam.

We have a problem: the current state of technology can't understand video in a meaningful sense (more on this topic in the future - starting Thursday).

I'm sure that there are other solutions, but let's get real simple about this. We got into this problem because of our worldview. Now that we've learned that labor isn't overly expensive everywhere, we can solve it that way too.

We can set up shops of people that review the videos to determine if they are spam. Actual human determination of "spam" or "not spam."

A few points on this:

  • You might be able to get people for a lower wage than the spammers, since the hired people feel like they are contributing to the good of society.
  • You are a big company and you have the resources & understanding of tools to build better tools to make your reviewers more productive than the CAPTCHA crackers.

There is a potential problem in that it's more work to watch a video and determine that it is spam than it is to post a video that is spam.

But you can use crowdsourcing to help you. The existing "flag" button allows humans watching the video to tip you off. Then you review those flagged videos more closely. Or, once flagged by enough users, remove the video until it is reviewed and is determined to be valid.
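Here is a small sketch of that flag-then-review flow. It is only an illustration: the threshold of 5 distinct users, the in-memory data structures, and the function names are all hypothetical, and it has nothing to do with how YouTube actually handles flags.

```python
from collections import defaultdict

HIDE_THRESHOLD = 5  # hypothetical: distinct users needed to hide a video

flags = defaultdict(set)  # video id -> ids of the users who flagged it
hidden = set()            # videos pulled until a human looks at them
review_queue = []         # videos waiting for a reviewer, oldest first


def flag(video_id, user_id):
    """Record one user's flag; hide and queue the video once enough users agree."""
    flags[video_id].add(user_id)  # a set, so repeat flags from one user count once
    if len(flags[video_id]) >= HIDE_THRESHOLD and video_id not in hidden:
        hidden.add(video_id)
        review_queue.append(video_id)


def review(video_id, is_spam):
    """Apply a human reviewer's decision. Returns True if the video is restored."""
    review_queue.remove(video_id)
    flags.pop(video_id, None)  # start fresh if the video is ever flagged again
    if is_spam:
        return False           # stays hidden
    hidden.discard(video_id)
    return True
```

Counting distinct users rather than raw clicks is a deliberate choice: otherwise one annoyed (or paid) person could take a video down alone.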

Such an approach isn't perfect, but if it works well enough, then the spam won't be as economically interesting. You don't have to have a perfect system: it just has to be more costly to spam than the potential value available when spamming.
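A back-of-the-envelope sketch of that last point follows. The $2 per 1000 CAPTCHAs is the price quoted in the article; the value of a delivered spam message is a number I invented, so treat the results as an illustration only.

```python
CAPTCHA_COST_PER_POST = 2.0 / 1000  # from the article: $2 for 1000 CAPTCHAs
VALUE_PER_DELIVERED = 0.01          # hypothetical: spammer's earnings per spam that gets through


def spam_profit(posts, fraction_delivered,
                cost_per_post=CAPTCHA_COST_PER_POST,
                value_per_delivered=VALUE_PER_DELIVERED):
    """Spammer's profit when only a fraction of the posts survive filtering."""
    revenue = posts * fraction_delivered * value_per_delivered
    cost = posts * cost_per_post
    return revenue - cost


def break_even_fraction(cost_per_post=CAPTCHA_COST_PER_POST,
                        value_per_delivered=VALUE_PER_DELIVERED):
    """Fraction of posts that must get through for the spammer to break even."""
    return cost_per_post / value_per_delivered


if __name__ == "__main__":
    print(spam_profit(1000, 1.0))   # nothing filtered: spamming pays
    print(spam_profit(1000, 0.1))   # 90% filtered: spamming loses money
    print(break_even_fraction())
```

Under these assumed numbers the filter doesn't need to catch everything. Once it stops more than about 80% of the posts, the spammer is paying more for CAPTCHAs than the spam brings in.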

Summary

I think that spam is going to happen. I also think that global economics change how we should think about spam - along with other things. Rather than trying to stop it, you have to change the outcome. If it's easy to do, it will happen. The less cost-effective it is, the less it will be done.

Also, your mindset is, by definition, attached to the cultures that you have experienced. And most of the world is not you and may be thinking differently about things. You need to see things differently to see the dangers and opportunities that exist in a different world.

Sunday, August 31, 2008, 12:00 AM

tagged: economics, crowdsourcing, cultures, youtube, videotechnology