WintercoreLabs Thinking code

4 May 2008

Toward a new generation of audio captchas

It seems the post "Breaking Gmail's audio Captcha" has been slashdotted, and many interesting discussions have emerged as a result. It's worth noting that there is nothing especially exciting in the approach used to break the Google audio captcha: it merely applies a handful of signal analysis and pattern recognition principles. Almost any voice recognition or audio processing developer could break not only that captcha but, nowadays, any other.

I was planning to write about how audio captchas may become a future attack vector for spammers, but after a couple of minutes of googling I stumbled upon the following offer: http://www.getafreelancer.com/projects/C-C-Audio-Services/Recognize-Voice-Captcha-Google.html

Unfortunately the future is right now.

Audio captchas are the alternative to image captchas for visually impaired people. Either way, both kinds of captcha must share one unconditional requirement:

  • A captcha should be easily solved by humans by relying on human nature only, not on their level of culture. We have to demonstrate that we are human, not our IQ.

Despite this, 99% of captchas today still present alphanumeric challenges. That approach ensures that 99% of people with access to a computer will know how to solve them, but such challenges have more to do with culture than with our basic background as human beings.

Microsoft Research thought about that fact, I guess, and then came up with the Asirra captcha. Have you heard of someone who has never seen a cat or a dog? Probably not. But have you heard of someone who is illiterate? Probably, yes. That's the difference: as human beings we may learn a lot of things from other humans, but what is inherent to our human condition is the ability to "automatically" interact with our environment. It may be difficult to understand how to solve a differential equation, but if you see a cat, you see a cat, and you know it is a cat. You have been seeing cats since you were a kid: in your friend's house, in the park, on TV, in the pet shop... Your brain does not have to work very hard to bring you that information.

So the question is, how do we apply this concept to audio captchas? In exactly the same way. Do you remember when you saw cats? You also heard them, right? So could you distinguish a cat from a dog just by hearing their characteristic "sounds"? Obviously, you could.

99.9% of people can likely distinguish a dog barking from a cat meowing. Now, think for a while as if you were a computer (make that effort, please ;) ): how do you distinguish a cat from a dog? It is really difficult. If you'll allow me the comparison, this sort of captcha should be something similar to metamorphic viruses.

To make the problem harder, now put that barking dog in the middle of a crowded, noisy street. Even then, you would likely know there is a dog around. A computer, however, would have to filter n previously unknown features, looking for a barely predictable vector holding the set of features that represents a dog barking, or whatever else is playing. The automated agent cannot predict which challenge the captcha will propose, whereas today a bot knows beforehand that it faces an alphanumeric question. Computationally this is a really complex problem. First, the automated agent would have to syntactically analyze the question, and then mine the audio captcha relying on its own "world's sounds" database. Totally unreliable for automated agents, at least for non-government-supported ones ;) Some example sounds:

  • A baby crying.
  • A thunderstorm.
  • A baby crying in the middle of a thunderstorm.
  • A dog barking while a baby is crying in the middle of a thunderstorm.
  • ...
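As a rough illustration of how such layered scenes could be built, here is a minimal sketch that sums several clips into one. It assumes each clip is already decoded into a list of float samples in the range [-1.0, 1.0] at the same sample rate; the function name `mix_layers` and the gain values are made up for this example.

```python
def mix_layers(layers):
    """Sum (samples, gain) layers sample by sample, clipping to [-1.0, 1.0].

    `layers` is a list of (samples, gain) pairs, where `samples` is a list
    of floats. Shorter layers are treated as silent once they run out.
    """
    length = max((len(samples) for samples, _ in layers), default=0)
    mixed = [0.0] * length
    for samples, gain in layers:
        for i, value in enumerate(samples):
            mixed[i] += gain * value
    # Hard clipping keeps the result inside the valid sample range.
    return [max(-1.0, min(1.0, v)) for v in mixed]


# Hypothetical clips: a short "baby" layer over a longer "thunderstorm" layer.
baby = [0.5, 0.5]
thunderstorm = [1.0, 1.0, 1.0]
scene = mix_layers([(baby, 1.0), (thunderstorm, 0.5)])
print(scene)
```

Lowering the gain of the background layer keeps the target sound audible for a human while still burying it in noise.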

But what about the question? I mean, for an audio captcha that plays a baby crying, what should the question be? Whatever you want. If you insert all the sounds of babies crying into a database along with their proper tags, you can make up any question you want, either generic or more specific.

e.g.: baby_crying_3.wav -> "baby", "crying".

Question 1: What does this sound represent?

Answer: Mmm, eerrr I think it's a baby crying !!

Question 2: What is the baby (dog, cat, airplane...) doing?

Answer: Mmm, eerrr I think the baby is crying !!

Or even

girl_alphanumeric_sequence_5_5_2_4_5.wav -> "girl", "five", "five", "two", "four", "five", "numbers".

Question 3: What does this sound represent?

Answer: Mmm, eerrr I think it's a girl saying numbers !!

Note that the second question exposes too much information to the "attacker" and is open to a purely syntactic attack, since a baby (dog, cat...) cannot do many things...
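The tag database idea above can be sketched in a few lines. This is only an assumption about how it might be laid out: the dictionary `SOUND_TAGS`, the functions `make_challenge` and `matched_tags`, and the single generic question are all invented for illustration, using the file names from the examples.

```python
import random

# Hypothetical tag database: sound file -> tags describing it.
SOUND_TAGS = {
    "baby_crying_3.wav": ["baby", "crying"],
    "girl_alphanumeric_sequence_5_5_2_4_5.wav": [
        "girl", "five", "five", "two", "four", "five", "numbers",
    ],
}

GENERIC_QUESTION = "What does this sound represent?"


def make_challenge(rng=random):
    """Pick a random clip; return (file name, question, accepted tags)."""
    filename = rng.choice(sorted(SOUND_TAGS))
    return filename, GENERIC_QUESTION, set(SOUND_TAGS[filename])


def matched_tags(answer, accepted):
    """Return the accepted tags that appear as words in a free-text answer."""
    words = {w.strip(".,!?").lower() for w in answer.split()}
    return words & accepted


filename, question, accepted = make_challenge()
print(filename, "->", question)
print(matched_tags("Mmm, eerrr I think it's a baby crying !!", {"baby", "crying"}))
```

Because the question is generic, it reveals nothing about the clip; how many tags an answer must hit before it is accepted is a policy decision left open here.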

You should use syntactic recognition to parse the answers. If you don't have the means, regexps, a dictionary and the Levenshtein distance (to deal with spelling errors) should work like a charm.
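A minimal sketch of the regexp plus Levenshtein approach follows. The function `answer_matches` and the tolerance of one edit per word are assumptions chosen for the example, not a tuned recommendation; very short tags would probably need a stricter threshold.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def answer_matches(answer, required_tags, max_distance=1):
    """True if every required tag is close to some word in the answer."""
    words = re.findall(r"[a-z]+", answer.lower())
    return all(
        any(levenshtein(word, tag) <= max_distance for word in words)
        for tag in required_tags
    )


print(answer_matches("I think it's a babby cryng", ["baby", "crying"]))
print(answer_matches("a dog barking", ["baby", "crying"]))
```

The regexp splits the answer into lowercase words, the tag list plays the role of the dictionary, and the edit distance absorbs small spelling mistakes.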

You can distort, speed up, slow down, cut, expand... these captchas, making the problem harder to solve.
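To give an idea of what such transformations look like, here is a small sketch of two of them, again assuming clips are plain lists of float samples. The names `change_speed` and `cut` are made up, and the linear interpolation is a naive resampling that also shifts the pitch; a real system would likely use a proper audio library.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation.

    factor > 1 speeds the clip up (fewer samples), factor < 1 slows it down.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def cut(samples, start, end):
    """Drop the samples in [start, end) and join the remaining parts."""
    return samples[:start] + samples[end:]


clip = [0.0, 1.0, 2.0, 3.0]
print(change_speed(clip, 2.0))  # twice as fast
print(len(change_speed(clip, 0.5)))  # half speed, so twice as long
print(cut(clip, 1, 3))
```

Picking the factor and the cut points at random for every request would mean that the same source clip never reaches the client twice in the same form.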

Without any doubt, "natural captchas" are an interesting field for researchers.

Ruben Santamarta,

R&D/Reverse Engineer.
