A week ago I came across this interesting post at the Websense blog, anyway I guess everybody is already aware that a bot was spotted breaking Gmail's image captcha. According to the post, the success rate is about 20%, which from spammers point of view is really profitable and sure more than enough for its purposes. However what caught my attention, while reviewing the gmail signup page, was the Audio Captcha.
First off, it is worth noting the “cat&dog” Asirra captcha from Microsoft Research, that’s a really good captcha, has kept the success rate of those who tried to break it (computer vision gurus) below of 60%. Why? I think the problem with most of the captchas is that are using a complex solution to show so simple challenges: obfuscated, deformed and distorted image to represent short alphanumeric sequences. On the other hand we have the “cat&dog” style Captchas that implement a simple solution to show a really complex challenge for automated agents: Are you seeing cat or dogs in this perfectly clean picture? A question too hard to answer if you are not human.
The Gmail's Audio Captcha suffers a similar error. It is a wav file embedded within the webpage, once loaded it plays limited series of numbers . Twice. Btw, I don’t understand why that alphanumeric obsession...Anyway, let's begin. In this post I am going to show how that captcha can be broken just by using fourier analysis.
You should play the captcha before continuing Look for this image within the signup page.
The first obvious error is the use of fixed patterns that clearly identify where the sequence begins and where it ends.
We can listen to the numbers, in background there are distorted voices.Taking into account that human beings are visual entities ( this is the reason because everybody can spot Wally in a crowded place but only trained individuals could distinguish a distorted tone while an orchestra is playing) my question was: “If you are still capable of distinguishing easily the numbers played in the captcha, why an automated agent couldn’t do so?”
You can immediately recognize the first “beeps” that announce the initial sequence and at the half of image, which is also the half of the captcha more or less, the voice of the girl saying “once again”. Moreoever,you see the signal of every number played has its particular “shape”.
Well, that’s all. You have already broken the Gmail audio captcha. Now, having the concept in your mind, all you need to do is write down that idea into something more tangible and less volatile.
Jean Baptiste Fourier discovered, more than a hundred years before iPhone, that every waveform could be generated by adding up sine waves. But wait, he also found that every waveform could be broken down into those sine waves as well. So all we have is a bunch of sine waves that make up our signal. Let’s imagine a signal along the following axis
Viewing the graph along the time axis we would be in the time domain, on the other hand, if we get as reference the frequency axis we are talking about the frequency domain, what’s the difference? A lot, but the most obvious is that in the time domain you are choosing a “local” reference at every moment, on the other hand viewing the signal from the frequency domain, we get a “global” perspective of the signal. We represent every sine wave that makes up the signal as a vertical line as follows:
Those lines (components) uniquely identify our signal. In pattern recognition if you manage to extract good features from whatever you are mining, you’ll get the 50% of the work already done. In this case what we are mining is a wav file which is merely a vector of N samples representing a signal at t time. The higher sample rate, the more accuracy.
Ok, so the next step is to project that vector into the frequency domain by using the Discrete Fourier Transformation through the Fast Fourier Transformation algorithm.
However, there is something we have to note before proceeding, we are calculating the FFT on N samples that represent a sine wave over a certain time, this wave is not strictly required to be periodic in that interval so we need a method to “readjust” our samples in order to minimize the impact that a non-periodic sampled wave may produce, this is a well-known problem called “leakage”. In order to solve this issue we have to apply a Window function on the samples. Therefore, allowing the FFT to operate only on a certain selected interval we'll achieve more realistic results. The window function applied is one of the most common: Hanning (watch out, not hamming) .
Well, once the signal is in the frequency domain the next goal is to calculate its power spectral density since that is a good standpoint for identifying the signal as we’ll see later. The last step in this block is to adjust the scale. The human ear is ruled by the logarithmic scale so we wouldn’t be less than those “humans” or whatever…jokes aside we are modifying the amplitude scale since this is the best way to get a proper view of the small signals whilst the large ones are still represented as deserve
Spectrum sample of a random captcha.
Another view of signal’s Power differences in the time domain. The spectrogram.
As you can see in the Spectrum image, there are a a bunch of peaks and valleys. These peaks are what we are going to use as features for characterizing our sampled signals. Therefore the next step is to find out those peaks. In order to discover the local maximums we calculate the second derivative of a cubic equation specially chosen to represent the spectrum. Et voilá, those peaks can be used as features for any classification algorithm you desire. The method proposed is simple, throwing away HMMs, SVMs, ANNs et cetera… what we do is a “peak matching” algorithm along the entire input signal,firstly windowing the input captcha, at every iteration, to the FFT size of the signal of the number we are checking. Then calculating both spectrums and finally comparing the peaks obtained. By using this technique the percentage of success easily raises to 90% so imagine what would you do by implementing the proper classification algorithm.
Download a video showing a live demo of the tool developed to break the audio captcha. This tool is not going to be released for obvious reasons.
Under my point of view the main problems present in this audio captcha are the following:
- Slightly distorted signal over the frequency domain.
- Signals have an invariant duration along the time axis.
- Same voice.
- Fixed patterns at the init, middle and end of the captcha.
- Numeric sequence as proposed challenge. (maybe the most important one)
These weak features make the Gmail's Audio Catpcha highly suitable for automated attacks. Moreover there are several audio captchas out there, i.e facebook, that suffer the same inherited design errors as well.
Please note that some important steps have been deliberately omitted in order to avoid this post becomes an step-by-step tutorial for spammers.