Onset Detection Part 6: Onsets & Spectral Flux

In an earlier post i talked about the scientific view of onset/beat detection. There’s a lot of different schemes out there that do the job more or less well. However, one approach was so intriguingly simple and performed so well compared to other more elaborate algorithms that i chose to use it for my purposes. It’s called spectral flux or spectral difference analysis and i’ll try to give you an insight of it workings in this article.

One of the best papers on the matter of onset detection is by Bello at al. which you can view here. I’ll review some of the concepts mentioned in there here and will steal a couple of images.

First of what is an onset? Looky:

At the top we see the sample amplitudes or wave form of a single note. The onset is the first thing to happen when playing a note, followed by an attack until it reaches its maximum amplitude followed by a decay phase. I refrain from defining the transient as it is of no concern for us.

The audio signals we want to process are of course a tiny bit more complex than a single note. We want to detect onsets in polyphonic music containing pitched instruments like guitars or violins, vocals and non pitched drums. As we have seen in the frequency chart a couple of articles back most of the pitched instruments overlap in their frequency range (mostly between 70Hz and 4000Hz). Perfect onset detection is thus impossible, however we can try to get an approximate estimation.

From what i could gather from the papers i read on the topic onset detection almost always follows the same schema. First you read in your audio signal, next you transform it to a onset detection function (just another array of floats corresponding to your original audio samples in some way) and finally you pick the peaks in this detection function as onsets/beats. The first stage we already talked about and it turned out to be pretty easy. The second stage of the process is often pretty elaborate doing some magic to transform the audio signal into another simpler, compressed one dimensional function. The last stage is not covered in detail in most of the literature. However, while i was working on a prototype it turned out that this stage is probably the most important one, even more important than the onset detection function calculation. In a future article we’ll concentrate on peak detection, for now let’s stick to onset detection functions.

Here’s a nice little chart for the general process of onset detection, again taken from Bello et al.

The pre-processing stage could for example do dynamic range compression, that is making everything equally loud, a shitty habit used in almost all mixes of modern day music. However shitty DRC might sound it is benefial for some onset detection functions. We won’t use one that benefits from it so we don’t care for pre-processing.

The reduction phase is equal to constructing the onset detection function. The goal of this stage is it to get a compressed view of the structure of the musical signal. The onset detection function should be able to catch musical events and make it easier to detect onsets as compared to trying that on the plain wave form of the audio signal. The output of the onset detection function are values indicating the presence or absence of a musical event for successive time spans. Remember how we read in samples for playback. We read in 1024 samples in each iteration. For such a sample window an onset detection function would output a value. For one second of music sampled at 44100Hz and using 1024 samples as the time span for the onset detection function we’d get roughly 43 values (44100/1024) that model the structure of the audio signal. Coincidentially the sample window size of 1024 can be processed by our FFT class. To summarize: generating the values of the onset detection function means calculating a value for each successive sample window which gives us a hopefully nice curve we perform peak detection on.

Now, i introduced the FFT in the last article so we are going to use it for generating the onset detection function. What’s the motivation for this? Listen to the following excerpt from “Judith” by “A Perfect Circle” and have a look at the image below:


The image depicts the spectra of successive 1024 sample windows of the audio signal. Each vertical pixel colum corresponds to one spectrum. Each pixel in a column stands for a frequency bin. The color tells us how much this frequency bin contributes to the audio signal in the sample window the spectrum was taken from. The values range from blue to red to white, increasing in that order. For reference the y-axis is labeled with the frequencies, the x-axis is labeled with the time. And holy shit batman we can see beats! I took the freedom to mark them with awesome arrows in Paint. The beats are visible only in the high frequency range from a bit about 5000Hz to 16000Hz. They correspond to the hi-hat being played. When you hit a hi-hat it produces something called white noise, a sound that spans the compelte frequency spectrum. Compared to other instruments it has a very high energy when hit. And we can clearly see that in the spectrum plot.

Now, processing this spectrum plot wouldn’t be a lot easier than going for the wave form. For comparison here’s the waveform of the same song excerpt:

We can see the hi-hat hits in the waveform too. So what’s the benefit of going to Fourier transform route? Well, it allows us to focus on specific frequencies. For example, if we are only interested in detecting the beats of the hi-hat or the crash or ride we’d go for the very high frequency range only as this is were those instruments will leave a note of their presence. We can’t do that with the amplitudes alone. The same is true for other instruments like the kick drum or the bass which we’d look for in the very low frequencies only. If we inspect multiple frequency ranges of the spectrum seperately we call it multi-band onset detection. Just to give you some keywords for Mr. Google.

So we know why we do our onset detection in the frequency domain but we are not any wiser on how to compress all the information of the spectrum plot to a managable one dimensional array of floats we can easily do peak detection on. Let me introduce you to our new best friend the spectral difference vulgo spectral flux. The concept is incredibly simple: for each spectrum we calculate the difference to the spectrum of the sample window before the current spectrum. That’s it. What we get is a single number that tells us the absolute difference between the bin values of the current spectrum and the bin values of the last spectrum. Here’s a nice little formula (so i get to use the latex wordpress plugin):

[latex size=”2″]SF(k) =\displaystyle \sum_{i=0}^{n-1} s(k,i) – s(k-1,i)[/latex]

SF(k) gives us the spectral flux of the kth spectrum. s(k,i) gives us the value of the ith bin in the kth spectrum, s(k-1,i) is analogous for the spectrum before k. We substract the values of each bin of the previous spectrum from the values of the corresponding bin in the current spectrum and sum those differences up to arrive at a final value, the spectral flux of spectrum k!

Let’s write a small program that calculates and visualizes the spectral flux of an mp3:

We start of by instantiating a decoder that will read in the samples from an mp3 file. We also instantiate an FFT object telling it to expect sample windows of 1024 samples and a sampling rate of 44100Hz. Next we create a float array for the samples we read from the decoder. We use a window size of 1024 samples. The next two lines instantiate auxiliary float arrays that will contain the current spectrum and the last spectrum. Finally we also need a place to write the spectral flux for each sample window to, an ArrayList will do the job here.

The following loop does the nasty job of calculating the spectral flux’. First we read in the next 1024 samples from the decoder. If that succeeded we calculate the Fourier transform and spectrum of this sample window via a call to fft.forward(samples). We then copy the last current spectrum to the array lastSpectrum and the spectrum we just calculated to the array spectrum. What follows is the implementation of the formula above. For each bin we substract the value of the last spectrum’s bin from the current spectrum’s bin and add it to a sum variable called flux. When the loop has finished flux contains… the spectral flux of the current spectrum. This value is added to the spectralFlux ArrayList and we continue with decoding the next sample window. We repeat this for each sample window we can read in from the decoder. After the loop is done spectralFlux contains a spectral flux for each sample window we read in. The last three lines just create a plot and visualize the result. Note: i created a small helper class called PlaybackVisualizer which takes a plot, the number of samples per pixel and a decoder that will playback the audio fromt he decoder while setting the marker in the plot correspondingly. Nothing to complex, we saw how to do that in one of the earlier articles. Here’s the output for the song “Judith” we used above already:

Hurray! we can again see peaks. This time they not only correspond to the hi-hat but also to the snare and the kick drum. The reason for this is that use the complete spectrum so the spectral flux values are governed by the instruments that have the most energy, in this case hi-hat, snare and kick drum. Now let’s review what we just did. Take the original spectrum image from above were i marked the hits of the hi-hat. What the code does is essentially creating a sum over the differences of the bins of two successive spectra. When the current spectrum has more overall energy than the previous spectrum the spectral flux function will rise. If the current spectrum has less energy than the previous spectrum the spectral flux function will fall. So we successfully mapped the 2D spectrum plot to a 1D function we can perform peak detection on. Well, sort of…

Before we do any peak detection we have to clean up the spectral flux function a little bit. Let’s start with ignoring negative spectral flux values. We are not interested in falling spectral flux but only in rising spectral flux. The process of ignoring the spectral flux values that are negative is called rectifying. Here’s the modified loop for calculating the spectral flux:

And the corresponding output:

A bit nicer. Rectifying the spectral flux will aid us when we calculate the a threshold function for peak detection.

Another clean up process is using a so called Hamming window. This Hamming window is applied to the samples before calculating the Fourier transform. It’s basically a smoothing function that will make our samples look a bit nicer. The FFT class has a function by which we can enable Hamming window smoothing:

That’s all there is to do, here’s the result:

Not a big difference but we can see how it equalizes the first 4 hits a little bit while removing one at the end of the sample. In other songs the hamming window has a bigger effect on the resulting spectral flux but it’s really up to you wheter to use it or not. It doesn’t cost a lot of performance.

With this i want to conclude this article. We have seen how to go from samples to spectrums to calculating a 1D function that we’ll later use to detect the peaks and therefore onsets in an audio signal. From the images above we can already see a lot of peaks which seem easy to detect. In the next article we’ll go a step further and calculate the spectral flux of different frequency bands. By this we hope to be able to extract various instruments as oposed to focusing on the loudest thing in the complete spectrum. We’ll also employ a technique called hopping where we calculate the spectral flux not on overlapping successive sample windows instead of non-overlapping sample windows. This will further smooth the resulting spectral flux function a little bit. We are then ready to create a simple but effective peak detector that gives us the final onset times.

As always you can find all the source code at http://code.google.com/p/audio-analysis/. Over and out.

Onset Detection Part 5: The (Discrete) Fourier Transform

Warning: this is how i understood the discrete Fourier transform which might be wrong in a couple of places. For a correct explanation of the fourier transform i suggest reading http://www.dspguide.com/ch8.htm

So finally we’ve come to the dreaded Fourier transform article. Let’s review some of the things we already know. The first thing to remember is that our audio data was fetched from some kind of recording device or synthesized digitally. In any case what we have is measurements of amplitudes for discrete points in time which we call samples. We can either have mono or stereo samples, in the later case we just have two samples per point in time instead of only one. The sample frequency is the number of samples we measured per second (for stereo we count both samples for the left and right channel as a single one in this case). Here’s 1024 samples of the note A (440Hz) at a sampling rate of 44100Hz :

Note: the vertical line to the left is not part of the actual image. It’s wordpress fucking up

1024 samples at a sampling rate of 44100Hz equals a time span of 43,07ms. From this image we can see that the samples form something that resembles a sine wave pretty much. And in fact the samples were generated using Math.sin() as we did in the first article of this series, confirming that sound is nothing but a mixture of sine waves of different frequency and amplitude.

A smart mathemagicion called Fourier once proposed the theorem that all possible functions could be approximated by a series of so called sinusoids, that is sine waves with different frequencies and amplitudes (actually Gauss already knew this but Fourier got the fame).  He did all this magic for so called continuous functions which you will remember from your analysis course in school (you do remember analysis, right?). This was extended to discrete functions and guess what, our samples are just that, a discrete function in time. Now, a Fourier transform decomposes a signal into a set of sinusoids with different frequencies and amplitudes. If we apply a Fourier transform to a discrete audio signal we get the frequency domain representation of our time domain audio signal. This transform is invertable, so we can go one or the other way. From now on we only consider discrete Fourier transforms on audio signals to not overcomplicate matters.

What does the frequency domain of our time domain signal tell us? Well, put simply it tells us what frequencies contribute how much to the time domain signal. A frequency domain signal can thus tell us wheter there’s the note a at 440Hz in the audio signal and how loud this note is in respect to the complete tune. This is pretty awesome as it allows us to investigate specific frequencies for our beat detection purposes. From the frequency chart of the last article we know that various instruments have different frequencies they can produce. If we want to focus on say singing we’d have a look at the frequency range 80Hz to 1000Hz for example. Of course instruments overlap in their frequency ranges so that might lead to some problems. But we’ll care for that later.

Now, what we’ll use for our frequency analysis is called the discrete Fourier transform. Given a discrete time signal, e.g. samples measured at a specific sampling rate we get a discrete frequency domain out. As an intermediate stage we get the so called real and imaginary part of the frequency domain. There is some scary terms in the last sentence so let’s just go over that quickly. The fourier transform decomposes a time signal into coefficients for sinusoids. The real part holds the Fourier coefficients for cosines of specific frequencies (a cosine is a sinusoid) as well as for the sines of the same frequencies. For our use case this representation is not all that important. What we will use is the so called spectrum which is calculated from the real and imaginary parts of the original fourier transform.  The spectrum tells us for each frequency how much the frequency contributes to the original time domain audio signal. From now on we assume that we already have calculated this spectrum, which in the framework is done by a class called FFT (taken from Minim) which relieves some of the pain.  If you want to understand how to go from the real and imaginary parts of the original Fourier transform to the spectrum read the link above.

Not all possible frequencies will be present in the spectrum due to the discrete nature of the transform. Frequencies are binned, that is multiple frequencies are merged together into a single value. Also, there’s a maximum frequency the discrete Fourier transform can give us which is called the Nyquist frequency. It is equal to half the sampling rate of our time domain signal. Say we have an audio signal sampled at 44100Hz the Nyquist frequency is 22050Hz. When we transform the signal to the frequency domain we thus get frequency bins up to 22050Hz. How many of those bins are there? Half the number of samples plus one bin. When we transform 1024 samples we get 513 frequency bins each having a bandwidth (range of frequencies) of the Nyquist frequency divided by the number of bins except for the first and last bin which have half that bandwidth. Say we have 1024 samples we transform, sampled at 44100Hz. The first bin will represent frequencies 0 to 22050 / 513 / 2~ 21.5Hz (remember, the first bin has half the normal bandwidth). The next bin will represent the frequencies from 21.5Hz to 21.5Hz plus 22050 / 513 ~ 43Hz == 64.5 Hz (one more time, 21.5Hz to 64.5Hz, bandwidth is 43Hz). The next bin ranges from 64.5Hz to 107.5Hz and so on and so forth.

Whenever we do a discrete Fourier transform on our audio data we do it on a window of samples, e.g. 1024 samples is a common window size. The size of the window dictates the resolution of the resulting spectrum, that is the number of frequency bins in the spectrum will increase linearly with the number of samples we transform. The bigger the sample window the more fine grained the bins will be, that is the smaller their bandwidth. To calculate the spectrum we use a specific algorithm called the Fast Fourier Transform (FFT) which runs in O(n log n), pretty fast compared to a naive fourier transform implementatin which takes O(n*n). Almost all of the FFT implementations demand sample windows with a size being a power of two. So 256, 512, 1024, 2048 is fine, 273 is not. The FFT implementation we use in the framework also has this “limitation”.

So without further ado i present you the spectrum of the samples depicted in the image above

Note: the vertical lines to the left and right are not part of the actual image. It’s wordpress fucking up

Yes, that’s it. The x-axis corresponds to the frequency bins while the y-axis tells us the amplitude of a frequency bin. We clearly see a peek at the left side, the rest of the spectrum is zero. This peak corresponds to the frequency bin which contains the frequency 440Hz, the frequency of the note A we generated 1024 samples for. What we also see is that it’s not a clean peak, bins to the left and right also receive some amplitude. This is called leakage and is something we can’t solve. In our case it’s not a big concern really but in other application scenarios it might be. Here’s the program that generated the above two images:

The first couple of lines should be familiar, we simply generate 1024 samples at a sampling rate of 44000Hz. There’s actually only 2 interesting lines in this program. The first one is the one where we instantiate the FFT object. As parameters for it’s constructor it wants the sample window size we’ll use to generate the spectrum as well as the sampling rate. The next line performs the Fourier transform as well as the spectrum calculation. All we have to do is pass in a float array of samples having the size we specified in the FFT constructor. That’s all we need to do to get our precious frequency bins. What’s left is plotting the samples as well as the spectrum. Note the call to fft.getSpectrum(). This method will return the spectrum of the last fourier transform we did by calling FFT.forward().

Pfuh, that was pretty nasty. I really suggest reading up on the discrete Fourier transform. It’s an essential mathematical tool for all audio engineers. And while most audio engineers will use some sort of toolbox as we did it doesn’t hurt to know your way around.

We now are equipped with all we need to get our onset/beat detector going. Stay tuned for the next article.

Onset Detection Part 4.5: What to expect

As i said in part one of this series sound can be broken down into various frequencies that take part in the overall sound of song for example. In the next article we are going to dive into this topic in detail but for this article there’s something you need to know about frequencies first: Every instrument has a frequency range, the same is true for vocals. This a pitched instruments/voices. Drums and other percussive instruments are often non-pitched and spread over the whole frequency spectrum. Here’s a very interesting char taken from Wikipedia:

Before we dive into fourier transforms and similar topics i want to give you some impression of what to expect from the onset detector we develop here. For this i prepared a couple of videos for different genres that show what our onset detector will produce. Without explaining the details: in each video you will see 3 plots, the first for low frequencies, the next for mid frequencies and the last one at the bottom for very high frequencies. The red lines gives us an idea where an onset might be located. The green lines is a running average. Any red peak above a green line is a potential onset/beat. With that said let’s see how the detector works on various genres. Let’s start with Pop.

A pretty well known song i guess (yack…). In all three ranges the onsets are clearly visible. This comes from the fact that the song is largely composed of synthesized instruments. Especially the bass and drums leave their impression on the detector and it seems to work pretty well for this genre. On to the next genre: Rock

A song form the all-star band “A perfect circle” called weak and powerless. The drums are again pretty influencial on the overall onset/beat landscape. The detector does it’s job pretty good in this case too. Rock is a bit more difficult for any onset detection schema due its higher noise level. Distorted guitars and basses polute all frequencies pretty heavily and make the detection task a bit harder. However, if the sound engineer responsible for mixing the final song did his job well the frequencies of those instruments that dictate the rythm should stand out and be detectable by our algorithm. Next we have a look at acoustic rock:

A song by “Tenacious D” called Explosivo which starts of with an acoustic guitar and then goes into a more heavier part. The detector picks up the strokes of the guitar pretty well in the beginning. When the heavy part kicks in the guitar get’s into the background and the drums become the most influential rythmic section. In the top plot we also can see the drummer playing the kick pretty fast which i normally didn’t hear all that clearly without having the plot. Also note how the hi-hat fades in in the bottom plot before the heavy part starts. All drum parts like hi-hat, cymbals and so on will be recogniseable in that plot.

A pure acoustic song, performed by yours truely. The detector picks up the guitar extemely well. You can see the 3/4 rythm easily in all three plots. The voice does not have a big impact on the detector.

Jazz is said to be extremely hard for onset detectors. A lot of notes a player simultaniously or near simultaniously. Also, a lot of the frequency range is used up. In this example the instrument governing the rythm is again the drums as well as a little bit of bass which are picked up by the detector. However, in the beginning it’s the piano which does the rythm section. Again the detector does its job pretty well. Note how the clapping at the beginning makes the detector go a bit crazy. The clapping is basically just random noise as far as the detector is concerned, so he tries his best to pick what might be rythmical patterns.

Classic is the nightmare of every onset detector, especially when the composition is composed of string insturments only as in this play by Mozart. The beat detector picks up quiet some of the heavy attack strokes but anything with vibrato is a mess. It still performs better than Audiosurf in that regard though.

Finally some Metal by In-Flames. What’s interesting is that the snare has the most energy. Not so for the kick drum which is actually pretty silent compared to the rest of the song. If you’d look at the complete frequency spectrum of this song you’d see one big messy heatmap (at least i use a heatmap for frequency spectrums). The hi-hat is not really picked up by the detector that well as it is played open which produces a quiet noisy sound without a lot of energy.

The next video shows all the excerpts form above mixed down to a single mp3 and played in Audiosurf. The outcome might be different when feeding each excerpt seperately to Audiosurf but that would have been to tedious 🙂

For “Hit me baby one more time” Audiosurf generates blocks for the snare and sometimes for the kick. “Weak and Powerless” also produces blocks mostly for the hi-hat/cymbals/ride. “Explosivo” is a bit strange. At first sight it does not have any resamblance to the song. But if you look closer it becomes clear that the blocks are based on the singing! I guess Audiosurf takes the peaks which are the loudest, in “Explosivos” case this is the singing most of the time. The next song is a mix between singing and guitar strokes. For the jazz song the piano is not picked up in the beginning it seems but the clapping is. The rest is governed by the ride. The classic piece is complete fail in the beginning and gets better towards the end. This seems to coincide with the volume of the violins going up. The last piece picks up the crash most of the time with some snare sounds mixed in. All in all it seems that Audiosurf does a frequency band analysis for various bands and then picks the loudest band to generate the blocks. This works out pretty well for popular music but does not perform so well for classic music it seems. What’s interesting is that for cloud connected i really wonder how the crash got picked up as i played around in audacity high pass filtering it quiet a lot and still couldn’t see peaks. Strange indeed 🙂

The onset detector we develop in this series of articles seems to be able to pick up all the onsets audiosurf detects (if we adjust it a bit for singing if we want that which should be rather easy). Maybe we can even come up with something more robust that also handles classical pieces a bit better than Audiosurf does.