Onset Detection Part 7: Thresholding & Peak picking

In the last article we saw how to reduce a complex spectrum that evolves over time to a simple one dimensional function called the spectral flux. On the way we did some improvements to the spectral flux function like rectifying it and enabling Hamming smoothing of the audio samples. In this article we will talk about a so called threshold function which we derive from the spectral flux function.

That’s a rectified, Hamming smoothed spectral flux function. It’s for an excerpt of the song “Judith” by “A Perfect Circle” and already shows quiet some peaks which we could directly interpret as onsets. However, there’s still some noise in that function, especially when the crashs are hit for example. We can’t do any better filtering, like for example smoothing out the function a bit as this would get rid of a lot of the small peaks. Instead we do something totally simple: we calculate a threshold function which is derived from the spectral flux function. This threshold function is then later used to discard spectral flux values that are noise.

The threshold function is actually pretty simple. Given our spectral flux function we calculate the mean or average value of a window around each spectral flux value. Say we have 3000 spectral flux values, each specifying the spectral flux of a 1024 sample window. 1024 samples at a sampling rate of 44100Hz equal a time span of around 43ms. The window size we use for the threshold function should be derived from some time value, say we want the average spectral flux of a time span of 0.5 seconds. That’s 0.5 / 0.043 = 11 sample windows or 11 spectral flux values. For each spectral flux value we take the 5 samples before it, 5 samples after it and the current value and calculate their average. We thus get for each spectral flux value a single value we call the threshold (for reasons that will become apearant shortly). Here’s some code that does this:

Not a lot of new things in there. First we calculate the spectral flux function as we did in the last article. Based on this spectral flux function we then calculate the threshold function. For each spectral flux value in the ArrayList spectralFlux we take the THRESHOLD_WINDOW_SIZE spectral flux values before and after it and calculate the average. The resulting mean value is then stored in an ArrayList called threshold. Note that we also multiply each threshold value by MULTIPLIER which is set to 1.5 in this example. After we finished calculating all our functions we plot them. The result looks like this:

What we just did is calculating a running average of the spectral flux function. With this we can detect so called outliers. Anything above the threshold function is an outlier and marks an onset of some kind! It should also be clear why the threshold function values are multiplied by a constant > 1. Outliers must be bigger than the average, in this case 1.5 times bigger than the average. This multiplier is an important parameter of our onset detector as it governs the sensitivity. However, when i apply the detector to songs i try to come up with a single value that works for all of them. I actually did that and arrived at the magic 1.5 multiplier used above. It works well for all the samples that you can find in the svn as well as a lot of other songs i tested.

Now let us combine the spectral flux function and the threshold function. Basically we want a prunned spectral flux function that only contains values that are bigger or equal to the threshold function. We extend the above program by the following lines:

The variable prunnedSpectralFlux is just another ArrayList. The loop is pretty simple, we add zero to the prunnedSpectralFlux list of the current spectral flux is less than the corresponding threshold function value or we add the spectrul flux value minus the threshold value at position i. The resulting prunned spectral flux function looks like this:

Awesome, we are nearly finished! All that is left is to search for peaks in this prunned spectral flux. A Peak is a value that is bigger then the next value. That’s all there is to peak detection. We are nearly done. Let’s write the small code for the peak detection producing a peak ArrayList:

And that’s it. Any value > 0 in the ArrayList peaks is an onset or beat now. To calculate the point in time for each peak in peaks we simply take its index and multiply it by the time span that the original sample window takes up. Say we used a sample window of 1024 samples at a sampling rate of 44100Hz then we have the simple forumula time = index * (1024 / 44100). That’s it. Here’s the output:

We are done. Kind off. That’s the most basic onset detector we can write. Let’s review some of it’s properties. As with many systems we have a few parametes we can tweak. The first one is the sample window size. I almost always use 1024. Lowering this will result in a more fine grained spectrum but might not gain you much. The next parameter is wheter to use Hamming smoothing on the samples before doing the FFT on them. I did see a bit of improvement turning that on but it will also cost you some cycles calculating it. This might be of importance if you plan on doing this on a mobile device with not FPU. Every cycle saved is a good cycle there. Next is the threshold window size. This can have a huge impact on the outcome of the detection stage. I stick to a window of 20 spectal flux values in total as in the example above. This is motivated by calculating how big of a time span those 20 values represent. In the case of a sample window 1024 samples at a sampling rate of 44100Hz that’s roughly have a second worth of data. The value worked well on all the genres i tested the simple detector on but your mileage may vary. The last and probably most important parameter is the multiplier for the threshold function. To low and the detector is to sensitive, to high and nothing get’s cought. In my day job i sometimes have to do outlier analysis which is pretty similar to what we do here. Sometimes i use simple threshold based methods and a value between 1.3-1.6 seems to be some sort of magic number when doing statistics based outlier detection relative to some means. I’d say that the threshold multiplier and the threshold window size should be the only parameters you tweak. Forget about the rest and use the defaults of 1024 for sample window size and Hamming smoothing on. Doing a parameter seach for to many parameters is no fun so boiling it down to “only” two makes sense in this case. The defaults i gave you above work reasonably well with the detector we developed in this series. If you do multi-band analysis you might have to tweak them individually for each band.

What’s left is some improvements like doing this complete process not for the whole spectrum but sub-bands so that we can do onset detection for various instruments (sort of, as we know they overlap in frequency range). Another improvement is to use overlapping sample windows, say by 50%. This smooths out the Fourier transform a bit more and gives better results. One could also try to do dynamic range compression on the whole audio signal before passing it to the FFT and so on and so forth. The code in the framework allows you to experiment easily with various additional steps. You will also want to maybe clean up the resulting peak list and remove any peaks that are to close together, say < 10ms. Learning by doing is the big mantra. Now go out and write awesome music games! Here's a couple of papers and links i found very useful in the process of getting this to work: http://www.dspguide.com/ An awesome online book (also available as hard cover) called “The Scientist and Engineer’s Guide to Digital Signal Processing” featuring anything you’ll ever want to know about digital signal processing (as an engineer :)).
https://ccrma.stanford.edu/~jos/dft/ Another online book called “Mathematics of the discrete Fourier Transform with audio applications” coping with anything related to the discrete Fourier Transform. Pretty heavy on math but a very good read.
http://old.lam.jussieu.fr/src/Membres/Daudet/Publications_files/Onset_Tutorial.pdf A very good scientific paper called “A tutorial on onset detection for musical signals”. Features various onset detection functions (we used the un-banded spectral flux here) as well as some data on the performance of various functions on a standardized data set (Mirex 2006 Onset Detection Challenge). Look out for any other papers by Bello, he’s a big player in the field it seems.
http://www.dafx.ca/proceedings/papers/p_133.pdf “Onset detection revisited” by Dixon, another big name in the field (at least it seems so from the citation counts, but we all know how much they really tell you…). Very good paper, must read.

If you check out the citation graph of the two papers mentioned on google schoolar you will many more papers talking about onset detection that might interest you. Google *is* your friend in this case. Always read your fine papers!

Onset Detection Part 6: Onsets & Spectral Flux

In an earlier post i talked about the scientific view of onset/beat detection. There’s a lot of different schemes out there that do the job more or less well. However, one approach was so intriguingly simple and performed so well compared to other more elaborate algorithms that i chose to use it for my purposes. It’s called spectral flux or spectral difference analysis and i’ll try to give you an insight of it workings in this article.

One of the best papers on the matter of onset detection is by Bello at al. which you can view here. I’ll review some of the concepts mentioned in there here and will steal a couple of images.

First of what is an onset? Looky:

At the top we see the sample amplitudes or wave form of a single note. The onset is the first thing to happen when playing a note, followed by an attack until it reaches its maximum amplitude followed by a decay phase. I refrain from defining the transient as it is of no concern for us.

The audio signals we want to process are of course a tiny bit more complex than a single note. We want to detect onsets in polyphonic music containing pitched instruments like guitars or violins, vocals and non pitched drums. As we have seen in the frequency chart a couple of articles back most of the pitched instruments overlap in their frequency range (mostly between 70Hz and 4000Hz). Perfect onset detection is thus impossible, however we can try to get an approximate estimation.

From what i could gather from the papers i read on the topic onset detection almost always follows the same schema. First you read in your audio signal, next you transform it to a onset detection function (just another array of floats corresponding to your original audio samples in some way) and finally you pick the peaks in this detection function as onsets/beats. The first stage we already talked about and it turned out to be pretty easy. The second stage of the process is often pretty elaborate doing some magic to transform the audio signal into another simpler, compressed one dimensional function. The last stage is not covered in detail in most of the literature. However, while i was working on a prototype it turned out that this stage is probably the most important one, even more important than the onset detection function calculation. In a future article we’ll concentrate on peak detection, for now let’s stick to onset detection functions.

Here’s a nice little chart for the general process of onset detection, again taken from Bello et al.

The pre-processing stage could for example do dynamic range compression, that is making everything equally loud, a shitty habit used in almost all mixes of modern day music. However shitty DRC might sound it is benefial for some onset detection functions. We won’t use one that benefits from it so we don’t care for pre-processing.

The reduction phase is equal to constructing the onset detection function. The goal of this stage is it to get a compressed view of the structure of the musical signal. The onset detection function should be able to catch musical events and make it easier to detect onsets as compared to trying that on the plain wave form of the audio signal. The output of the onset detection function are values indicating the presence or absence of a musical event for successive time spans. Remember how we read in samples for playback. We read in 1024 samples in each iteration. For such a sample window an onset detection function would output a value. For one second of music sampled at 44100Hz and using 1024 samples as the time span for the onset detection function we’d get roughly 43 values (44100/1024) that model the structure of the audio signal. Coincidentially the sample window size of 1024 can be processed by our FFT class. To summarize: generating the values of the onset detection function means calculating a value for each successive sample window which gives us a hopefully nice curve we perform peak detection on.

Now, i introduced the FFT in the last article so we are going to use it for generating the onset detection function. What’s the motivation for this? Listen to the following excerpt from “Judith” by “A Perfect Circle” and have a look at the image below:


The image depicts the spectra of successive 1024 sample windows of the audio signal. Each vertical pixel colum corresponds to one spectrum. Each pixel in a column stands for a frequency bin. The color tells us how much this frequency bin contributes to the audio signal in the sample window the spectrum was taken from. The values range from blue to red to white, increasing in that order. For reference the y-axis is labeled with the frequencies, the x-axis is labeled with the time. And holy shit batman we can see beats! I took the freedom to mark them with awesome arrows in Paint. The beats are visible only in the high frequency range from a bit about 5000Hz to 16000Hz. They correspond to the hi-hat being played. When you hit a hi-hat it produces something called white noise, a sound that spans the compelte frequency spectrum. Compared to other instruments it has a very high energy when hit. And we can clearly see that in the spectrum plot.

Now, processing this spectrum plot wouldn’t be a lot easier than going for the wave form. For comparison here’s the waveform of the same song excerpt:

We can see the hi-hat hits in the waveform too. So what’s the benefit of going to Fourier transform route? Well, it allows us to focus on specific frequencies. For example, if we are only interested in detecting the beats of the hi-hat or the crash or ride we’d go for the very high frequency range only as this is were those instruments will leave a note of their presence. We can’t do that with the amplitudes alone. The same is true for other instruments like the kick drum or the bass which we’d look for in the very low frequencies only. If we inspect multiple frequency ranges of the spectrum seperately we call it multi-band onset detection. Just to give you some keywords for Mr. Google.

So we know why we do our onset detection in the frequency domain but we are not any wiser on how to compress all the information of the spectrum plot to a managable one dimensional array of floats we can easily do peak detection on. Let me introduce you to our new best friend the spectral difference vulgo spectral flux. The concept is incredibly simple: for each spectrum we calculate the difference to the spectrum of the sample window before the current spectrum. That’s it. What we get is a single number that tells us the absolute difference between the bin values of the current spectrum and the bin values of the last spectrum. Here’s a nice little formula (so i get to use the latex wordpress plugin):

[latex size=”2″]SF(k) =\displaystyle \sum_{i=0}^{n-1} s(k,i) – s(k-1,i)[/latex]

SF(k) gives us the spectral flux of the kth spectrum. s(k,i) gives us the value of the ith bin in the kth spectrum, s(k-1,i) is analogous for the spectrum before k. We substract the values of each bin of the previous spectrum from the values of the corresponding bin in the current spectrum and sum those differences up to arrive at a final value, the spectral flux of spectrum k!

Let’s write a small program that calculates and visualizes the spectral flux of an mp3:

We start of by instantiating a decoder that will read in the samples from an mp3 file. We also instantiate an FFT object telling it to expect sample windows of 1024 samples and a sampling rate of 44100Hz. Next we create a float array for the samples we read from the decoder. We use a window size of 1024 samples. The next two lines instantiate auxiliary float arrays that will contain the current spectrum and the last spectrum. Finally we also need a place to write the spectral flux for each sample window to, an ArrayList will do the job here.

The following loop does the nasty job of calculating the spectral flux’. First we read in the next 1024 samples from the decoder. If that succeeded we calculate the Fourier transform and spectrum of this sample window via a call to fft.forward(samples). We then copy the last current spectrum to the array lastSpectrum and the spectrum we just calculated to the array spectrum. What follows is the implementation of the formula above. For each bin we substract the value of the last spectrum’s bin from the current spectrum’s bin and add it to a sum variable called flux. When the loop has finished flux contains… the spectral flux of the current spectrum. This value is added to the spectralFlux ArrayList and we continue with decoding the next sample window. We repeat this for each sample window we can read in from the decoder. After the loop is done spectralFlux contains a spectral flux for each sample window we read in. The last three lines just create a plot and visualize the result. Note: i created a small helper class called PlaybackVisualizer which takes a plot, the number of samples per pixel and a decoder that will playback the audio fromt he decoder while setting the marker in the plot correspondingly. Nothing to complex, we saw how to do that in one of the earlier articles. Here’s the output for the song “Judith” we used above already:

Hurray! we can again see peaks. This time they not only correspond to the hi-hat but also to the snare and the kick drum. The reason for this is that use the complete spectrum so the spectral flux values are governed by the instruments that have the most energy, in this case hi-hat, snare and kick drum. Now let’s review what we just did. Take the original spectrum image from above were i marked the hits of the hi-hat. What the code does is essentially creating a sum over the differences of the bins of two successive spectra. When the current spectrum has more overall energy than the previous spectrum the spectral flux function will rise. If the current spectrum has less energy than the previous spectrum the spectral flux function will fall. So we successfully mapped the 2D spectrum plot to a 1D function we can perform peak detection on. Well, sort of…

Before we do any peak detection we have to clean up the spectral flux function a little bit. Let’s start with ignoring negative spectral flux values. We are not interested in falling spectral flux but only in rising spectral flux. The process of ignoring the spectral flux values that are negative is called rectifying. Here’s the modified loop for calculating the spectral flux:

And the corresponding output:

A bit nicer. Rectifying the spectral flux will aid us when we calculate the a threshold function for peak detection.

Another clean up process is using a so called Hamming window. This Hamming window is applied to the samples before calculating the Fourier transform. It’s basically a smoothing function that will make our samples look a bit nicer. The FFT class has a function by which we can enable Hamming window smoothing:

That’s all there is to do, here’s the result:

Not a big difference but we can see how it equalizes the first 4 hits a little bit while removing one at the end of the sample. In other songs the hamming window has a bigger effect on the resulting spectral flux but it’s really up to you wheter to use it or not. It doesn’t cost a lot of performance.

With this i want to conclude this article. We have seen how to go from samples to spectrums to calculating a 1D function that we’ll later use to detect the peaks and therefore onsets in an audio signal. From the images above we can already see a lot of peaks which seem easy to detect. In the next article we’ll go a step further and calculate the spectral flux of different frequency bands. By this we hope to be able to extract various instruments as oposed to focusing on the loudest thing in the complete spectrum. We’ll also employ a technique called hopping where we calculate the spectral flux not on overlapping successive sample windows instead of non-overlapping sample windows. This will further smooth the resulting spectral flux function a little bit. We are then ready to create a simple but effective peak detector that gives us the final onset times.

As always you can find all the source code at http://code.google.com/p/audio-analysis/. Over and out.

Onset Detection Part 5: The (Discrete) Fourier Transform

Warning: this is how i understood the discrete Fourier transform which might be wrong in a couple of places. For a correct explanation of the fourier transform i suggest reading http://www.dspguide.com/ch8.htm

So finally we’ve come to the dreaded Fourier transform article. Let’s review some of the things we already know. The first thing to remember is that our audio data was fetched from some kind of recording device or synthesized digitally. In any case what we have is measurements of amplitudes for discrete points in time which we call samples. We can either have mono or stereo samples, in the later case we just have two samples per point in time instead of only one. The sample frequency is the number of samples we measured per second (for stereo we count both samples for the left and right channel as a single one in this case). Here’s 1024 samples of the note A (440Hz) at a sampling rate of 44100Hz :

Note: the vertical line to the left is not part of the actual image. It’s wordpress fucking up

1024 samples at a sampling rate of 44100Hz equals a time span of 43,07ms. From this image we can see that the samples form something that resembles a sine wave pretty much. And in fact the samples were generated using Math.sin() as we did in the first article of this series, confirming that sound is nothing but a mixture of sine waves of different frequency and amplitude.

A smart mathemagicion called Fourier once proposed the theorem that all possible functions could be approximated by a series of so called sinusoids, that is sine waves with different frequencies and amplitudes (actually Gauss already knew this but Fourier got the fame).  He did all this magic for so called continuous functions which you will remember from your analysis course in school (you do remember analysis, right?). This was extended to discrete functions and guess what, our samples are just that, a discrete function in time. Now, a Fourier transform decomposes a signal into a set of sinusoids with different frequencies and amplitudes. If we apply a Fourier transform to a discrete audio signal we get the frequency domain representation of our time domain audio signal. This transform is invertable, so we can go one or the other way. From now on we only consider discrete Fourier transforms on audio signals to not overcomplicate matters.

What does the frequency domain of our time domain signal tell us? Well, put simply it tells us what frequencies contribute how much to the time domain signal. A frequency domain signal can thus tell us wheter there’s the note a at 440Hz in the audio signal and how loud this note is in respect to the complete tune. This is pretty awesome as it allows us to investigate specific frequencies for our beat detection purposes. From the frequency chart of the last article we know that various instruments have different frequencies they can produce. If we want to focus on say singing we’d have a look at the frequency range 80Hz to 1000Hz for example. Of course instruments overlap in their frequency ranges so that might lead to some problems. But we’ll care for that later.

Now, what we’ll use for our frequency analysis is called the discrete Fourier transform. Given a discrete time signal, e.g. samples measured at a specific sampling rate we get a discrete frequency domain out. As an intermediate stage we get the so called real and imaginary part of the frequency domain. There is some scary terms in the last sentence so let’s just go over that quickly. The fourier transform decomposes a time signal into coefficients for sinusoids. The real part holds the Fourier coefficients for cosines of specific frequencies (a cosine is a sinusoid) as well as for the sines of the same frequencies. For our use case this representation is not all that important. What we will use is the so called spectrum which is calculated from the real and imaginary parts of the original fourier transform.  The spectrum tells us for each frequency how much the frequency contributes to the original time domain audio signal. From now on we assume that we already have calculated this spectrum, which in the framework is done by a class called FFT (taken from Minim) which relieves some of the pain.  If you want to understand how to go from the real and imaginary parts of the original Fourier transform to the spectrum read the link above.

Not all possible frequencies will be present in the spectrum due to the discrete nature of the transform. Frequencies are binned, that is multiple frequencies are merged together into a single value. Also, there’s a maximum frequency the discrete Fourier transform can give us which is called the Nyquist frequency. It is equal to half the sampling rate of our time domain signal. Say we have an audio signal sampled at 44100Hz the Nyquist frequency is 22050Hz. When we transform the signal to the frequency domain we thus get frequency bins up to 22050Hz. How many of those bins are there? Half the number of samples plus one bin. When we transform 1024 samples we get 513 frequency bins each having a bandwidth (range of frequencies) of the Nyquist frequency divided by the number of bins except for the first and last bin which have half that bandwidth. Say we have 1024 samples we transform, sampled at 44100Hz. The first bin will represent frequencies 0 to 22050 / 513 / 2~ 21.5Hz (remember, the first bin has half the normal bandwidth). The next bin will represent the frequencies from 21.5Hz to 21.5Hz plus 22050 / 513 ~ 43Hz == 64.5 Hz (one more time, 21.5Hz to 64.5Hz, bandwidth is 43Hz). The next bin ranges from 64.5Hz to 107.5Hz and so on and so forth.

Whenever we do a discrete Fourier transform on our audio data we do it on a window of samples, e.g. 1024 samples is a common window size. The size of the window dictates the resolution of the resulting spectrum, that is the number of frequency bins in the spectrum will increase linearly with the number of samples we transform. The bigger the sample window the more fine grained the bins will be, that is the smaller their bandwidth. To calculate the spectrum we use a specific algorithm called the Fast Fourier Transform (FFT) which runs in O(n log n), pretty fast compared to a naive fourier transform implementatin which takes O(n*n). Almost all of the FFT implementations demand sample windows with a size being a power of two. So 256, 512, 1024, 2048 is fine, 273 is not. The FFT implementation we use in the framework also has this “limitation”.

So without further ado i present you the spectrum of the samples depicted in the image above

Note: the vertical lines to the left and right are not part of the actual image. It’s wordpress fucking up

Yes, that’s it. The x-axis corresponds to the frequency bins while the y-axis tells us the amplitude of a frequency bin. We clearly see a peek at the left side, the rest of the spectrum is zero. This peak corresponds to the frequency bin which contains the frequency 440Hz, the frequency of the note A we generated 1024 samples for. What we also see is that it’s not a clean peak, bins to the left and right also receive some amplitude. This is called leakage and is something we can’t solve. In our case it’s not a big concern really but in other application scenarios it might be. Here’s the program that generated the above two images:

The first couple of lines should be familiar, we simply generate 1024 samples at a sampling rate of 44000Hz. There’s actually only 2 interesting lines in this program. The first one is the one where we instantiate the FFT object. As parameters for it’s constructor it wants the sample window size we’ll use to generate the spectrum as well as the sampling rate. The next line performs the Fourier transform as well as the spectrum calculation. All we have to do is pass in a float array of samples having the size we specified in the FFT constructor. That’s all we need to do to get our precious frequency bins. What’s left is plotting the samples as well as the spectrum. Note the call to fft.getSpectrum(). This method will return the spectrum of the last fourier transform we did by calling FFT.forward().

Pfuh, that was pretty nasty. I really suggest reading up on the discrete Fourier transform. It’s an essential mathematical tool for all audio engineers. And while most audio engineers will use some sort of toolbox as we did it doesn’t hurt to know your way around.

We now are equipped with all we need to get our onset/beat detector going. Stay tuned for the next article.