In the last article we saw how to reduce a complex spectrum that evolves over time to a simple one dimensional function called the spectral flux. On the way we did some improvements to the spectral flux function like rectifying it and enabling Hamming smoothing of the audio samples. In this article we will talk about a so called threshold function which we derive from the spectral flux function.

That’s a rectified, Hamming smoothed spectral flux function. It’s for an excerpt of the song “Judith” by “A Perfect Circle” and already shows quiet some peaks which we could directly interpret as onsets. However, there’s still some noise in that function, especially when the crashs are hit for example. We can’t do any better filtering, like for example smoothing out the function a bit as this would get rid of a lot of the small peaks. Instead we do something totally simple: we calculate a threshold function which is derived from the spectral flux function. This threshold function is then later used to discard spectral flux values that are noise.

The threshold function is actually pretty simple. Given our spectral flux function we calculate the mean or average value of a window around each spectral flux value. Say we have 3000 spectral flux values, each specifying the spectral flux of a 1024 sample window. 1024 samples at a sampling rate of 44100Hz equal a time span of around 43ms. The window size we use for the threshold function should be derived from some time value, say we want the average spectral flux of a time span of 0.5 seconds. That’s 0.5 / 0.043 = 11 sample windows or 11 spectral flux values. For each spectral flux value we take the 5 samples before it, 5 samples after it and the current value and calculate their average. We thus get for each spectral flux value a single value we call the threshold (for reasons that will become apearant shortly). Here’s some code that does this:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
public class Threshold { public static final String FILE = "samples/explosivo.mp3"; public static final int THRESHOLD_WINDOW_SIZE = 10; public static final float MULTIPLIER = 1.5f; public static void main( String[] argv ) throws Exception { MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE ) ); FFT fft = new FFT( 1024, 44100 ); fft.window( FFT.HAMMING ); float[] samples = new float[1024]; float[] spectrum = new float[1024 / 2 + 1]; float[] lastSpectrum = new float[1024 / 2 + 1]; List<Float> spectralFlux = new ArrayList<Float>( ); List<Float> threshold = new ArrayList<Float>( ); while( decoder.readSamples( samples ) > 0 ) { fft.forward( samples ); System.arraycopy( spectrum, 0, lastSpectrum, 0, spectrum.length ); System.arraycopy( fft.getSpectrum(), 0, spectrum, 0, spectrum.length ); float flux = 0; for( int i = 0; i < spectrum.length; i++ ) { float value = (spectrum[i] - lastSpectrum[i]); flux += value < 0? 0: value; } spectralFlux.add( flux ); } for( int i = 0; i < spectralFlux.size(); i++ ) { int start = Math.max( 0, i - THRESHOLD_WINDOW_SIZE ); int end = Math.min( spectralFlux.size() - 1, i + THRESHOLD_WINDOW_SIZE ); float mean = 0; for( int j = start; j <= end; j++ ) mean += spectralFlux.get(j); mean /= (end - start); threshold.add( mean * MULTIPLIER ); } Plot plot = new Plot( "Spectral Flux", 1024, 512 ); plot.plot( spectralFlux, 1, Color.red ); plot.plot( threshold, 1, Color.green ) ; new PlaybackVisualizer( plot, 1024, new MP3Decoder( new FileInputStream( FILE ) ) ); } } |

Not a lot of new things in there. First we calculate the spectral flux function as we did in the last article. Based on this spectral flux function we then calculate the threshold function. For each spectral flux value in the ArrayList *spectralFlux* we take the *THRESHOLD_WINDOW_SIZE* spectral flux values before and after it and calculate the average. The resulting mean value is then stored in an ArrayList called *threshold*. Note that we also multiply each threshold value by *MULTIPLIER* which is set to 1.5 in this example. After we finished calculating all our functions we plot them. The result looks like this:

What we just did is calculating a running average of the spectral flux function. With this we can detect so called outliers. Anything above the threshold function is an outlier and marks an onset of some kind! It should also be clear why the threshold function values are multiplied by a constant > 1. Outliers must be bigger than the average, in this case 1.5 times bigger than the average. This multiplier is an important parameter of our onset detector as it governs the sensitivity. However, when i apply the detector to songs i try to come up with a single value that works for all of them. I actually did that and arrived at the magic 1.5 multiplier used above. It works well for all the samples that you can find in the svn as well as a lot of other songs i tested.

Now let us combine the spectral flux function and the threshold function. Basically we want a prunned spectral flux function that only contains values that are bigger or equal to the threshold function. We extend the above program by the following lines:

1 2 3 4 5 6 7 |
for( int i = 0; i < threshold.size(); i++ ) { if( threshold.get(i) <= spectralFlux.get(i) ) prunnedSpectralFlux.add( spectralFlux.get(i) - threshold.get(i) ); else prunnedSpectralFlux.add( (float)0 ); } |

The variable *prunnedSpectralFlux* is just another ArrayList. The loop is pretty simple, we add zero to the *prunnedSpectralFlux* list of the current spectral flux is less than the corresponding threshold function value or we add the spectrul flux value minus the threshold value at position *i*. The resulting prunned spectral flux function looks like this:

Awesome, we are nearly finished! All that is left is to search for peaks in this prunned spectral flux. A Peak is a value that is bigger then the next value. That’s all there is to peak detection. We are nearly done. Let’s write the small code for the peak detection producing a peak ArrayList:

1 2 3 4 5 6 7 |
for( int i = 0; i < prunnedSpectralFlux.size() - 1; i++ ) { if( prunnedSpectralFlux.get(i) > prunnedSpectralFlux.get(i+1) ) peaks.add( prunnedSpectralFlux.get(i) ); else peaks.add( (float)0 ); } |

And that’s it. Any value > 0 in the ArrayList *peaks* is an onset or beat now. To calculate the point in time for each peak in *peaks* we simply take its index and multiply it by the time span that the original sample window takes up. Say we used a sample window of 1024 samples at a sampling rate of 44100Hz then we have the simple forumula *time = index * (1024 / 44100)*. That’s it. Here’s the output:

We are done. Kind off. That’s the most basic onset detector we can write. Let’s review some of it’s properties. As with many systems we have a few **parametes** we can tweak. The first one is the **sample window** size. I almost always use 1024. Lowering this will result in a more fine grained spectrum but might not gain you much. The next parameter is wheter to use **Hamming smoothing** on the samples before doing the FFT on them. I did see a bit of improvement turning that on but it will also cost you some cycles calculating it. This might be of importance if you plan on doing this on a mobile device with not FPU. Every cycle saved is a good cycle there. Next is the threshold window size. This can have a huge impact on the outcome of the detection stage. I stick to a window of 20 spectal flux values in total as in the example above. This is motivated by calculating how big of a time span those 20 values represent. In the case of a sample window 1024 samples at a sampling rate of 44100Hz that’s roughly have a second worth of data. The value worked well on all the genres i tested the simple detector on but your mileage may vary. The last and probably most important parameter is the multiplier for the threshold function. To low and the detector is to sensitive, to high and nothing get’s cought. In my day job i sometimes have to do outlier analysis which is pretty similar to what we do here. Sometimes i use simple threshold based methods and a value between 1.3-1.6 seems to be some sort of magic number when doing statistics based outlier detection relative to some means. I’d say that the threshold multiplier and the threshold window size should be the only parameters you tweak. Forget about the rest and use the defaults of 1024 for sample window size and Hamming smoothing on. Doing a parameter seach for to many parameters is no fun so boiling it down to “only” two makes sense in this case. The defaults i gave you above work reasonably well with the detector we developed in this series. If you do multi-band analysis you might have to tweak them individually for each band.

What’s left is some improvements like doing this complete process not for the whole spectrum but sub-bands so that we can do onset detection for various instruments (sort of, as we know they overlap in frequency range). Another improvement is to use overlapping sample windows, say by 50%. This smooths out the Fourier transform a bit more and gives better results. One could also try to do dynamic range compression on the whole audio signal before passing it to the FFT and so on and so forth. The code in the framework allows you to experiment easily with various additional steps. You will also want to maybe clean up the resulting peak list and remove any peaks that are to close together, say < 10ms. Learning by doing is the big mantra. Now go out and write awesome music games!
Here's a couple of papers and links i found very useful in the process of getting this to work:
http://www.dspguide.com/ An awesome online book (also available as hard cover) called *“The Scientist and Engineer’s Guide to Digital Signal Processing”* featuring anything you’ll ever want to know about digital signal processing (as an engineer :)).

https://ccrma.stanford.edu/~jos/dft/ Another online book called *“Mathematics of the discrete Fourier Transform with audio applications” coping with anything related to the discrete Fourier Transform. Pretty heavy on math but a very good read.
http://old.lam.jussieu.fr/src/Membres/Daudet/Publications_files/Onset_Tutorial.pdf A very good scientific paper called “A tutorial on onset detection for musical signals”. Features various onset detection functions (we used the un-banded spectral flux here) as well as some data on the performance of various functions on a standardized data set (Mirex 2006 Onset Detection Challenge). Look out for any other papers by Bello, he’s a big player in the field it seems.
http://www.dafx.ca/proceedings/papers/p_133.pdf “Onset detection revisited” by Dixon, another big name in the field (at least it seems so from the citation counts, but we all know how much they really tell you…). Very good paper, must read.*

If you check out the citation graph of the two papers mentioned on google schoolar you will many more papers talking about onset detection that might interest you. Google *is* your friend in this case. Always read your fine papers!

Simply excellent article. So simple and still explains the main concepts so that you can really follow the idea. I’ll try to implement this algorithm on biological data. Thanks!

Glad if it is of any help ðŸ™‚

“1024 samples at a sampling rate of 44100Hz equal a time span of around 43ms”

1024 samples / 44100 samples/s = 23 ms

Oh dear, yes indeed ðŸ™‚ Thanks for pointing that out, changing.

hi,

Could you plec give me the code for peaks detection? I am recording a wav from microphone, and implement fft to detect peaks. But i don;t get all the peaks because of an amplitude threshold which is to high or to low ( then i detect too many peaks)

An ideea would be really helpfull. I use radix 2 for fft.

Need help pls

Nice plots! I am working on the visualization of a rhythm game.

Do you have another tutorial on how you did the plot?

Or is there any webpage I could use as reference?

Thanks a lot!

I am terribly sorry. The answer is in your code. Apology for asking useless question.

This series of tutorial articles is simply great! I love it!

But there is one little thing I don’t understand clearly, namely this part:

“I stick to a window of 20 spectal flux values in total as in the example above. This is motivated by calculating how big of a time span those 20 values represent. In the case of a sample window 1024 samples at a sampling rate of 44100Hz thatâ€™s roughly have a second worth of data.”

Where is a window with a size of 20? At the avarage calculation, the “window” is 11 samples long, and 1024 at the FFT computation. So, what window is 20 (what?) long? Where is that “example”?

Thanks!

Attila

Oh, and I forgot: How does it worth about a second? 1024 / 44100 * 20 = 0.4644 (sec), is it?

Hi! This was a really GREAT article! I dont know why I wasnt able to come upon this a bit sooner. Currently, I am also working on Onset Detection but only with the Time-Domain. I have tried various filters to try and improve and just seems to yield not so good results, either too many ghost notes or too many missed notes. This will really help me because I just didnt know before this how to work with the spectral energy, which so often comes up in the papers that I read. Thanks very much for this.

I do have a question, regarding: “As with many systems we have a few parametes we can tweak. The first one is the sample window size. I almost always use 1024. Lowering this will result in a more fine grained spectrum but might not gain you much.”

Isn’t increasing the blocksize for the FFT result in the more fine grained spectrum? Because I use 8192 blocksize for my FFT, since I am also doing Pitch Detection, so I would just like to clarify this point.

Thanks again!

Excellent Article.

Thanks.

If you are ever near Edinburgh, I’ll buy you a pint. Or two. Awesome stuff! Thanks so much!

Looking at this code:

float mean = 0;

for( int j = start; j <= end; j++ )

….mean += spectralFlux.get(j);

mean /= (end – start);

I believe it should be: mean /= (end – start + 1);

Also, in degenerate case, (end – start) could be zero.

Your images are missing.

Awesome job Mario! Thank you very much for this series, helped me a lot… I’m doing something with this right now, and what I was looking for is to get bpm from these onsets, there is a simple way to achieve this? It would be nice to reuse my already calculated onsets, instead of looking for another bpm detect algorithm. Thanks in advance!!

Awesome article, a really good read! Thanks Mario!