In an earlier post i talked about the scientific view of onset/beat detection. There are a lot of different schemes out there that do the job more or less well. However, one approach was so intriguingly simple and performed so well compared to other, more elaborate algorithms that i chose to use it for my purposes. It’s called spectral flux or spectral difference analysis and i’ll try to give you an insight into its workings in this article.

One of the best papers on the matter of onset detection is by Bello et al., which you can view here. I’ll review some of the concepts mentioned in it and steal a couple of images.

First off, what is an onset? Looky:

At the top we see the sample amplitudes, or waveform, of a single note. The onset is the first thing to happen when playing a note; it is followed by an attack phase until the note reaches its maximum amplitude, and then by a decay phase. I refrain from defining the transient as it is of no concern for us.

The audio signals we want to process are of course a tiny bit more complex than a single note. We want to detect onsets in polyphonic music containing pitched instruments like guitars or violins, vocals, and non-pitched drums. As we have seen in the frequency chart a couple of articles back, most of the pitched instruments overlap in their frequency range (mostly between 70Hz and 4000Hz). Perfect onset detection is thus impossible; however, we can try to get an approximate estimation.

From what i could gather from the papers i read on the topic, onset detection almost always follows the same schema. First you read in your audio signal, next you transform it to an onset detection function (just another array of floats corresponding to your original audio samples in some way) and finally you pick the peaks in this detection function as onsets/beats. The first stage we already talked about, and it turned out to be pretty easy. The second stage of the process is often pretty elaborate, doing some magic to transform the audio signal into another simpler, compressed one-dimensional function. The last stage is not covered in detail in most of the literature. However, while i was working on a prototype it turned out that this stage is probably the most important one, even more important than the calculation of the onset detection function. In a future article we’ll concentrate on peak detection; for now let’s stick to onset detection functions.

Here’s a nice little chart for the general process of onset detection, again taken from Bello et al.

The pre-processing stage could for example do dynamic range compression, that is, making everything equally loud, a shitty habit used in almost all mixes of modern day music. However shitty DRC might sound, it is beneficial for some onset detection functions. We won’t use one that benefits from it, so we don’t care about pre-processing.

The reduction phase is equal to constructing the onset detection function. The goal of this stage is to get a compressed view of the structure of the musical signal. The onset detection function should be able to catch musical events and make it easier to detect onsets compared to trying that on the plain waveform of the audio signal. The output of the onset detection function is a value for each successive time span, indicating the presence or absence of a musical event in that span. Remember how we read in samples for playback: we read in 1024 samples in each iteration. For such a sample window an onset detection function outputs one value. For one second of music sampled at 44100Hz and using 1024 samples as the time span for the onset detection function, we’d get roughly 43 values (44100/1024) that model the structure of the audio signal. Coincidentally, the sample window size of 1024 can be processed by our FFT class. To summarize: generating the values of the onset detection function means calculating a value for each successive sample window, which gives us a hopefully nice curve we can perform peak detection on.
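To make that arithmetic concrete, here’s a tiny throwaway sketch (just numbers, no audio involved) that computes how many detection function values we get per second:

```java
public class WindowsPerSecond {
    public static void main(String[] args) {
        int sampleRate = 44100; // samples per second of audio
        int windowSize = 1024;  // samples per detection window

        // number of onset detection function values per second of audio
        float windowsPerSecond = (float)sampleRate / windowSize;
        System.out.println(windowsPerSecond); // roughly 43
    }
}
```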

Now, i introduced the FFT in the last article so we are going to use it for generating the onset detection function. What’s the motivation for this? Listen to the following excerpt from “Judith” by “A Perfect Circle” and have a look at the image below:

[audio:http://apistudios.com/hosted/marzec/badlogic/wordpress/wp-content/uploads/2010/02/judith.mp3]The image depicts the spectra of successive 1024-sample windows of the audio signal. Each vertical pixel column corresponds to one spectrum. Each pixel in a column stands for a frequency bin. The color tells us how much that frequency bin contributes to the audio signal in the sample window the spectrum was taken from. The values range from blue to red to white, increasing in that order. For reference the y-axis is labeled with the frequencies and the x-axis with the time. And holy shit batman, we can see beats! I took the liberty of marking them with awesome arrows in Paint. The beats are visible only in the high frequency range, from a bit above 5000Hz to 16000Hz. They correspond to the hi-hat being played. When you hit a hi-hat it produces something called white noise, a sound that spans the complete frequency spectrum. Compared to other instruments it has very high energy when hit. And we can clearly see that in the spectrum plot.

Now, processing this spectrum plot wouldn’t be a lot easier than going for the wave form. For comparison here’s the waveform of the same song excerpt:

We can see the hi-hat hits in the waveform too. So what’s the benefit of going the Fourier transform route? Well, it allows us to focus on specific frequencies. For example, if we are only interested in detecting the beats of the hi-hat or the crash or ride, we’d go for the very high frequency range only, as this is where those instruments will leave a note of their presence. We can’t do that with the amplitudes alone. The same is true for other instruments like the kick drum or the bass, which we’d look for in the very low frequencies only. If we inspect multiple frequency ranges of the spectrum separately we call it multi-band onset detection. Just to give you some keywords for Mr. Google.

So we know why we do our onset detection in the frequency domain, but we are not any wiser on how to compress all the information of the spectrum plot into a manageable one-dimensional array of floats we can easily do peak detection on. Let me introduce you to our new best friend, the **spectral difference**, a.k.a. **spectral flux**. The concept is incredibly simple: for each spectrum we calculate the difference to the spectrum of the previous sample window, bin by bin, and sum those differences up. That’s it. What we get is a single number that tells us the difference between the bin values of the current spectrum and the bin values of the last spectrum. Here’s a nice little formula (so i get to use the latex wordpress plugin):

[latex size="2"]SF(k) = \displaystyle \sum_{i=0}^{n-1} s(k,i) - s(k-1,i)[/latex]

*SF(k)* gives us the spectral flux of the kth spectrum. *s(k,i)* gives us the value of the ith bin in the kth spectrum; *s(k-1,i)* is analogous for the spectrum before *k*. We subtract the value of each bin of the previous spectrum from the value of the corresponding bin in the current spectrum and sum those differences up to arrive at a final value, the spectral flux of spectrum *k*!
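To see the formula in action, here’s a toy example with made-up bin values (four bins instead of the 513 we get from a 1024-sample FFT, purely for illustration):

```java
public class FluxToy {
    public static void main(String[] args) {
        // hypothetical bin values of two successive spectra
        float[] lastSpectrum = { 1.0f, 2.0f, 0.5f, 0.0f };
        float[] spectrum     = { 2.0f, 2.5f, 0.5f, 1.0f };

        // SF(k) = sum over all bins of s(k,i) - s(k-1,i)
        float flux = 0;
        for (int i = 0; i < spectrum.length; i++)
            flux += spectrum[i] - lastSpectrum[i];

        System.out.println(flux); // 1.0 + 0.5 + 0.0 + 1.0 = 2.5
    }
}
```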

Let’s write a small program that calculates and visualizes the spectral flux of an mp3:

```java
import java.awt.Color;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

public class SpectralFlux {
    public static final String FILE = "samples/judith.mp3";

    public static void main(String[] argv) throws Exception {
        MP3Decoder decoder = new MP3Decoder(new FileInputStream(FILE));
        FFT fft = new FFT(1024, 44100);
        float[] samples = new float[1024];
        float[] spectrum = new float[1024 / 2 + 1];
        float[] lastSpectrum = new float[1024 / 2 + 1];
        List<Float> spectralFlux = new ArrayList<Float>();

        while (decoder.readSamples(samples) > 0) {
            fft.forward(samples);
            System.arraycopy(spectrum, 0, lastSpectrum, 0, spectrum.length);
            System.arraycopy(fft.getSpectrum(), 0, spectrum, 0, spectrum.length);

            float flux = 0;
            for (int i = 0; i < spectrum.length; i++)
                flux += (spectrum[i] - lastSpectrum[i]);
            spectralFlux.add(flux);
        }

        Plot plot = new Plot("Spectral Flux", 1024, 512);
        plot.plot(spectralFlux, 1, Color.red);
        new PlaybackVisualizer(plot, 1024, new MP3Decoder(new FileInputStream(FILE)));
    }
}
```

We start off by instantiating a decoder that will read in the samples from an mp3 file. We also instantiate an FFT object, telling it to expect sample windows of 1024 samples and a sampling rate of 44100Hz. Next we create a float array for the samples we read from the decoder. We use a window size of 1024 samples. The next two lines instantiate auxiliary float arrays that will contain the current spectrum and the last spectrum. Finally we also need a place to write the spectral flux of each sample window to; an ArrayList will do the job here.

The following loop does the nasty job of calculating the spectral flux values. First we read in the next 1024 samples from the decoder. If that succeeded we calculate the Fourier transform and spectrum of this sample window via a call to *fft.forward(samples)*. We then copy the previous spectrum to the array *lastSpectrum* and the spectrum we just calculated to the array *spectrum*. What follows is the implementation of the formula above. For each bin we subtract the value of the last spectrum’s bin from the current spectrum’s bin and add it to a sum variable called *flux*. When the loop has finished, *flux* contains… the spectral flux of the current spectrum. This value is added to the *spectralFlux* ArrayList and we continue with decoding the next sample window. We repeat this for each sample window we can read in from the decoder. After the loop is done, *spectralFlux* contains a spectral flux value for each sample window we read in. The last three lines just create a plot and visualize the result. Note: i created a small helper class called PlaybackVisualizer which takes a plot, the number of samples per pixel and a decoder; it will play back the audio from the decoder while setting the marker in the plot correspondingly. Nothing too complex, we saw how to do that in one of the earlier articles. Here’s the output for the song “Judith” we used above already:

Hurray! We can again see peaks. This time they not only correspond to the hi-hat but also to the snare and the kick drum. The reason for this is that we use the complete spectrum, so the spectral flux values are governed by the instruments that have the most energy, in this case hi-hat, snare and kick drum. Now let’s review what we just did. Take the original spectrum image from above where i marked the hits of the hi-hat. What the code essentially does is sum up the differences of the bins of two successive spectra. When the current spectrum has more overall energy than the previous spectrum, the spectral flux function will rise. If the current spectrum has less energy than the previous spectrum, the spectral flux function will fall. So we successfully mapped the 2D spectrum plot to a 1D function we can perform peak detection on. Well, sort of…

Before we do any peak detection we have to clean up the spectral flux function a little bit. Let’s start with ignoring negative spectral flux values. We are not interested in falling spectral flux, only in rising spectral flux. The process of discarding the negative spectral flux values is called rectification. Here’s the modified loop for calculating the spectral flux:

```java
float flux = 0;
for (int i = 0; i < spectrum.length; i++) {
    float value = (spectrum[i] - lastSpectrum[i]);
    flux += value < 0 ? 0 : value;
}
spectralFlux.add(flux);
```

And the corresponding output:

A bit nicer. Rectifying the spectral flux will aid us when we calculate a threshold function for peak detection.

Another clean-up step is using a so-called Hamming window. This Hamming window is applied to the samples before calculating the Fourier transform. It’s basically a smoothing function that will make our samples look a bit nicer. The FFT class has a function by which we can enable Hamming window smoothing:

```java
fft.window(FFT.HAMMING);
```

That’s all there is to do, here’s the result:

Not a big difference, but we can see how it equalizes the first 4 hits a little bit while removing one at the end of the sample. In other songs the Hamming window has a bigger effect on the resulting spectral flux, but it’s really up to you whether to use it or not. It doesn’t cost a lot of performance.
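In case you’re curious what happens when you enable Hamming smoothing: the idea is to multiply each sample window by the Hamming window coefficients before transforming, so the samples at the window edges are faded out. Here’s a minimal sketch of the standard textbook formula (this is an assumption about what the FFT class does internally, not its actual source):

```java
public class HammingSketch {
    // standard Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1))
    public static float[] hamming(int size) {
        float[] w = new float[size];
        for (int n = 0; n < size; n++)
            w[n] = 0.54f - 0.46f * (float)Math.cos(2 * Math.PI * n / (size - 1));
        return w;
    }

    public static void main(String[] args) {
        float[] w = hamming(1024);
        // before fft.forward(samples) each sample would be scaled:
        // samples[i] *= w[i];
        // edge samples get scaled down to 0.08, the middle passes almost unchanged
        System.out.println(w[0] + " " + w[512]);
    }
}
```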

With this i want to conclude this article. We have seen how to go from samples to spectra to a 1D function that we’ll later use to detect the peaks, and therefore onsets, in an audio signal. From the images above we can already see a lot of peaks which seem easy to detect. In the next article we’ll go a step further and calculate the spectral flux of different frequency bands. By this we hope to be able to extract various instruments, as opposed to focusing on the loudest thing in the complete spectrum. We’ll also employ a technique called hopping, where we calculate the spectral flux on overlapping successive sample windows instead of non-overlapping ones. This will further smooth the resulting spectral flux function a little bit. We are then ready to create a simple but effective peak detector that gives us the final onset times.

As always you can find all the source code at http://code.google.com/p/audio-analysis/. Over and out.