Onset Detection Part 1: The Basics

With this article i want to start a small step by step series which will allow you to build your own onset detector for your music game needs. We’ll start with the basics. Some thing like reading in various music file formats i’ll leave to you as there is a lot of material out there for this already. What i will present are some of the steps i used to get my onset detector going. Here’s a couple of videos i took with fraps today which show you 90% of the beat detector for various genres:

King Crimson – Three of a perfect pair

Some Mozart

Make sure to watch the videos in HD as you won’t be able to see the details otherwise. Also, Fraps fucked up a lot of times so there’s a lot of hangs in there. What this videos show you are 3 detection functions on 3 frequency subbands. We’ll see what that means later in this series. For now all you need to know that each red peak above a green line is essentially an onset, be it a drum onset, a guitar onset or a violing onset. The system is adaptive to the genre of music automatically and will pick the most representative rythmic onsets it can find. I checked around 20 different songs with this detection scheme and it worked out very well. Additionally it’s extremely simple to implement. Let’s start with the basics though.

Sound can be thought of as waves within a medium like air, water and so on. When we record sound we record the pressure of the medium on the microphone, more precisely the waves amplitude. Digital recording goes one step further as it has to discretize this recording process. The microphone is checked at a very high frequency, the higher the frequency the better the quality of the recording. This frequency is also known as sampling rate. Standard CD audio has a sampling rate of 44100Hz (Herz). This means that the we get a measurement of amplitude of our microphone 44100 times per second. Measuring in this context is usually called sampling, each sampling step gives us one so called sample. The sample represents the amplitude measured by the microphone at the given time. The value of a sample has to be stored somehow. This is were the so called bit-depth and encoding comes into play. The bit-depth tells us how many bits are used to represent the amplitude value of a sample. This could be a single byte, a short (2 bytes) or a word (4 byte). Anything above is usually not used. We of course also have to define what value range the amplitude can have. It can be signed or unsigned, most often signed formats are used. A sample can also be stored either as an integer or a normalized floating point value. Integer samples have a range dependant on the number of bytes used for a sample. A 16-bit signed sample would therefor have a value range of −32,768 to 32,767. Float samples are almost always normalized to a range between -1 and 1 which makes working with them a bit easier than with their integer counter parts. We can easily convert between 16-bit signed integer samples and 32-bit float samples by division and multiplication (i leave the details to you :)).

To get a feeling for how much storage is needed for pure PCM data (== sample in some bit-depth with some encoding, e.g. integer, float etc.) Let’s assume a song in mono (only one PCM channel) with a sampling rate of 44100Hz. The samples are stored as 16-bit signed integers. The song has a total runtime of 180 seconds. For each second of music we have 44100 samples due to the sampling rate. Each sample takes up 16-bit or 2 bytes. That’s 88200 bytes for a second of music or roughly 88Kb. The song is 180 seconds in length so we have 88100*180 = 15876000 bytes or roughly 15MB. Wow, quiet a lot of data for such a punny song. If we store the samples as 32-bit floats we double the needed memory, so that’s 30MB.

Most popular formats therefor use some kind of compression for the PCM data to get that size down to a managable and transferable size. Formats like MP3 or Ogg Vorbis do this job quiet well at the cost of losing some of the original quality. The audio compression algorithms of these formats are lossy, meaning the original data can not be recovered a 100% after decompression. The compression algorithms are constructed so that this loss is not audible to most listeners.

When we want to analyse music a compressed format is of no use to us. We have to get the straight PCM data somehow. That’s what decoders are for. There’s a gazillion of libraries out there in Java, C, C#, what have you that do this job for you. I won’t go into details on how to use a specific decoder here as there’s a lot of material on that matter out there. For MP3s there’s libmad and libmpg123 written in C out there, in Java you have JLayer. For Ogg Vorbis there’s libvorbis for C and Jorbis for Java. I urge you to chose your poison and read the documentation and example codes that come with all of those libraries. Most of them are really rather easy to use.

Let’s assume we have the PCM data in memory, fetched from a decoder. Most songs will be in stereo so the first thing we do is merging the two channels by averaging each left sample with it’s corresponding right sample. The PCM data will also probably be in 16-bit signed integer format so we’ll also convert this to a float array. The result of our function will be a float array which holds the n successive PCM samples. Here’s a bit of code

We put in two short arrays, one holding the samples for the left channel the other for the right channel, average the two channels and divivde the averaged sample by 32768.0f to get the amplitude in the range [-1,1] as a float (technically we should divide negative samples by 32768 and positive samples by 32767 due to the signed range of a short but we are lazy).

So what can we do with this now? Let’s first see how we can get the sample for a specific point in time, say 10 seconds into the song. To do this we need to know the sampling rate. The decoder library will happily provide us with this information, usually we’ll get a value of 44100Hz for almost all MP3s and Oggs. From now on we’ll assume a fixed sampling rate of 44100Hz. We remember that the sampling rate tells us how many samples have been taken per second. At 44100Hz each sample encodes 1 / 44100th of a second or ~0.00002 seconds. If we have our desired sample position given in seconds calculating the index of the sample in the float array is a piece of cake:

We simply divide our time given in seconds by the sampling rate to get the index for the sample that encodes the amplitude of that point in time. And that my dear friends is almost everything you need to know if you only care for outputting PCM data to the sound card (which a library like Java Sound or the mediaframework on Android does internally for you). On Android you could send PCM data directly to the sound card via the AudioTrack class, a similar mechanism also exists for Java Sound.

Just for fun let us create the PCM data for a mono sound at 440Hz. This frequency is equivalent to the note A in music. Sound is nothing more than waves which we usually describe as a so called sinusoid. In reallife a sound signal is the sum of many many such sinusoids which, added together, form the overall audio of a sound signal. When we want to create a a sound for the note A we simply have to generate a sinusoid that has a period of 1/440 seconds.

The x-Axis in this graph is time, the y-Axis the amplitude of our sine wave. The period is 0.002 seconds. By period we mean the time needed from one peek (either on the positive y-Axis or the negative y-Axis) to the next one.

Let’s write a function which can generate PCM data for any note in mono given the note’s frequency, the sampling rate and the duration of the note in seconds:

Now that is not a lot of code but the few expressions in it seem somewhat mysterious. Let’s break it down. We first create a float array which will hold our PCM samples. The length of the array is given by the duration times the samplingRate, that is the numbers of samples per second. If we have a duration of 5 seconds we’d neet 44100 samples times 5. Not to hard. The next line calculates an increment. An increment of what? Of an angle we’ll later on pass to Math.sin(). As i said earlier we are working with sinusoids, or sine waves. To generate those we use the sine function. The sine function in Java takes an angle in radians and outputs the sine for this angle. A complete revolution and therefor period is 2 * PI radians long, 360° in angle notation. If we wanted a 1Hz sound we’d go one full circle for 44100 samples. For 440Hz we need to go 440 full circles and so on. Thus we first multiply 2 * PI times the frequency. The division by the sampling rate will give us the final increment we add to the current angle for the current sample. In the loop we set the current sample to the sine of the current angle and increment the angle by the value we calculated before. And tada we now know everything we need to write a synthesizer. A synthesizer does nothing more than generating PCM samples with methods very similar to the above. By adding together a couple of functions a synthesizer can produce quiet interesting and complex sounds. The base of it all is the sine though (or a saw function, or a rectangular function). What this function lacks is a way to manipulate the loudness or volume of the generated sine wave. The Math.sin() function gives us the sine of the unit circle, so we get the loudest possible amplitude in the range [-1,1]. If we wanted to make the generated sound half as loud we’d simply multiply the sine by 0.5. To get a quater of the full loudness we multiply by 0.25 and so on. Pretty simple huh?

That concludes the first part of this series. Next time we’ll see what all this fourier transform business is about. Don’t worry, we will not use any advanced math, not even sines :). Now go out, research how the Java sound library works and make your ears blead with self generated sine waves!

6 thoughts on “Onset Detection Part 1: The Basics

  1. Hi,

    I’m reading your sound processing articles, and I really enjoy them. Thanks for all those explanations.

    Just a few trivial notes:

    – You use the notation Mb for MegaBytes; MB is more appropriated (Mb could stand for Megabits)
    – There is a typo in one of your code: samlpingRate instead of samplingRate
    – I also liked the “We can easily convert between 16-bit signed integer samples and 32-bit float samples by division and multiplication (i leave the details to you 🙂 ).” where finally you explain in details what to do 🙂

    Anyway, great article!

  2. shouldn’t the index calculation rather be done this way:

    int index = (int)(time * samplingRate);

    time / samplingRate is always 0.0…

    anyways great series! 😉

  3. Orly?

    5 / 44100 == 0.0?

    Excellent tutorials though. Good to get some meat that’s chewable for beginners and not having to rely on third party libraries so much.

  4. Snap. While time / samplingRate doesn’t equal 0.0, it should indeed be calculated by multiplying time by the sampling rate.

  5. It all works on desktop version, but if i compile it for android and run it, it says: noclassdeffound cause of MP3AudioFileReader
    Anyone else got this problem?

Leave a Reply

Your email address will not be published.