Notes on Dynamic Programming Beat Tracking
Chris Tralie
Now that we have good audio novelty functions and tempo estimation schemes, we can use them to go a finer grained representation of rhythm and actually have the computer "tap its virtual foot to the beat." We designed audio novelty functions to spike for likely beat events, but not all peaks correspond to true beats (false positives), and not all beats show up as spikes (false negatives). Therefore, we somehow have to balance picking tap times in the audio novelty function that correspond to high values, while also skipping some high values or picking low values every once in a while to keep the tempo going.
Objective function
To accomplish this, we will implement what is now a classical technique by following this 2007 paper by Dan Ellis. This technique uses dynamic programming to solve an optimization problem trading off peak picking in the audio novelty function and consistency with a pre-determined tempo. First, let's define an objective function to maximize, and then we will explain how to maximize it to obtain an optimal sequence of beats using dynamic programming and backtracing. Let's start with the following parameters
- An audio novelty function n[i]
- A tempo interval T in units of the domain of the audio novelty function, specified ahead of time
- a sequence of B beats b[j], which are also specified in units of the domain of the audio novelty function, so that n[b[j]] gives the novelty function associated with the jth beat
\[ c(n, b, T) = \sum_{j = 1}^{N-1} \left( n[b[j]] - \alpha \left| \log \left( \frac{b[j]-b[j-1]}{T} \right) \right| ^2 \right) \]
Given a novelty function n and a tempo T, this function gives a score to a particular choice of beats b. The higher the score, the better the beats b are considered to be.
Since we want to maximize this function, the n[b[j]] term for each beat promotes choosing beats at locations where the novelty function is high. But we also want to make sure we keep a consistent tempo, so the second term penalizes deviations from the tempo T between chosen beat locations. If the tempo is perfectly T, then b[j] - b[j-1] will always be T, so the second term will be 0. But the moment the tempo is too fast or too slow between two adjacent beats, that term will deduct from the objective function.
Still, we may be happy to accept a slight penalty in tempo so that our beats reside at locations of high novelty. The α parameter chooses how much we want to penalize for deviations of tempo. A higher α will lead to a more consistent tempo, but the beats may not exactly align with rhythmic events. By contrast, a low α will cause the computer to tap its foot very consistently with rhythmic events, but the tempo may be all over the place. Refer to assignment 4 for some examples of different choices of α
Solving for beats
Given a tempo interval T (the interval between beats and seconds) and an audio novelty function n[i], we can solve for optimal beats via dynamic programming. We need to create two 1D arrays, cscore
and backlink
, each with as many samples as there are samples in the audio novelty function:
cscore[i]
is a dynamic programming array that stores the maximum possible value of the objective function when solving a subproblem from index 0 to index i of the audio novelty function.backlink[i]
stores the index of the last beat that was chosen before index i in the optimal solution with scorecscore[i]
It's much easier to understand what cscore
and backlink
are, and how to construct them, by looking at a visual, as shown below.
Like the objective function, cscore
is constructed by trading off fit to the audio novelty function and adherence to the tempo T. Each value of cscore
is determined recursively in terms of a previous value of cscore
. In particular, cscore[t]
is the sum of the audio novelty function at t
, plus the maximum preceding cscore
value assuming a tempo interval between T/2 and 2T. We remember the index of the maximum preceding index in backlink
.
Algorithm to construct cscore
and backlink
To fill these in, we need a loop that builds up the elements of cscore
from left to right. Below is the algorithm to do this using a recurrence relation that is solved with dynamic programming, which matches the animation above. Note that for simplicity of the description, the tempo T is assumed here to be in units of samples of the novelty function per beat, which is a period, though the tempo is provided in units of beats per minute, which is a frequency, as an argument to the method. Also, when you write the code, you will have to be careful to round and cast variables as int
when necessary.
Pseudocode
-
for i from 2T to len(cscore)
-
To solve for
cscore[i]
, the optimal score up to index i, we suppose that we're choosing a beat at index i. Then, we recursively compute the score, via dynamic programming, by checking over possible beat indices j that could represent the beat that comes directly before i. The recurrence for this is as follows:\[ \text{cscore}[i] = \text{max}_{j = i-2T}^{i-T/2} \left( n[i] + \text{cscore}[j] - \alpha \left| \log \left( \frac{i-j}{T} \right) \right|^2 \right) \]
We restrict ourselves to previous beat locations j that occur between half of the tempo and 2x the tempo before index i. We get credit for
n[i]
, the novelty function for our newest beat at index i, as well ascscore[j]
, the best score of all possible beats chosen up to and including index j. But we also deduct a penalty depending on the interval between beat j and beat i.Along with storing the maximum possible value in
cscore[i]
, we also store the the index j that led to the maximum inbacklink[i]
.
-
To solve for
-
Once the entirety of
cscore
andbacklink
is filled in, we can extract the optimal beat locations. Starting at the index that maximizescscore
, we follow the links inbacklink
to pick all of the beats. We will be finished once we get back to the beginning. For example, suppose we had the followingbacklink
array (which, for pedagogical purposes, is much short than what you will get on real audio data)backlink index 0 1 2 3 4 5 6 7 8 9 10 value 0 0 0 1 2 2 5 4 6 5 8 We can represent this array visually as a 1D directed acyclic graph
Then we simply follow the arrows until we get to 0. For example, let's say
cscore[9]
was the maximum. Then we follow the path9 -> 5 -> 2 -> 0
For another example, if
cscore[10]
was the maximum, we follow the path10 -> 8 -> 6 -> 5 -> 2 -> 0
You should of course reverse the order of this list so that the beats you return start at smaller indices, and you should also convert them from units of the audio novelty function to beats per minute.
Matlab code
You may also be interested in referencing the matlab code in the paper, but you should be careful since although Matlab is a predecessor to numpy, the syntax is different and may throw you off a bit.