Assignment 5: Let It Bee (45 Points)

Chris Tralie

Click here to listen to the musical statements

Background / Logistics

We saw in class that it is possible to use Nonnegative Matrix Factorization (NMF) to decompose an audio clip into a set of K sonic source templates stored in a win_length x K matrix W, as well as a matrix of activations over time for each of these sources stored in a K x nwin matrix H, so that the matrix multiplication WH approximates an absolute magnitude spectrogram V. The main application we focused on was "unmixing," or separating an audio track into its different instrument components. This is also sometimes referred to as the "cocktail party problem," since we're trying to filter out one sound from the superposition of many, just like one might try to focus on the speech coming from the person in front of them in the midst of a cacophony of sound at a cocktail party.

In addition to audio unmixing and learning musical instrument templates from audio, the mathematics that were developed for NMF can also be used to create a novel instrument for musical expression. In addition to being given a spectrogram V, we are also given the matrix W of templates, which remains fixed, and our job is only to learn H. In this way, we can think of the problem not as one of unmixing, but of learning how to activate a set of templates we're given to best match a target V. This is referred to as "musaicing" in this paper, and you will be following that paper in the assignment. The musaicing technique in that paper is referred to as the "Let It Bee" technique, and it earned its name from a demonstration that used the spectrogram of a clip of "Let It Be" by The Beatles as V and the spectrogram of a bunch of bees buzzing as W:

V: Let It Be Spectrogram

W: Bees Buzzing Spectrogram

Learning H and inverting W*H: Let It Bee

In this assignment, you will implement the "Let It Bee" pipeline step by step and then use it to create musical statements.

Learning Objectives

  • Practice numpy arrays, methods, and for loops in the service of musical applications
  • Modify algorithms for nonnegative matrix factorization
  • Programmatically promote temporal continuity in audio reconstructions
  • Compose music in a cutting-edge style

What To Submit

When you are finished, submit any Python files and notebooks you created to Canvas, along with an audio file for your musical statement and all of the audio files needed to run the code that created it. Finally, indicate a title for your musical statement, the name/pseudonym you'd like to use in the music gallery on our class web site, and the names of any buddies you worked with.

Programming Tasks

There is no starter code for this assignment, and you will have complete control over the code! You may find it helpful to look back at the code/notes from class on NMF. I will do my best to explain the steps below, but you should also refer to the original "Let It Bee" paper for tips.


Step 1: Basic Setup (10 Points)

Your Task

As a first working prototype for musaicing, create a method that performs the following steps:

  1. Take the STFT of templates audio (e.g. bees clip) using librosa.stft and call this WComplex
  2. Take the magnitude of the above and call it W
  3. Compute the magnitude of the STFT of the target clip (e.g. Beatles clip) as V
  4. Keeping W fixed, perform L=50 iterations of Kullback-Leibler divergence NMF to update H. This is the meat of the musaicing method: it estimates how to activate columns of W (the bees buzzing) to best approximate columns of V (The Beatles)
  5. Multiply WComplex by H to create the complex spectrogram S for the musaic
  6. Invert S using librosa.istft to get your audio samples back in the time domain that you can listen to

The code below shows how you might set this up in a notebook:

The result is as follows.

This leaves a lot to be desired, so we will be improving it step by step in the assignment.

Step 2: Griffin Lim Phase Improvement (5 Points)

One problem with a simple nonnegative matrix factorization for musaicing is that each window is treated independently in the objective function. Furthermore, the activations are only learned for the magnitudes W, and we're reconstructing a sound with the complex STFT WComplex, which includes phases that were completely ignored when learning H. To improve phase continuity from one window to the next, we can perform several iterations of the Griffin-Lim algorithm on the complex spectrogram S = WComplex*H before returning the audio, rather than just doing a straight inverse STFT.

Your Task

Look back at what you did on assignment 3, and use similar code with librosa.stft and librosa.istft to perform 10 iterations of Griffin-Lim before doing the final inverse STFT. Once you've done this, the result will improve slightly, as shown below

Step 3: Avoiding Repeated Activations (6 Points)

One of the issues with the above sound is that we hear a "jitter" or "echo" that occurs when the same source window is activated at multiple time instants in a row. To show an isolated example of this, here's what we get when we do nonnegative matrix factorization on "When Doves Cry" and activate only the first component for 20 frames.

The window by itself sounds like this

And the repeated activations of it from the code above sound like this

To avoid this, we want to avoid having the same activations within some window of each other.

Your Task

Suppress all values in H that are not the maximum of the 2r+1 values in a window centered on them. In particular, after the KL iteration, do the following update to H:

\[ H[i, j] = \left\{ \begin{array}{cc} H[i, j] & H[i, j] == \max(H[i, j-r:j+r+1]) \\ \left(1 - \frac{n}{L} \right) H[i, j] & \text{otherwise} \end{array} \right\} \]

where n is the iteration number and L is the total number of iterations, and r is half of the length of a horizontal window in which to look for a max for every element of H. In other words, as time goes on, make the horizontal maxes stand out more and more, and by the end, the surrounding elements should be 0. If this works properly, you should hear the following for r=3:

And here's an example with r = 7:


You might want to make use of the method scipy.ndimage.maximum_filter (formerly scipy.ndimage.filters.maximum_filter, a namespace that is now deprecated), which will return a matrix where every pixel is replaced by its max in some horizontal window. We've used this trick before with beepy tunes and superflux.
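Using that filter, the suppression step could be sketched as follows. The function name and argument names are my own; the key idea is that maximum_filter with size=(1, 2*r+1) computes the horizontal windowed max for every element at once, so no explicit loops are needed.

```python
import numpy as np
import scipy.ndimage

def suppress_repeats(H, r, n, L):
    """After KL iteration n out of L total, keep only the entries of H
    that are the max of the 2r+1 horizontal values centered on them;
    shrink everything else by a factor of (1 - n/L)."""
    maxes = scipy.ndimage.maximum_filter(H, size=(1, 2*r + 1))
    shrunk = (1 - n / L) * H
    return np.where(H == maxes, H, shrunk)
```

Called once per iteration with increasing n, this makes the horizontal maxima stand out more each time, and by the final iteration the surrounding elements have been driven to 0.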

Step 4: Restricting Simultaneous Activations (7 Points)

In addition to constraints that we put in the rows of H, we can also put a sparsity constraint on the columns. This is because we want to limit the number of possible sound grains that are taken from the source at any point in time. If we take too many sounds at once, then they may mix together to form a new timbre that is different from the original timbre of the sources (perhaps too many bees together really do sound like a piano).

Your Task

To follow conventions in the paper, let's say that we want at most p simultaneous activations at any point in time. Check whether each element is among the p largest elements in its column of H. If it is not, then shrink it by a factor of (1-n/L), just as with the repeated activations. Add code that does this directly after the repeated activations code.
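One way this column-sparsity step might look is sketched below. The function name is my own; np.partition is used so that finding each column's p-th largest value takes linear time rather than requiring a full sort.

```python
import numpy as np

def restrict_simultaneous(H, p, n, L):
    """Keep each column's p largest entries intact; shrink all other
    entries by a factor of (1 - n/L), as in the repeated-activation step."""
    # np.partition moves each column's p largest values into the last
    # p slots, so row -p holds the p-th largest value per column
    thresh = np.partition(H, -p, axis=0)[-p, :]
    shrunk = (1 - n / L) * H
    return np.where(H >= thresh[np.newaxis, :], H, shrunk)
```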

Here's an example where we allow 10 simultaneous activations

Here's an example where we allow only 3 simultaneous activations

Step 5: Diagonal Enhancement (12 Points)

One last observation we make is that the window lengths are quite short relative to the length of natural sounds that can be found in the sources. For example, at a sample rate of 22050 and for a window length of 2048, each window only captures about 100 milliseconds of audio. As we saw in the digital instruments assignment, the attack/sustain/decay/release can take longer than that to fully evolve a timbre, so we might like to encourage the algorithm to choose longer chunks from the source.

Since the sound templates in W were obtained from the windowed spectrogram of source audio, adjacent columns of W store spectrogram magnitudes of adjacent windows from the source audio. This means that adjacent rows in H store activations of time-adjacent source elements. Therefore, we can encourage the algorithm to pick contiguous sequences of windows, and hence longer sounds from the source, by enhancing diagonal lines in the H matrix according to the following equation, where c refers to half of the length of the window in which diagonal elements are summed

\[ H[i, j] = \sum_{k = -c}^{c} H[i+k, j+k] \]

Your Task

Implement the above equation in your code after restricting simultaneous activations. Be careful not to go out of bounds.

NOTE: For full credit on this, you should avoid a triply-nested loop, which takes time proportional to the size of H times 2c+1. Instead, your method should take time proportional to the number of elements of H only. You can accomplish this by using cumulative sums. You may want to refer back to how we did this to add up zero crossings in a window.
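One possible cumulative-sum approach is sketched below: walk over each diagonal of H once, and compute every windowed sum along that diagonal from a single cumsum, truncating the window at the boundaries. Each element of H is visited a constant number of times, so the total work is proportional to the number of elements of H. The function name is my own.

```python
import numpy as np

def enhance_diagonals(H, c):
    """Replace each H[i, j] with the sum of H[i+k, j+k] for k = -c..c,
    truncated at the matrix boundaries, via per-diagonal cumulative sums."""
    M, N = H.shape
    H2 = np.zeros_like(H)
    for d in range(-(M - 1), N):             # one pass per diagonal
        diag = np.diagonal(H, offset=d)      # elements H[i, i+d]
        n = len(diag)
        csum = np.concatenate(([0.0], np.cumsum(diag)))
        k = np.arange(n)
        # windowed sum over [k-c, k+c], clipped to the diagonal's ends
        sums = csum[np.minimum(k + c + 1, n)] - csum[np.maximum(k - c, 0)]
        i0 = -d if d < 0 else 0              # row where this diagonal starts
        j0 = d if d > 0 else 0               # column where this diagonal starts
        H2[i0 + k, j0 + k] = sums
    return H2
```

Writing the result into a fresh matrix H2 (rather than updating H in place) keeps each windowed sum based on the original values.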

Below are a few examples

Musical Statement (5 Points)

Now that you've created the musaicing system, use it to create your own novel compositions! Come up with some sound sources and a target, and go to town. Be sure to tweak the parameters as necessary to get the best quality sounds. I can't wait to hear what you come up with!