Week 9: Mel-Frequency Cepstral Coefficients (MFCC)

Chris Tralie

Today we're going to be talking about a classical timbre feature known as the Mel-Frequency Cepstral Coefficient (MFCC) transform. This feature was originally used to compress speech for analysis, but it has seen wide use in music information retrieval. As we will see, it basically throws out most information of absolute pitch, but retains information about timbre.

Part 1: Inverting Mel-Spaced Spectrograms

The first step of the MFCC is to pass from an ordinary spectrogram representation to a mel-space spectrogram. As we discussed, we can accomplish this by multiplying a spectrogram on the left by a matrix M in which each row is a different triangle centered on a different band. This matrix M is referred to as a Mel Filterbank. Below is an image of a mel filterbank

And below is an image of the corresponding multiplication:

As we discussed in module 16, though, we are losing information when we do this. It is often informative to "sonify" what information was retained so we can listen to what the computer hears at this point.

Your Task

Click here to download the starter code for today. It contains an implementation of a mel-spaced filterbank, as well as code to do an STFT. You can edit the notebook MFCC.ipynb.

Below are the steps to perform and sonify the mel spectrogram

  1. Create a magnitude mel-spaced spectrogram MS by multiplying on the left by a matrix M
  2. Multiply MS on the left by the transpose of M; that is, M where the rows are switched for the columns. You can obtain this with M.T in numpy. This will yield a matrix with the original shape of the STFT, which is the closest approximation we can get to the STFT after we've applied the mel filterbank. We'll check in as a class after this step
  3. Apply random phases to this new spectrogram
  4. Mirror this new spectrogram and perform the inverse STFT, then listen to the result.

Part 2: Mel-Frequency Cepstral Coefficients

There is another set of steps we do to to the mel-spectrogram to reduce the dimension of the features even more, while still retaining important information. The steps culminate in something called the Mel-Frequency Cepstral Coefficients (MFCCs). In this section, you'll fill in the method mfcc in mfcc.py to accomplish this. The steps are below:

  1. First, compute the spectrogram and convert to a mel-spaced spectrogram using n_bands bands and frequencies between 1hz and f_max hz. Typically, n_bands = 40 and f_max = 8000
  2. Take the log amplitude of the MFCC coefficients. You can use amplitude_to_db to help
  3. Take the "Discrete Cosine Transform Type 3" (DCT-III) of each column of this log amplitude mel-spaced spectrogram. This can be accomplished by multiplying by the left on a matrix like this, which you get from calling the provided get_dct_basis method:

    Doing this multiplication is like taking a truncated Discrete Fourier Transform of each row of the log amplitude, mel-spaced spectrogram. The DCT-III is just a slightly different "basis" for frequencies than the DFT, and it's a purely real basis (no phase).

If this is all working properly, the result on doves.wav should be this:

And here's what it looks like if we zero out the first row

Genre Identification