Mel Filterbanks And Mel Spectrograms

Chris Tralie

In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import librosa
from MelFeaturesSol import *

Let's extract the spectrogram of this timeless tune by Jimi Hendrix.

In [2]:
x, sr = librosa.load("jimi.wav")
ipd.Audio(x, rate=sr)
Out[2]:

Drawbacks of the spectrogram

  1. Pitch is perceived logarithmically in frequency. Equivalently, frequency grows exponentially with pitch: a note $p$ halfsteps above the A at 440 Hz has frequency $f = 440 \cdot 2^{p/12}$

  2. Loudness is perceived logarithmically in intensity

  3. The spectrogram has a lot of frequency bins, probably more than we need or than are useful for certain tasks. So we also want to do a "dimension reduction" or "lossy compression" of the spectrogram that hopefully retains the important musical aspects
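To make points 1 and 2 concrete, here is a small numeric sketch. The variable names are only for illustration; the decibel line uses the standard amplitude convention of $20 \log_{10}$ and is not taken from the solution module.

```python
import numpy as np

# Halfsteps relative to the A at 440 Hz: one octave down, the A itself,
# one octave up, two octaves up
p = np.array([-12, 0, 12, 24])
f = 440 * 2.0 ** (p / 12)  # 220, 440, 880, 1760 Hz

# Equal steps in pitch are equal *ratios* in frequency, so the gap in Hz
# between consecutive octaves keeps doubling
gaps_hz = np.diff(f)  # 220, 440, 880 Hz

# Decibels are a log scale: every factor of 10 in amplitude adds 20 dB
amp = np.array([1.0, 10.0, 100.0])
db = 20 * np.log10(amp)  # 0, 20, 40 dB
```

A linearly spaced frequency axis therefore spends most of its bins on the top few octaves, which is part of what the drawbacks above are getting at.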

In [3]:
hop_length = 512
win_length = 2048
S = librosa.stft(x, hop_length=hop_length, win_length = win_length)
S = np.abs(S)
Sdb = librosa.amplitude_to_db(S,ref=np.max)
plt.imshow(Sdb, aspect='auto')
plt.gca().invert_yaxis()

To address points 1 and 3, we're going to create something called the "mel filterbank," which is a way of blurring a bunch of frequencies together in wider and wider ranges as the center frequency of the bin goes up. This filterbank consists of a bunch of triangles. Under each triangle, we sum the squared amplitudes of the corresponding spectrogram frequencies, scaled by the triangle.
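As a rough sketch of the idea, the function below builds a matrix with one triangle per row. It is a hypothetical stand-in, not the code in `MelFeaturesSol`: the name `triangular_filterbank_sketch`, the argument names, and the choice to space the triangle corners uniformly in log frequency are all assumptions (true mel-scale formulas differ somewhat), and the sample rate of 22050 is assumed to be the `librosa.load` default.

```python
import numpy as np

def triangular_filterbank_sketch(K, win_length, sr, min_freq, max_freq, n_bins):
    """
    Hypothetical illustration only; not the MelFeaturesSol implementation.

    K: number of spectrogram frequency bins (win_length//2 + 1)
    Returns an n_bins x K matrix with one triangle per row.
    """
    # Assumption: corners are spaced uniformly in log frequency, so each
    # triangle covers a wider range of Hz than the one before it
    corners_hz = np.geomspace(min_freq, max_freq, n_bins + 2)
    # STFT bin k sits at k*sr/win_length Hz, so convert Hz to fractional bins
    corners = corners_hz * win_length / sr
    k = np.arange(K)
    M = np.zeros((n_bins, K))
    for i in range(n_bins):
        left, center, right = corners[i], corners[i + 1], corners[i + 2]
        rising = (k - left) / (center - left)      # 0 at left corner, 1 at center
        falling = (right - k) / (right - center)   # 1 at center, 0 at right corner
        # The triangle is the lower of the two ramps, floored at zero
        M[i, :] = np.clip(np.minimum(rising, falling), 0, None)
    return M

# Assumed settings: 1025 bins, window 2048, sr 22050, 80 Hz to 8000 Hz, 100 triangles
M_sketch = triangular_filterbank_sketch(1025, 2048, 22050, 80, 8000, 100)
```

One caveat of this simple version: with these settings the lowest triangles are narrower than one STFT bin, so some low-frequency rows can come out all zero. A more careful implementation would need to handle that case.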

In [4]:
from MelFeaturesSol import *
M = get_mel_spectrogram(S.shape[0], win_length, sr, 80, 8000, 100)
print(M.shape)
plt.figure(figsize=(10, 4))
plt.plot(M.T)
plt.xlabel("Frequency Bin")
#plt.gca().set_xscale('log')
(100, 1025)
Out[4]:
Text(0.5, 0, 'Frequency Bin')

Here's what this filterbank looks like as a matrix

In [5]:
plt.imshow(M, aspect='auto')
Out[5]:
<matplotlib.image.AxesImage at 0x7f7a9ac76b10>

The matrix perspective allows us to apply the filterbank to the spectrogram very efficiently, because we simply multiply on the left by the matrix. The dimensions work out here: the mel filterbank is a 100x1025 matrix and the spectrogram is a 1025x431 matrix, so their product is a 100x431 matrix. Each row of the result comes from taking the dot product of a single triangle with every column of the squared spectrogram. Since each triangle is centered on a different frequency, each row of the mel spectrogram tracks how the energy in one frequency range changes over time.
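The shape bookkeeping can be checked with random stand-in matrices of the same sizes. The names `M_demo` and `P_demo` are made up for this sketch and hold random numbers, not the real filterbank or spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)
M_demo = rng.random((100, 1025))  # stand-in for the filterbank (one triangle per row)
P_demo = rng.random((1025, 431))  # stand-in for the squared spectrogram (one window per column)

# (100 x 1025) times (1025 x 431) gives (100 x 431)
out = M_demo.dot(P_demo)

# Entry (i, j) is the dot product of triangle i with spectrogram column j
i, j = 3, 7
manual = np.sum(M_demo[i, :] * P_demo[:, j])
matches = np.isclose(out[i, j], manual)
```

The single matrix product replaces a double loop over triangles and time windows, which is where the efficiency comes from.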

In [6]:
mel_specgram = np.log10(M.dot(S**2))
plt.figure(figsize=(12, 4))
plt.subplot(131)
plt.imshow(M, aspect='auto')
plt.title("Mel filterbank ({} x {})".format(M.shape[0], M.shape[1]))
plt.gca().invert_yaxis()
plt.subplot(132)
plt.imshow(Sdb, aspect='auto')
plt.gca().invert_yaxis()
plt.title("Spectrogram in dB ({} x {})".format(Sdb.shape[0], Sdb.shape[1]))
plt.subplot(133)
plt.imshow(mel_specgram, aspect='auto')
plt.gca().invert_yaxis()
plt.title("Mel Spectrogram ({} x {})".format(mel_specgram.shape[0], mel_specgram.shape[1]))
Out[6]:
Text(0.5, 1.0, 'Mel Spectrogram (100 x 431)')