Week 10: Nonnegative Matrix Factorization (NMF)

Chris Tralie

Click here for code that goes with these notes

Nonnegative Matrix Factorization Theory

The goal of nonnegative matrix factorization (NMF) is to approximate a matrix V as the product of two other matrices W and H; that is, to get as close as possible to the matrix equation V = WH being true. Solving this problem will allow us to learn spectrogram templates in W and activations in H whose product WH approximates a spectrogram V.

The question itself, though, should already seem a little weird, because even if W, H, and V are 1x1 matrices (i.e. ordinary nonnegative numbers), there can be infinitely many solutions. For instance, if I ask you how you could multiply two numbers together to get the number 12, you could do the following:

  • 1 x 12 = 12
  • 2 x 6 = 12
  • 3 x 4 = 12
  • 2.4 x 5 = 12
  • ...

But surprisingly, when we do this with matrices, sometimes just finding one factorization will be quite useful. Crucially, though, we assume that all of the entries of W, H, and V are nonnegative. This is perfect for magnitude spectrograms, since they are nonnegative! Furthermore, this constraint allows us to devise some simple iterative algorithms to solve NMF numerically in practice. There is some great theory in the original NMF paper by Lee and Seung to show that these algorithms converge and are related to something else we'll be talking about when we get to neural nets: gradient descent.

There is also other theory that characterizes probability distributions over the entries of the matrices we expect to get based on the different algorithms we choose (click here to see a paper about this). This is how we can characterize the solutions that we're more likely to get over a possibly infinite set of solutions.


Formulation 1: Euclidean NMF

We seek a W and an H so that we minimize the loss function (or what you may have called the "objective" function in Math 111)

\[ L(W, H) = \sum_{i, j} (V[i, j]-(WH)[i, j])^2 \]
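Concretely, if V, W, and H are stored as NumPy arrays, this loss is just the sum of the squared entries of V - WH. Here is a minimal sketch (the helper name euclidean_loss is illustrative, not taken from the linked course code):

```python
import numpy as np

def euclidean_loss(V, W, H):
    """Sum of squared differences between the entries of V and of W @ H."""
    diff = V - W @ H          # elementwise differences V[i, j] - WH[i, j]
    return np.sum(diff ** 2)  # sum over all entries (i, j)
```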

  1. First, randomly initialize W and H with nonnegative numbers
  2. If V is an MxN matrix and we choose a number K, which is the "rank" of our approximate factorization, then:
    • W is an MxK matrix
    • H is a KxN matrix
    • W^T is called the "transpose" of W; it's simply switching the rows for the columns. It is a KxM matrix
    • H^T would be an NxK matrix

To solve this, you'll follow the multiplicative update rules below in sequence; that is, alternate between the first step and the second step in a loop.

  • \[ H[i, j] = H[i, j] \frac{ (W^TV)[i, j] }{(W^TWH)[i, j]} \]

  • \[ W[i, j] = W[i, j] \frac{ (VH^T)[i, j] }{(WHH^T)[i, j]} \]

  • The original paper that presented this algorithm proved that the loss function only decreases at every iteration.

Below is code that implements this
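(Shown here as a minimal NumPy sketch; the function name nmf_euclidean, the fixed iteration count, and the small constant eps added to the denominators to avoid division by zero are illustrative choices rather than details of the linked course code.)

```python
import numpy as np

def nmf_euclidean(V, K, n_iters=100, eps=1e-10):
    """Approximate the nonnegative M x N matrix V as W @ H, with W (M x K)
    and H (K x N), using the multiplicative updates for the Euclidean loss."""
    M, N = V.shape
    W = np.random.rand(M, K)  # step 1: random nonnegative initialization
    H = np.random.rand(K, N)
    for _ in range(n_iters):
        # H[i, j] <- H[i, j] * (W^T V)[i, j] / (W^T W H)[i, j]
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W[i, j] <- W[i, j] * (V H^T)[i, j] / (W H H^T)[i, j]
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For a magnitude spectrogram S, one would call something like W, H = nmf_euclidean(np.abs(S), K=3).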

And below is the result of using K=3 components with the above code on a 10 second clip from Prince's "When Doves Cry":

[Audio: components 0, 1, and 2, each with its filtered version]

Formulation 2: Kullback-Leibler Divergence NMF

There are loss functions we can use other than the sum of the squared differences of the entries of V and WH. One of them is called the "Kullback-Leibler" loss, and it's defined as follows

\[ \sum_{i, j} \left( V[i, j] \left( \log \frac{V[i, j]}{(WH)[i, j]} \right) - V[i, j] + (WH)[i, j] \right) \]
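On NumPy arrays, this loss might be computed as follows (a minimal sketch; the small constant eps guarding the logarithm and the division is an illustrative choice):

```python
import numpy as np

def kl_loss(V, W, H, eps=1e-10):
    """Kullback-Leibler loss between V and W @ H, summed over all entries."""
    WH = W @ H
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)
```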

Notice how, like the Euclidean loss, this loss is 0 if V = WH. As it turns out, this loss will decrease under the following update rules (the denominator in each case is a sum over a single column of W or a single row of H):

  • \[ V_L[i, j] = V[i, j] / (WH)[i, j] \]
  • \[ H[i, j] = H[i, j] \left( \frac{ (W^T V_L)[i, j] }{\sum_a W[a, i]} \right) \]
  • \[ V_L[i, j] = V[i, j] / (WH)[i, j] \]
  • \[ W[i, j] = W[i, j] \left( \frac{ (V_L H^T)[i, j] }{\sum_b H[j, b]} \right) \]

Below is code that implements this
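(Again shown as a minimal NumPy sketch under the same caveats as the Euclidean version above; the function name nmf_kl and its parameters are illustrative.)

```python
import numpy as np

def nmf_kl(V, K, n_iters=100, eps=1e-10):
    """Approximate the nonnegative M x N matrix V as W @ H, with W (M x K)
    and H (K x N), using the multiplicative updates for the KL loss."""
    M, N = V.shape
    W = np.random.rand(M, K)  # random nonnegative initialization
    H = np.random.rand(K, N)
    for _ in range(n_iters):
        VL = V / (W @ H + eps)  # V_L[i, j] = V[i, j] / (WH)[i, j]
        # H[i, j] <- H[i, j] * (W^T V_L)[i, j] / sum_a W[a, i]
        H *= (W.T @ VL) / (np.sum(W, axis=0)[:, None] + eps)
        VL = V / (W @ H + eps)  # recompute V_L with the updated H
        # W[i, j] <- W[i, j] * (V_L H^T)[i, j] / sum_b H[j, b]
        W *= (VL @ H.T) / (np.sum(H, axis=1)[None, :] + eps)
    return W, H
```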

Below are the audio results for K=3

[Audio: components 0, 1, and 2, each with its filtered version]

Below are animations of the process on spectrograms for different numbers of components:

K = 3

K = 5