Expectation Maximization

December 5, 2018

The EM algorithm proceeds by alternating between taking the expectation of a function $Q$ related to the log-likelihood and maximizing that same function, in order to get converging estimates $\theta^{t}$ of some parameter $\theta$. The motivating problem is computing the maximum likelihood estimate for some parameter $\theta$ based on some observed data $\left(x_1, x_2, …, x_i, …, x_N\right)$ and some latent variables $\left(z_1, z_2, …, z_i, …, z_N\right)$.

To compute the MLE you maximize the “incomplete data” log-likelihood

\[l \left( \theta | \mathbf{x} \right) = \log \left( \prod_{i=1}^N p\left(x_i | \theta \right)\right) =\sum_{i=1}^N \log \left( p\left(x_i | \theta \right)\right) = \sum_{i=1}^N \log \sum_{z_i}p\left( x_i, z_i | \theta \right)\]

In general you can’t do this because you can’t “distribute” log across the sum over $z_i$. Instead maximize the “complete data” log-likelihood:

\[l_c \left( \theta | \mathbf{x} \right) = \sum_{i=1}^N \log \big[ p\left( x_i, z_i | \theta \right) \big] = \sum_{i=1}^N \log \big[ p\left( z_i | \theta \right) p\left( x_i | z_i , \theta \right) \big]\]

Too bad you can’t compute this either since the $z_i$ are unobserved.
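
To see the first obstruction concretely, here is a tiny numerical sketch (the data, mixture weights, means, and standard deviations below are made-up illustration values, not anything from the derivation) that evaluates the incomplete-data log-likelihood of a two-component Gaussian mixture. The sum over $z_i$ sits inside the log, so the log cannot be pushed through it term by term.

```python
# Minimal sketch: incomplete-data log-likelihood of a K=2 Gaussian mixture.
import numpy as np
from scipy.stats import norm

x = np.array([-1.2, 0.3, 2.5, 3.1])   # hypothetical observed data
pi_ = np.array([0.4, 0.6])            # p(z_i = k), made-up weights
mu = np.array([0.0, 3.0])             # made-up component means
sigma = np.array([1.0, 0.5])          # made-up component std devs

# p(x_i | theta) = sum_k p(z_i = k) * p(x_i | z_i = k, theta); the log is
# applied to this whole sum over k, which is why it doesn't distribute.
mixture_density = (pi_[None, :] * norm.pdf(x[:, None], mu[None, :], sigma[None, :])).sum(axis=1)
incomplete_loglik = np.log(mixture_density).sum()
print(incomplete_loglik)
```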

Insight

Fortunately you can condition on the data $x_i$ and the previous estimate of the parameters $\theta^{t-1}$ and then take expectations, i.e. take the expectation of $l_c \left( \theta \right)$ w.r.t. $p\left(z_i | x_i, \theta^{t-1}\right)$.

Take a moment to note the shift in paradigm here (at least for me): treat the likelihood as a random variable that’s a function of $z_i$ and marginalize out that dependence.

The auxiliary function $Q\left(\theta, \theta^{t-1}\right)$

This marginalization serves the purpose of “filling in” the missing $z_i$ values and producing a function (the auxiliary function $Q\left(\theta, \theta^{t-1}\right)$) that depends only on $\theta$, which you can then maximize.

\[\begin{eqnarray} Q\left( \theta, \theta^{t-1} \right) &=& E\big[l_c \left( \theta \right) | x_i, \theta^{t-1} \big] \nonumber \\ &=& E\Bigg[ \sum_{i=1}^N \log \big[ p\left( z_i | \theta \right) p\left( x_i | z_i , \theta \right) \big] \Bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ &=& \sum_{i=1}^N E\bigg[\log \big[ p\left( z_i | \theta \right) p\left( x_i | z_i , \theta \right) \big] \bigg| x_i, \theta^{t-1} \bigg] \nonumber \end{eqnarray}\]

Keep in mind what $E\left[ g\left(z_i\right) | x_i, \theta^{t-1} \right]$ means here:

\[E\big[ g\left( z_i \right) | x_i, \theta^{t-1} \big] = \sum_{k=1}^K g\left( k \right) p\left(z_i = k | x_i, \theta^{t-1}\right)\]

i.e. an expectation over the conditional distribution of the $i$th latent $z_i$ given observed $x_i$ and current parameter estimate $\theta^{t-1}$.
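
For example, with $K = 2$ this is just a two-term weighted sum:

\[E\big[ g\left( z_i \right) | x_i, \theta^{t-1} \big] = g\left(1\right) p\left(z_i = 1 | x_i, \theta^{t-1}\right) + g\left(2\right) p\left(z_i = 2 | x_i, \theta^{t-1}\right)\]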

Too bad you still can’t evaluate this expectation directly. Employ a trick (in the discrete case):

\[\begin{eqnarray} Q\left( \theta, \theta^{t-1} \right) &=& \sum_{i=1}^N E\bigg[\log \big[ p\left( z_i | \theta \right) p\left( x_i | z_i , \theta \right) \big] \bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ &=& \sum_{i=1}^N E\Bigg[\log \Bigg[ \prod_{k=1}^K \Bigg( p\left( z_i = k | \theta \right) p\left( x_i | z_i = k , \theta \right) \Bigg)^{I\left(z_i=k\right)} \Bigg] \Bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ &=& \sum_{i=1}^N E\Bigg[ \sum_{k=1}^K \log \Bigg[ \Bigg( p\left( z_i = k | \theta \right) p\left( x_i | z_i = k , \theta \right) \Bigg)^{I\left(z_i=k\right)} \Bigg] \Bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ &=& \sum_{i=1}^N E\Bigg[ \sum_{k=1}^K I\left(z_i=k\right)\log \big[ p\left( z_i =k | \theta \right) p\left( x_i | z_i =k , \theta \right) \big] \Bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ \end{eqnarray}\]

Now note that the only random variable in the final expression is $ I\left(z_i=k\right)$ because the probability factors in the log have $z_i = k$ fixed (the indicator “picked” the $k$). This is key to understanding the entire manipulation. Therefore

\[\begin{eqnarray} Q\left( \theta, \theta^{t-1} \right) &=& \sum_{i=1}^N E\Bigg[ \sum_{k=1}^K I\left(z_i=k\right)\log \big[ p\left( z_i =k | \theta \right) p\left( x_i | z_i =k , \theta \right) \big] \Bigg| x_i, \theta^{t-1} \Bigg] \nonumber \\ &=& \sum_{i=1}^N \sum_{k=1}^K E\big[ I\left(z_i=k\right) \big| x_i, \theta^{t-1} \big]\log \big[ p\left( z_i =k | \theta \right) p\left( x_i | z_i =k , \theta \right) \big] \label{eq:sample1} \\ \end{eqnarray}\]
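
Also worth noting for later: the conditional expectation of an indicator is just the conditional probability of its event,

\[E\big[ I\left(z_i=k\right) \big| x_i, \theta^{t-1} \big] = p\left( z_i = k \big| x_i, \theta^{t-1} \right)\]

which is what will make the expectation step tractable in the example below.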

The EM Algorithm

What’s the EM algorithm?

  1. Initialize parameters $\theta$ (somehow).

  2. Given the parameters $\theta^{t-1}$ from the previous iteration of the algorithm, evaluate $Q$ so that it’s only in terms of $\theta$. This is the expectation step and corresponds to evaluating the expectation in equation \eqref{eq:sample1} above.

  3. Maximize $Q$ as a function of $\theta$ to obtain $\theta^t$. Go back to step 2 unless a stopping criterion is met (e.g. $| \theta^t -\theta^{t-1}| < \delta$ for some $\delta$).
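
As a rough sketch of the loop structure (not a definitive implementation; `e_step` and `m_step` here are hypothetical model-specific callables, where the E step returns whatever expected sufficient statistics are needed to write $Q$ purely in terms of $\theta$, and the M step maximizes it):

```python
# Generic EM skeleton; theta is assumed to be a flat numpy array of parameters.
import numpy as np

def em(x, theta0, e_step, m_step, delta=1e-6, max_iter=500):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(x, theta)                                # E step: "fill in" the z_i
        theta_new = np.asarray(m_step(x, stats), dtype=float)   # M step: argmax of Q
        if np.max(np.abs(theta_new - theta)) < delta:           # stopping criterion
            return theta_new
        theta = theta_new
    return theta
```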

Gaussian Mixture Model

Suppose our model is hierarchical with

\[\begin{eqnarray} p\left( x_i | \theta\right) &=& \sum_{k=1}^K p\left( z_i = k\right) p\left(x_i | z_i = k, \mu_k, \sigma_k \right) \nonumber \\ x_i | z_i = k &\sim& N \left( \mu_k, \sigma_k \right) \nonumber\\ z_i &\sim& \mbox{Categorical}\left( \mathbf{\pi} \right) \nonumber \end{eqnarray}\]

So $z_i$ picks a Normal $k$ with probability $\pi_k$ and then $x_i$ is drawn from $N \left( \mu_k, \sigma_k \right)$. We can fit this model to data using EM.
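
A small simulation sketch of this generative story (the weights, means, and standard deviations are made-up values for illustration):

```python
# Sample z_i ~ Categorical(pi), then x_i | z_i ~ N(mu_{z_i}, sigma_{z_i}).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
pi_ = np.array([0.3, 0.7])      # made-up mixing weights
mu = np.array([-2.0, 1.5])      # made-up component means
sigma = np.array([0.8, 1.2])    # made-up component std devs

z = rng.choice(len(pi_), size=N, p=pi_)   # z_i picks a component
x = rng.normal(mu[z], sigma[z])           # x_i drawn from the chosen Normal
```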

The expectation step requires evaluating $E\big[ I\left(z_i=k\right) \big | x_i, \theta^{t-1} \big]$ in \eqref{eq:sample1}.

\[\begin{eqnarray} E\big[ I\left(z_i=k\right) \big | x_i, \theta^{t-1} \big] &=& p\left( z_i=k \big| x_i, \theta^{t-1} \right) \nonumber\\ &=& \frac{p\left( x_i \big| z_i=k, \theta^{t-1} \right) \cdot p\left( z_i=k \big| \theta^{t-1} \right)}{\sum_{j=1}^K p\left( x_i \big| z_i=j, \theta^{t-1} \right) \cdot p\left( z_i=j \big| \theta^{t-1} \right)} \nonumber \\ &=& \frac{ N\left( x_i | \mu_k^{t-1}, \sigma_k^{t-1}\right) \cdot \pi_k^{t-1}}{\sum_{j=1}^K N\left( x_i | \mu_j^{t-1}, \sigma_j^{t-1}\right) \cdot \pi_j^{t-1}} \nonumber \\ \end{eqnarray}\]

where we’ve used Bayes’ rule and where $N\left( x_i | \mu_k, \sigma_k\right)$ is shorthand for the pdf of a Normal with mean $\mu_k$ and variance $\left(\sigma_k\right)^2$ evaluated at $x_i$. We see that $E\big[ I\left(z_i=k\right) \big | x_i, \theta^{t-1} \big]$ is only a function of the previous estimates $\mu_k^{t-1}, \sigma_k^{t-1}, \pi_k^{t-1}$, and so this completes the E step. Define $r_{ik}$ to be this quantity (the “responsibility” of Normal $k$ for datum $x_i$), which is fixed within each iteration.
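
In code, the E step is just this Bayes computation carried out for every $(i, k)$ pair; a sketch (the function name and the array layout, with rows indexing data and columns indexing components, are my own choices):

```python
# E step sketch: r[i, k] = p(z_i = k | x_i, theta^{t-1}) for a 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi_, mu, sigma):
    # Unnormalized posterior: pi_k * N(x_i | mu_k, sigma_k)
    weighted = pi_[None, :] * norm.pdf(x[:, None], mu[None, :], sigma[None, :])
    # Normalize over k so each row sums to 1
    return weighted / weighted.sum(axis=1, keepdims=True)
```

With the $r_{ik}$ in hand, proceed with simplifying $Q$: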

\[\begin{eqnarray} Q\left( \theta, \theta^{t-1} \right) &=& \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \big[ p\left( z_i =k | \theta \right) p\left( x_i | z_i =k , \theta \right) \big] \nonumber \\ &=& \sum_{i=1}^N \sum_{k=1}^K r_{ik} \Bigg( \log \big[ p\left( z_i =k | \theta \right) \big] + \log \big[ p\left( x_i | z_i =k , \theta \right) \big] \Bigg) \nonumber \\ &=& \sum_{i=1}^N \sum_{k=1}^K r_{ik} \Bigg( \log \big[ \pi_k \big] + \log \Bigg[\frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp \bigg(- \frac{\left( x_i - \mu_k \right) ^2}{2\sigma_k^2} \bigg) \Bigg] \Bigg) \nonumber \\ \end{eqnarray}\]

Maximizing this can be done in two separate steps since the two terms in the inner sum involve disjoint sets of parameters ($\pi_k$ and $\mu_k, \sigma_k$ respectively). Computing an expression for the $\pi_k$s can be done using a Lagrange multiplier $\lambda$ in order to enforce the constraint that $\sum_k \pi_k = 1$ (since $\mathbf{\pi}$ parameterizes a Categorical distribution):

\[\frac{\partial}{\partial\pi_{k'}} \left[ \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \left[ \pi_k \right] + \lambda \left( \sum_{k=1}^K \pi_k -1 \right) \right] = 0\]

($k’$ indicates we’re taking the partial derivative with respect to one particular component of $\mathbf{\pi} = \left( \pi_1, \pi_2, …, \pi_K\right)$, i.e. $\pi_{k’}$, not to be confused with the sum index $k$)

\[\frac{1}{\pi_{k'}} \sum_{i=1}^N r_{ik'} + \lambda = 0\]

or

\[\begin{equation} - \sum_{i=1}^N r_{ik'} = \pi_{k'} \lambda \label{eq:sample2} \end{equation}\]

Here it’s useful to remember that $r_{ik’} = p\left( z_i=k’ | x_i, \theta^{t-1} \right)$ and so $\sum_{k’} r_{ik’} = 1$. Therefore, summing both sides of \eqref{eq:sample2} over $k’$ (and using $\sum_{k’} \pi_{k’} = 1$) gives $\lambda = -N$, resulting in:

\[\pi_k = \frac{1}{N} \sum_{i=1}^N r_{ik}\]

which just says that $\pi_k$ is the average responsibility of Normal $k$ across all the data points.
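
Given the responsibility matrix from the E-step sketch above, this update is one line (a sketch; the helper name is mine):

```python
def update_pi(r):
    # pi_k = (1/N) * sum_i r_{ik}; r has shape (N, K)
    return r.mean(axis=0)
```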

Computing the expressions for $\mu_k, \sigma_k$ is more annoying but follows the same procedure as standard MLE for Normals:

\[\sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \left[ \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp \left(- \frac{\left( x_i - \mu_k \right) ^2}{2\sigma_k^2} \right) \right] = \sum_{i=1}^N \sum_{k=1}^K \frac{r_{ik}}{2} \left( - \log \left[ 2 \pi \sigma_k^2 \right] - \frac{\left( x_i - \mu_k \right) ^2}{\sigma_k^2} \right)\]

Maximizing with respect to $\mu_{k’}$ we get

\[\mu_k = \frac{ \sum_{i=1}^N r_{ik} x_i}{ \sum_{i=1}^N r_{ik} }\]

Maximizing with respect to $\sigma_{k’}$ we get

\[\sigma_k^2 = \frac{ \sum_{i=1}^N r_{ik} \left(x_i - \mu_k\right)^2}{ \sum_{i=1}^N r_{ik} }\]

(differentiate, set the derivative to zero, and solve, just as in the conventional MLE for a Normal).
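
Putting the three updates together, here is a sketch of the full M step and fitting loop for the 1-D mixture (it reuses the hypothetical `responsibilities` helper from the E-step sketch and the stopping criterion from the algorithm summary; none of this is a definitive implementation):

```python
import numpy as np

def m_step(x, r):
    # Closed-form M step: weighted MLEs with weights r_{ik}.
    nk = r.sum(axis=0)                                             # effective counts per component
    pi_ = nk / len(x)                                              # pi_k = (1/N) sum_i r_{ik}
    mu = (r * x[:, None]).sum(axis=0) / nk                         # weighted means
    var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk   # weighted variances
    return pi_, mu, np.sqrt(var)

def fit_gmm(x, pi_, mu, sigma, delta=1e-6, max_iter=500):
    for _ in range(max_iter):
        r = responsibilities(x, pi_, mu, sigma)                    # E step
        pi_new, mu_new, sigma_new = m_step(x, r)                   # M step
        if max(np.max(np.abs(pi_new - pi_)),
               np.max(np.abs(mu_new - mu)),
               np.max(np.abs(sigma_new - sigma))) < delta:
            return pi_new, mu_new, sigma_new
        pi_, mu, sigma = pi_new, mu_new, sigma_new
    return pi_, mu, sigma
```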