Demo: Expectation-Maximization

This demo fits a two-component Gaussian mixture with fixed component variance. It is meant to make the hidden-label calculation visible rather than to be a full clustering package.

Mathematical setup

Let $Z_i\in{1,2}$ be an unobserved component label and

\[X_i\mid Z_i=1\sim N(\mu_1,1), \qquad X_i\mid Z_i=2\sim N(\mu_2,1), \qquad \Pr(Z_i=1)=\pi.\]

For current parameters $\theta^{(t)}=(\pi^{(t)},\mu_1^{(t)},\mu_2^{(t)})$, the E-step computes responsibilities

\[r_i^{(t)} = \Pr(Z_i=1\mid X_i,\theta^{(t)}) = \frac{\pi^{(t)}\phi(X_i;\mu_1^{(t)},1)} {\pi^{(t)}\phi(X_i;\mu_1^{(t)},1)+(1-\pi^{(t)})\phi(X_i;\mu_2^{(t)},1)}.\]

The M-step updates $\pi$, $\mu_1$, and $\mu_2$ by weighted averages. The observed-data log likelihood should not decrease, apart from numerical rounding.

What to try

  • Start with well-separated components. Responsibilities should be close to 0 or 1 for most observations after a few iterations.
  • Reduce the true separation. The soft assignments become less certain, and the fitted means can move more slowly.
  • Change the seed. EM depends on the sample and can also be sensitive to initialization in more general mixture problems.

EM alternates responsibilities $P(Z_i=1\mid X_i,\theta^{(t)})$ with weighted maximum-likelihood updates. The observed-data likelihood should be nondecreasing, up to numerical rounding.

Try it in Python

This compact EM loop reproduces the hidden-label responsibilities and nondecreasing observed-data log likelihood.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sep = 2.4
n = 80
iterations = 8
rng = np.random.default_rng(531)

true_pi = 0.45
z = rng.binomial(1, true_pi, size=n)
mu_true = np.where(z == 1, -sep / 2, sep / 2)
x = rng.normal(mu_true, 1.0)

pi = 0.5
mu1, mu2 = -0.5, 0.5
history = []

for step in range(iterations + 1):
    dens1 = pi * stats.norm.pdf(x, mu1, 1.0)
    dens2 = (1 - pi) * stats.norm.pdf(x, mu2, 1.0)
    r = dens1 / (dens1 + dens2)
    loglik = np.sum(np.log(dens1 + dens2))
    history.append((step, pi, mu1, mu2, loglik))

    if step == iterations:
        break

    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu2 = np.sum((1 - r) * x) / np.sum(1 - r)

for step, pi, mu1, mu2, loglik in history:
    print(f"{step:2d}: pi={pi:.3f}, mu1={mu1:.3f}, mu2={mu2:.3f}, loglik={loglik:.2f}")

plt.plot([row[0] for row in history], [row[4] for row in history], marker="o")
plt.xlabel("EM iteration")
plt.ylabel("observed-data log likelihood")
plt.show()

Back to topic notes