Demo: Expectation-Maximization
This demo fits a two-component Gaussian mixture with fixed component variance. It is meant to make the hidden-label calculation visible rather than to be a full clustering package.
Mathematical setup
Let $Z_i\in{1,2}$ be an unobserved component label and
\[X_i\mid Z_i=1\sim N(\mu_1,1), \qquad X_i\mid Z_i=2\sim N(\mu_2,1), \qquad \Pr(Z_i=1)=\pi.\]For current parameters $\theta^{(t)}=(\pi^{(t)},\mu_1^{(t)},\mu_2^{(t)})$, the E-step computes responsibilities
\[r_i^{(t)} = \Pr(Z_i=1\mid X_i,\theta^{(t)}) = \frac{\pi^{(t)}\phi(X_i;\mu_1^{(t)},1)} {\pi^{(t)}\phi(X_i;\mu_1^{(t)},1)+(1-\pi^{(t)})\phi(X_i;\mu_2^{(t)},1)}.\]The M-step updates $\pi$, $\mu_1$, and $\mu_2$ by weighted averages. The observed-data log likelihood should not decrease, apart from numerical rounding.
What to try
- Start with well-separated components. Responsibilities should be close to 0 or 1 for most observations after a few iterations.
- Reduce the true separation. The soft assignments become less certain, and the fitted means can move more slowly.
- Change the seed. EM depends on the sample and can also be sensitive to initialization in more general mixture problems.
EM alternates responsibilities $P(Z_i=1\mid X_i,\theta^{(t)})$ with weighted maximum-likelihood updates. The observed-data likelihood should be nondecreasing, up to numerical rounding.
Try it in Python
This compact EM loop reproduces the hidden-label responsibilities and nondecreasing observed-data log likelihood.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
sep = 2.4
n = 80
iterations = 8
rng = np.random.default_rng(531)
true_pi = 0.45
z = rng.binomial(1, true_pi, size=n)
mu_true = np.where(z == 1, -sep / 2, sep / 2)
x = rng.normal(mu_true, 1.0)
pi = 0.5
mu1, mu2 = -0.5, 0.5
history = []
for step in range(iterations + 1):
dens1 = pi * stats.norm.pdf(x, mu1, 1.0)
dens2 = (1 - pi) * stats.norm.pdf(x, mu2, 1.0)
r = dens1 / (dens1 + dens2)
loglik = np.sum(np.log(dens1 + dens2))
history.append((step, pi, mu1, mu2, loglik))
if step == iterations:
break
pi = r.mean()
mu1 = np.sum(r * x) / np.sum(r)
mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
for step, pi, mu1, mu2, loglik in history:
print(f"{step:2d}: pi={pi:.3f}, mu1={mu1:.3f}, mu2={mu2:.3f}, loglik={loglik:.2f}")
plt.plot([row[0] for row in history], [row[4] for row in history], marker="o")
plt.xlabel("EM iteration")
plt.ylabel("observed-data log likelihood")
plt.show()
