MLE applied to Gaussian mixtures#
Finding the maximum likelihood estimator becomes trickier when our data comes from a complex function of many parameters. A particular case of a reasonably complex likelihood that can still be maximized using a relatively straightforward numerical method is a mixture of Gaussians. As the name suggests, these models assume all the data points are generated from a combination of a finite number of normal distributions with unknown parameters. First, we’ll define the model and then discuss how to go about maximizing the likelihood.
Gaussian mixture model#
For a mixture of $M$ Gaussian distributions, the probability density of a single measurement $x_i$ is

$$p(x_i) = \sum_{j=1}^{M} \alpha_j \, \mathcal{N}(x_i \,|\, \mu_j, \sigma_j),$$

where the normalization factors (class weights) $\alpha_j$ satisfy $\sum_{j=1}^{M} \alpha_j = 1$. Note that the data consist only of the measured values $x_i$; which component generated each point (its class label) is not observed.

Again, we can work with the natural log of the likelihood,

$$\ln L = \sum_{i=1}^{N} \ln \left[ \sum_{j=1}^{M} \alpha_j \, \mathcal{N}(x_i \,|\, \mu_j, \sigma_j) \right].$$

To solve this, we would need to take the partial derivatives of $\ln L$ with respect to every parameter ($\mu_j$, $\sigma_j$, and $\alpha_j$) and set them to zero, which would result in a complex system of coupled equations with no analytic solution, because the sum over components sits inside the logarithm. The expectation maximization (EM) algorithm provides a practical iterative way to maximize this likelihood.
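To make the notation concrete, here is a minimal NumPy sketch (not part of the original derivation; the function and argument names are illustrative) that evaluates $\ln L$ for a one-dimensional Gaussian mixture given trial parameters.

import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, mu, sigma, alpha):
    """ln L = sum_i ln[ sum_j alpha_j * N(x_i | mu_j, sigma_j) ]."""
    x, alpha = np.asarray(x), np.asarray(alpha)
    # shape (N, M): density of each data point under each component
    comp = norm.pdf(x[:, None], loc=mu, scale=sigma)
    return np.sum(np.log(comp @ alpha))

# illustrative call: two components with weights 0.3 and 0.7
print(gmm_log_likelihood([-1.2, 0.3, 2.5],
                         mu=[0.0, 2.0], sigma=[1.0, 0.5], alpha=[0.3, 0.7]))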
The basics of the expectation maximization algorithm#
Interpreting the weights $\alpha_j$ as class probabilities, Bayes' rule gives the probability that point $x_i$ came from class $j$,

$$w_{ij} = \frac{\alpha_j \, \mathcal{N}(x_i \,|\, \mu_j, \sigma_j)}{\sum_{k=1}^{M} \alpha_k \, \mathcal{N}(x_i \,|\, \mu_k, \sigma_k)}.$$

The EM algorithm is not limited to Gaussian mixtures, so instead of the Gaussian $\mathcal{N}(x_i \,|\, \mu_j, \sigma_j)$ one could use any parametrized component density; here we stay with the Gaussian case.

Next, we can take a partial derivative of $\ln L$ with respect to $\mu_j$ (and analogously for $\sigma_j$), set it to zero, and, motivated by the expression for $w_{ij}$ above, rewrite the result in terms of these class probabilities. Although this equation looks quite daunting, we can simplify it: the first factor is exactly the class probability $w_{ij}$, and solving for the parameters gives

$$\mu_j = \frac{\sum_i w_{ij}\, x_i}{\sum_i w_{ij}}, \qquad
\sigma_j^2 = \frac{\sum_i w_{ij}\,(x_i - \mu_j)^2}{\sum_i w_{ij}},$$

and from the normalization constraint that $\sum_{j} \alpha_j = 1$,

$$\alpha_j = \frac{1}{N} \sum_{i=1}^{N} w_{ij},$$

where $N$ is the number of data points. These expressions are implicit, because the $w_{ij}$ on the right-hand side depend on the parameters themselves; they are therefore solved iteratively in two alternating steps:
- M step: starting with a guess for the class probabilities $w_{ij}$, the values of $\mu_j$, $\sigma_j$, and $\alpha_j$ are estimated using the three expressions above. This is the "maximization" M step, which brings the parameters closer to the local maximum.
- E step: also known as the "expectation" step. Here the class probabilities $w_{ij}$ are updated from the new parameter values using the Bayes-rule expression above.
The algorithm is not very sensitive to the initial guess of parameter values. For example, setting all $\alpha_j$ equal and starting from crude estimates of the means and widths typically leads to the same final solution as a more careful initialization.
The EM algorithm has a rigorous foundation, and it is provable that it will indeed find a local maximum of the likelihood (though not necessarily the global one).
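To make the two steps concrete, here is a minimal NumPy sketch of one EM iteration for a one-dimensional mixture (an illustrative implementation, not code from the original text; the initialization and variable names are arbitrary choices). It is written as an E step followed by an M step, starting from an initial parameter guess, which is equivalent to the ordering described above.

import numpy as np
from scipy.stats import norm

def em_step(x, mu, sigma, alpha):
    # E step: class probabilities w_ij for every point i and component j
    dens = alpha * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, M)
    w = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted means, widths, and class weights
    w_sum = w.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w_sum
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w_sum)
    alpha = w_sum / len(x)
    return mu, sigma, alpha

# illustrative use: crude starting guesses, then iterate until the
# parameters stop changing appreciably
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-1, 1.5, 350), rng.normal(0, 1, 500)])
mu, sigma, alpha = np.array([-2.0, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(200):
    mu, sigma, alpha = em_step(x, mu, sigma, alpha)
print(mu, sigma, alpha)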
EXAMPLE#
Imagine we have a Gaussian mixture of three distributions: $\mathcal{N}(-1, 1.5)$ with 350 points, $\mathcal{N}(0, 1)$ with 500 points, and $\mathcal{N}(3, 0.5)$ with 150 points.
First we will define our distributions and combine them using `numpy.concatenate`. Then we will create models using `sklearn.mixture.GaussianMixture` that range from one class to ten classes and calculate the AIC and BIC to find the optimal number of classes for our data.
import numpy as np
from sklearn.mixture import GaussianMixture

# generate a sample drawn from three Gaussians
random_state = np.random.RandomState(seed=1)
X = np.concatenate([random_state.normal(-1, 1.5, 350),
                    random_state.normal(0, 1, 500),
                    random_state.normal(3, 0.5, 150)]).reshape(-1, 1)

# fit mixture models with 1 to 10 components and record the AIC and BIC
N = np.arange(1, 11)
models = [GaussianMixture(n_components=n).fit(X) for n in N]
AIC = [m.aic(X) for m in models]
BIC = [m.bic(X) for m in models]
Next, we'll plot our results. By using `np.argmin` on our AIC array, we can find the model with the optimal number of components. Calling `.score_samples` on this model gives the log of the probability density at each point, and exponentiating it yields the PDF of the full mixture (the sum of the Gaussians). We can then use `.predict_proba` to get the responsibilities, i.e., the probability that each point was drawn from each component; multiplying these by the total PDF gives the density contributed by each individual Gaussian. Afterward, we can plot `pdf`, our Gaussian mixture, and `pdf_individual`, the three separate Gaussians, along with the histogram of our data.
import matplotlib.pyplot as plt

plt.style.use('ggplot')
fig = plt.figure(figsize=(12, 5))

# evaluate the best-fit model on a grid
x = np.linspace(-8, 8, 1000)
M_best = models[np.argmin(AIC)]
logprob = M_best.score_samples(x.reshape(-1, 1))          # log of the total PDF
responsibilities = M_best.predict_proba(x.reshape(-1, 1))
pdf = np.exp(logprob)
pdf_individual = responsibilities * pdf[:, np.newaxis]    # per-component densities

labels = ['Best-fit Mixture', r'$\mathcal{N}(x|0,1)$',
          r'$\mathcal{N}(x|-1,1.5)$', r'$\mathcal{N}(x|3,0.5)$']

plt.hist(X, 100, density=True, histtype='stepfilled',
         alpha=0.4, color='steelblue', edgecolor='black')

# plot the full Gaussian mixture
plt.plot(x, pdf, label=labels[0])

# plot the individual Gaussians; the fit returns the components in an
# arbitrary order, so match each one to its true distribution by sorting
# the fitted means from left to right
order = np.argsort(M_best.means_.ravel())
for k, lab in zip(order, [labels[2], labels[1], labels[3]]):
    plt.plot(x, pdf_individual[:, k], label=lab)

plt.xlabel('$x$', fontsize=14)
plt.ylabel('$p(x)$', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=12);

Next, we'll plot the model selection criteria (the AIC and BIC) as a function of the number of components (left panel). Both the AIC and BIC are minimized for a three-component model. We'll also plot the probability that a given point is drawn from each class as a function of its position (right panel), with the vertical extent of each region proportional to this class probability.
fig = plt.figure(figsize=(13, 5))
fig.subplots_adjust(left=0.12, right=0.97,
                    bottom=0.21, top=0.9, wspace=0.5)

# left panel: AIC and BIC vs. number of components
ax = fig.add_subplot(121)
ax.plot(N, AIC, '-k', label='AIC')
ax.plot(N, BIC, '--k', label='BIC')
ax.set_xticks(N)
ax.set_xlabel('n. components')
ax.set_ylabel('information criterion')
ax.legend(loc=2)

# right panel: posterior probability of each component as a function of x
ax = fig.add_subplot(122)
p = responsibilities
p = p[:, (1, 0, 2)]  # rearrange the component order so the plot looks better
p = p.cumsum(1).T    # stack the class probabilities

ax.fill_between(x, 0, p[0], color='gray', alpha=0.3)
ax.fill_between(x, p[0], p[1], color='gray', alpha=0.5)
ax.fill_between(x, p[1], 1, color='gray', alpha=0.7)
ax.set_xlim(-6, 6)
ax.set_ylim(0, 1)
ax.set_xlabel('$x$')
ax.set_ylabel(r'$p({\rm class}|x)$')
ax.text(-5, 0.3, 'class 1', rotation='vertical')
ax.text(0, 0.5, 'class 2', rotation='vertical')
ax.text(3, 0.3, 'class 3', rotation='vertical')
plt.show()

Useful scikit-learn attributes#
Once we fit a model to our data, scikit-learn's `GaussianMixture` exposes several helpful attributes; to name a few, we can use `means_`, `covariances_`, and `weights_` to read off the fitted mean, variance, and weight of each component.
idx = [0, 1, 2]  # the three components of the best-fit model
means = [round(float(models[2].means_.ravel()[i]), 3) for i in idx]
covariances = [round(float(models[2].covariances_.ravel()[i]), 3) for i in idx]
weights = [round(float(models[2].weights_[i]), 3) for i in idx]
print(f"""means = {means}
covariances = {covariances}
weights = {weights}""")
means = [-1.437, 3.019, 0.173]
covariances = [1.377, 0.213, 0.832]
weights = [0.293, 0.156, 0.551]
How to choose the number of classes?#
What if the number of Gaussian components were unknown? Using the model selection methods discussed in §4.3, we can find the optimal number of components by fitting a set of `sklearn.mixture.GaussianMixture` models with different numbers of components. After this, we can call `m.aic(X)` or `m.bic(X)`, where `m` is a fitted model and `X` is the data, to compute the corresponding information criterion. In our example above, we see that as the number of components increases past three, both the AIC and the BIC begin to rise again: the small gain in likelihood no longer justifies the additional parameters, so the three-component model is preferred.
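Continuing the example above (using the N, AIC, and BIC arrays computed earlier), the preferred number of components can be read off directly:

# index of the model that minimizes each information criterion
print("best number of components by AIC:", N[np.argmin(AIC)])
print("best number of components by BIC:", N[np.argmin(BIC)])

For this data set both criteria point to three components, in agreement with the left panel of the figure above.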
The EM algorithm as a classification tool#
The right panel in our example above shows the class probability for the optimal model (with three classes) as a function of the position $x$.
Results analogous to our example above can be obtained in multidimensional cases, where the mixture involves multivariate Gaussian distributions. Here too, an optimal model can be used to assign a probabilistic classification to each data point; this and other classification methods are discussed in detail in chapter 9.
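In scikit-learn this classification is available directly: `predict` returns the most probable class for each point, and `predict_proba` returns the full set of class probabilities. A brief sketch, reusing the M_best model and data X from the example above:

# hard classification: index of the component with the largest class probability
hard_labels = M_best.predict(X)
# soft classification: one probability per component for each point
class_probs = M_best.predict_proba(X)
print(hard_labels[:10])
print(class_probs[:3].round(3))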
How to account for measurement errors?#
So far, we have assumed that the measurement errors for the $x_i$ are negligible compared to the widths of the underlying distributions.
We'll focus our discussion on Gaussian mixtures, and assume that the measurement errors follow Gaussian distributions with standard deviations $e_i$. If the errors are homoscedastic (all $e_i$ equal to some $e$), each mixture component is simply broadened: the observed widths satisfy $\sigma_{{\rm obs},j}^2 = \sigma_j^2 + e^2$, so one could fit the mixture as before and then "correct" the best-fit widths by subtracting $e^2$ in quadrature.

This correction procedure fails in the heteroscedastic case, where each point has its own $e_i$. Additionally, due to uncertainties in the best-fit values, it's possible that the best-fit value of $\sigma_{{\rm obs},j}$ comes out smaller than $e$, which would imply a negative $\sigma_j^2$.

A remedy is to account for measurement errors already in the model description: we can replace $\sigma_j$ with $(\sigma_j^2 + e_i^2)^{1/2}$ in the per-point Gaussians that enter the likelihood.
Following the same derivation steps, the new prescriptions for the M-step are now

$$\mu_j = \frac{\sum_i w_{ij}\, x_i / (\sigma_j^2 + e_i^2)}{\sum_i w_{ij} / (\sigma_j^2 + e_i^2)},$$

and, for the widths, the condition

$$\sum_i w_{ij}\,\frac{(x_i - \mu_j)^2 - (\sigma_j^2 + e_i^2)}{(\sigma_j^2 + e_i^2)^2} = 0,$$

while $\alpha_j$ is still the mean of the $w_{ij}$ over the data points (and the $w_{ij}$ themselves are computed with the broadened widths $(\sigma_j^2 + e_i^2)^{1/2}$). Compared to the error-free M-step expressions for $\mu_j$ and $\sigma_j$ above, each data point is now weighted by the inverse of its total variance $\sigma_j^2 + e_i^2$, and the update for $\sigma_j$ is no longer explicit: it must be solved numerically at each iteration.
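As a concrete illustration (this sketch is not from the original text; the function and variable names are arbitrary), the E step and the explicit part of the M step with per-point errors $e_i$ look as follows; the update for $\sigma_j$ is omitted because, as noted above, it is implicit and would require a numerical root finder.

import numpy as np
from scipy.stats import norm

def em_step_with_errors(x, e, mu, sigma, alpha):
    # convolved widths: each point sees each component broadened by its own error
    s = np.sqrt(sigma ** 2 + e[:, None] ** 2)                 # shape (N, M)
    # E step: class probabilities computed with the broadened components
    dens = alpha * norm.pdf(x[:, None], loc=mu, scale=s)
    w = dens / dens.sum(axis=1, keepdims=True)
    # explicit M-step updates: error-weighted means and class weights
    inv_var = w / s ** 2
    mu = (inv_var * x[:, None]).sum(axis=0) / inv_var.sum(axis=0)
    alpha = w.mean(axis=0)
    return w, mu, alpha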
Non-Gaussian mixture models#
The EM algorithm is not confined to Gaussian mixtures. As the derivation above shows, the expression for the class probabilities $w_{ij}$ holds for any parametrized component density; only the M-step update equations change when the components are not Gaussian.