Stefano Cerri

Research Fellow @ Martinos

scerri at mgh dot harvard dot edu

Gaussian Mixture Model for brain MRI Segmentation

In recent decades, Magnetic Resonance Imaging (MRI) has become a central tool in clinical brain studies. Most of these studies rely on accurate and robust image segmentation for visualizing brain structures and for computing volumetric measures. Manual labeling of the human brain is a tedious and lengthy process; just to give you an idea, segmenting a 3D brain MR image into 40 brain structures can take up to 1 week, an amount of time that doctors and radiologists simply don't have. For these reasons, many automatic tools have been proposed over the years, such as FreeSurfer, SPM, and FSL.

In this post, I'll describe the Gaussian Mixture Model (GMM), a straightforward and fascinating model, for automatically segmenting MR brain images into different brain structures: white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). I'll first introduce the segmentation problem we want to solve, then explain how a GMM works and how to fit its parameters to the data. Finally, I'll show an example of brain segmentation using a GMM.

Segmentation problem

We are given a matrix \( \textbf{D} = \{\textbf{d}_{1}, \cdots , \textbf{d}_{I}\} \) containing the intensities of a brain MR image with \( N \) contrasts and \( I \) voxels. Our goal is to find the segmentation labels \( \textbf{l} = (l_{1}, \cdots, l_{I})^{T} \) for each voxel in the image. Note that for each voxel we have \( K \) possible brain structures to choose from, \( l_{i} \in \{1, \cdots, K\} \). We can obtain these labels by computing the posterior probability of the labels \( \textbf{l} \) given the data \( \textbf{D} \). Using Bayes' rule, we can write:

$$ p(\textbf{l} | \textbf{D}) \propto p(\textbf{D} | \textbf{l}) p(\textbf{l}) $$

where \( p(\textbf{D} | \textbf{l}) \) is the likelihood function and \( p(\textbf{l}) \) is the segmentation prior. Intuitively, our segmentation is proportional to how well these labels explain the data, times how probable each brain structure is a priori.

We now need to define our likelihood function and our segmentation prior. It turns out that we can use a Gaussian Mixture Model for explaining both of them. Let’s see how it works!

Gaussian Mixture Model

We assume that a Gaussian Mixture Model generates our data:

$$ p(\textbf{D}) = \prod_{i=1}^{I} \sum_{k=1}^{K} \pi_{k} \mathcal{N} ( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k} ), $$

where the mixture coefficients \( \pi_{k} \), with \( \sum_{k=1}^{K} \pi_{k}=1 \) and \( 0 \leq \pi_{k} \leq 1 \), define our segmentation prior, while the standard multivariate Gaussian distribution

$$ \mathcal{N} ( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k} ) = \frac{1}{\left( 2 \pi \right)^{N/2} |\boldsymbol{\Sigma}_{k}|^{1/2}} \exp\left\{ -\frac{1}{2} \left( \textbf{d}_{i} - \boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1} \left( \textbf{d}_{i} - \boldsymbol{\mu}_{k}\right) \right\} $$

is our likelihood function.

Our model parameters \( \boldsymbol{\mu}_{k} \), \( \boldsymbol{\Sigma}_{k} \), and \( \pi_{k} \) are, respectively, the mean, the covariance matrix, and the mixture coefficient of the \( k^{th} \) Gaussian. We can think of the mixture coefficient \( \pi_{k} \) as the prior probability that brain structure \( k \) occurs in the brain.
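
To make this a bit more concrete, here is a minimal sketch in Python (using NumPy and SciPy; the array shapes and variable names are my own choices, not the notation of any particular library) of how the weighted class likelihoods \( \pi_{k} \mathcal{N}( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}) \) could be evaluated for every voxel:

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_likelihoods(D, means, covs, pis):
    """Evaluate pi_k * N(d_i | mu_k, Sigma_k) for every voxel i and class k.

    D     : (I, N) array of voxel intensities (I voxels, N contrasts)
    means : (K, N) array of class means
    covs  : (K, N, N) array of class covariance matrices
    pis   : (K,) array of mixture coefficients (each in [0, 1], summing to 1)
    Returns an (I, K) array of weighted class likelihoods.
    """
    I, K = D.shape[0], len(pis)
    lik = np.empty((I, K))
    for k in range(K):
        lik[:, k] = pis[k] * multivariate_normal.pdf(D, mean=means[k], cov=covs[k])
    return lik

# The GMM density of a single voxel is the sum over classes,
# p(d_i) = sum_k pi_k * N(d_i | mu_k, Sigma_k):
# p_voxels = class_likelihoods(D, means, covs, pis).sum(axis=1)
```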

We have now defined the GMM and its parameters. Now the question is: How can we decide which parameters are the best fit for our input image?

Fitting the parameters to the data

Our goal is to find the parameters that maximize the likelihood function. We can write the log-likelihood function as:

$$ \ln p ( \textbf{D} | \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{i=1}^{I} \ln \left\{ \sum_{k=1}^{K} \pi_{k} \mathcal{N} ( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k} ) \right\}. $$

Due to the summation over \( k \) inside the logarithm, we can't obtain a closed-form solution for the maximum. A very compelling way of maximizing the log-likelihood is the so-called Expectation-Maximization (EM) algorithm. In this post, I'm not going to give a formal description of the algorithm, but rather an intuitive way of understanding how the EM updates work. For more details, have a look at this reference.

Let's start by setting the derivative of the log-likelihood with respect to \( \boldsymbol{\mu}_{k} \) to zero:

$$ \sum_{i=1}^{I} \frac{\pi_{k} \mathcal{N}( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})}{\sum_{j=1}^{K} \pi_{j} \mathcal{N} ( \textbf{d}_{i} | \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j} )} \boldsymbol{\Sigma}_{k}^{-1} ( \textbf{d}_{i} - \boldsymbol{\mu}_{k} ) = 0. $$

Rearranging the equation, we can write

$$ \boldsymbol{\mu}_{k} \gets \frac{\sum_{i=1}^{I}w_{k}^{i}\textbf{d}_{i}} {N_{k}}, $$

where I defined

$$ w_{k}^{i} = \frac{\pi_{k} \mathcal{N}( \textbf{d}_{i} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})}{\sum_{j=1}^{K} \pi_{j} \mathcal{N} ( \textbf{d}_{i} | \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j} )} \text{ and } N_{k} = \sum_{i=1}^{I}w_{k}^{i}. $$

\( N_{k} \) can be interpreted as the effective number of voxels assigned to class \( k \), while \( w_{k}^{i} \) is the responsibility of class \( k \) for generating voxel \( i \). Notice that the mean of the Gaussian distribution associated with the \( k^{th} \) label is the weighted mean of the intensities of the voxels attributed to that label.
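
In code, computing the responsibilities (the E-step of the algorithm introduced above) boils down to normalizing the weighted class likelihoods over the classes. Here is a minimal sketch, reusing the hypothetical class_likelihoods helper from the earlier snippet:

```python
import numpy as np

def e_step(D, means, covs, pis):
    """Compute the responsibilities w_k^i and the effective class sizes N_k."""
    lik = class_likelihoods(D, means, covs, pis)   # (I, K): pi_k * N(d_i | mu_k, Sigma_k)
    w = lik / lik.sum(axis=1, keepdims=True)       # normalize over classes k for each voxel i
    N = w.sum(axis=0)                              # N_k = sum_i w_k^i, shape (K,)
    return w, N
```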

We can now apply the same reasoning to \( \boldsymbol{\Sigma}_{k} \) and \( \pi_{k} \) and obtain

$$ \boldsymbol{\Sigma}_{k} \gets \frac{1}{N_{k}} \sum_{i=1}^{I} w_{k}^{i} ( \textbf{d}_{i} - \boldsymbol{\mu}_{k} ) ( \textbf{d}_{i} - \boldsymbol{\mu}_{k} )^{T} \text{ and } \pi_{k} \gets \frac{N_{k}}{I}. $$

We can think of the update of the covariance of the Gaussian distribution associated with the \( k^{th} \) label as the weighted variance of the intensities of the voxels attributed to that label. The prior for each brain class, in turn, is the fraction of voxels currently assigned to that class.
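
A sketch of the corresponding parameter updates (the M-step), using the same array conventions as before; the small diagonal jitter is my own addition to keep the covariances invertible and is not part of the updates derived above:

```python
import numpy as np

def m_step(D, w):
    """Re-estimate means, covariances, and mixture coefficients from the responsibilities w (I, K)."""
    I, N = D.shape
    K = w.shape[1]
    Nk = w.sum(axis=0)                                    # effective number of voxels per class
    means = (w.T @ D) / Nk[:, None]                       # weighted means, shape (K, N)
    covs = np.empty((K, N, N))
    for k in range(K):
        diff = D - means[k]                               # (I, N)
        covs[k] = (w[:, k, None] * diff).T @ diff / Nk[k] # weighted covariance of class k
        covs[k] += 1e-6 * np.eye(N)                       # tiny jitter for numerical stability
    pis = Nk / I                                          # fraction of voxels assigned to each class
    return means, covs, pis
```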

Note that the update equations we derived are not in closed form, as they depend on the responsibilities \( w_{k}^{i} \), which in turn depend on the parameters; the EM algorithm therefore alternates between computing the responsibilities and updating the parameters. Unfortunately, this does not guarantee convergence to a global optimum. Nevertheless, we have theoretical guarantees that the likelihood never decreases from one iteration of the EM algorithm to the next (see again this reference).

EM in a nutshell

We can summarize the EM algorithm as the following steps (a minimal code sketch follows the list):

  • Initialize the parameters of the GMM ( \( \boldsymbol{\mu}_{k} \), \( \boldsymbol{\Sigma}_{k} \) and \( \pi_{k} \));
  • Expectation step (E-step): estimate the responsibilities \( w_{k}^{i} \) using the values of the current parameters;
  • Maximization step (M-step): Re-estimate the parameters using the updates for \( \boldsymbol{\mu}_{k} \), \( \boldsymbol{\Sigma}_{k} \) and \( \pi_{k} \) with the current responsibilities \( w_{k}^{i} \) computed in the E-step;
  • Iterate between the E-step and M-step until convergence (or a predefined number of iterations).
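
Putting the pieces together, a bare-bones EM loop could look like the sketch below. It reuses the hypothetical e_step, m_step, and class_likelihoods helpers from the earlier snippets, and the relative convergence tolerance is an arbitrary choice of mine:

```python
import numpy as np

def fit_gmm(D, means, covs, pis, max_iter=100, tol=1e-5):
    """Run EM until the log-likelihood stops improving (or max_iter is reached)."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities given the current parameters
        w, _ = e_step(D, means, covs, pis)
        # M-step: re-estimate the parameters given the responsibilities
        means, covs, pis = m_step(D, w)
        # The log-likelihood is guaranteed not to decrease from one iteration to the next
        ll = np.log(class_likelihoods(D, means, covs, pis).sum(axis=1)).sum()
        if np.isfinite(prev_ll) and ll - prev_ll < tol * abs(prev_ll):
            break
        prev_ll = ll
    # responsibilities under the final parameters
    w, _ = e_step(D, means, covs, pis)
    return means, covs, pis, w
```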

An example of brain segmentation using a GMM

Let's now have a look at what a GMM is capable of doing given some brain images. For the following example, I'm using a simulated scan from BrainWeb. Here is the image, with one slice shown in axial, coronal, and sagittal view:

Input image: Axial, Coronal, and Sagittal view.

If we look at the histogram of the image, we can see that it has three peaks, most probably representing CSF, GM, and WM:

Image intensities histogram.

The fact that the histogram has three peaks suggests setting the number of GMM components to 3. If we initialize the GMM with means spread over the image intensity range and with wide variances, we obtain the following:

Image intensities histogram where we superimpose our GMM with our initialization of the parameters.
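
One simple way to obtain such an initialization for a single-contrast image could be the following sketch; the spacing of the means and the width of the initial variances are arbitrary choices of mine:

```python
import numpy as np

def init_gmm(D, K=3):
    """Spread the initial means over the intensity range and start with wide, identical variances."""
    d = D[:, 0]                                          # single contrast: intensities as a 1D array
    means = np.linspace(d.min(), d.max(), K + 2)[1:-1]   # K means evenly spaced inside the range
    means = means[:, None]                               # shape (K, 1)
    wide_var = ((d.max() - d.min()) / K) ** 2            # deliberately large initial variance
    covs = np.tile(wide_var, (K, 1, 1))                  # shape (K, 1, 1)
    pis = np.full(K, 1.0 / K)                            # uniform mixture coefficients
    return means, covs, pis
```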

Even if this initialization is far from perfect, we can already compute our first brain segmentation:

Initial segmentation with our initialization of the GMM parameters.
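
This first (and every subsequent) hard segmentation is obtained by assigning each voxel to the class with the highest responsibility. A sketch, assuming the image intensities have already been flattened into the (I, N) array D and that a hypothetical volume_shape holds the original 3D dimensions:

```python
import numpy as np

# responsibilities under the current (here: initial) parameters
w, _ = e_step(D, means, covs, pis)

# hard labels: for each voxel, pick the class with the largest responsibility
labels = np.argmax(w, axis=1)                 # shape (I,), values in {0, ..., K-1}

# reshape back to the original volume to visualize the segmentation
# segmentation = labels.reshape(volume_shape)  # volume_shape: original 3D dimensions (hypothetical)
```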

We can now use the EM algorithm to find a reasonable estimate of our GMM parameters. If we run the algorithm for 80 iterations, we obtain the following:

Segmentations, original image, and superimposed GMM on top of the image histogram for every 10 iterations of the EM algorithm.

This is a reasonable brain segmentation for such a simple model. If we look at the histogram with our GMM superimposed on top, we can see that the model fits the data quite well.

We can also look at the posterior distribution of each class:

Posterior probability maps of each of the three classes.
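
These maps are nothing more than the responsibilities \( w_{k}^{i} \) from the E-step, reshaped back to the image grid. A minimal sketch, with the same assumed variables as in the previous snippets:

```python
# The posterior probability maps are simply the E-step responsibilities:
# one column of w per class, reshaped back to the original 3D volume.
w, _ = e_step(D, means, covs, pis)            # (I, K) posterior probabilities
# posterior_maps = [w[:, k].reshape(volume_shape) for k in range(w.shape[1])]
```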

We can see that the model produces quite accurate probability maps for the three classes. However, we can also notice that the segmentation between WM and GM has some imperfections (top-left axial view and bottom-right sagittal view), due to the bias field introduced by the MRI scanner (you can look at this post on how to model the bias field artifact).

Final thoughts on GMMs

As shown here, the GMM is a simple yet powerful model for automatically segmenting brain images. Even though the example above uses only one input contrast, the GMM can handle multiple input contrasts as well as images coming from different scanners.

One of the main limitations of the GMM is that it does not take into account the spatial arrangement of the different structures. This can produce anatomically implausible segmentations.

One way to overcome this limitation is to use a spatial prior, such as a Markov Random Field (MRF), or more sophisticated methods. However, this introduces dependencies between voxels and makes the computation of the posterior distribution more complicated.

Code

An implementation of this model is available on my GitHub repository (link), where you can replicate the same experiment and obtain the same figures. The code is written in a Jupyter notebook that you can run on your laptop in less than a minute.
