
# Mutual information

In probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. When logarithms to base 2 are used, mutual information is measured in bits.

Informally, mutual information measures the information about X that is shared with Y. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa, so the mutual information equals the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.

Formally, in the discrete case, suppose the joint probability mass function of X and Y is p, with p(x, y) = Pr(X = x, Y = y), the probability mass function of X alone is f, with f(x) = Pr(X = x), and the probability mass function of Y alone is g, with g(y) = Pr(Y = y). Then the mutual information of X and Y, written I(X, Y), is defined by:

[itex] I(X,Y) = \sum_{x,y} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)}. \![itex]
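As a quick illustration, the discrete formula can be computed directly from a joint distribution table. The following is a minimal sketch in Python (the `mutual_information` helper and the example distribution are illustrative, not part of any standard library):

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as {(x, y): probability}."""
    f, g = {}, {}  # marginal pmfs of X and Y
    for (x, y), p in joint.items():
        f[x] = f.get(x, 0.0) + p
        g[y] = g.get(y, 0.0) + p
    # Terms with p(x, y) = 0 contribute nothing, since p log p -> 0 as p -> 0
    return sum(p * math.log2(p / (f[x] * g[y]))
               for (x, y), p in joint.items() if p > 0)

# A perfectly correlated fair bit: I(X, Y) = H(X) = 1 bit
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(identical))  # 1.0
```

Note that pairs with zero joint probability are simply skipped, matching the convention that 0 · log 0 = 0 in the sum.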

In the continuous case, where p, f, and g are probability density functions, the summation is replaced by a double integral:

[itex] I(X,Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)} \; dx \, dy. \![itex]
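The continuous formula can be checked numerically against a known closed form: for a standard bivariate normal with correlation ρ, the mutual information is −½ log₂(1 − ρ²). The sketch below uses a crude midpoint-rule double integral, purely for illustration (the helper names are invented):

```python
import math

def biv_normal_pdf(x, y, rho):
    # Standard bivariate normal density with correlation rho
    z = (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2))
    return math.exp(-z) / (2 * math.pi * math.sqrt(1 - rho ** 2))

def std_normal_pdf(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def mi_numeric(rho, lo=-8.0, hi=8.0, n=400):
    # Midpoint-rule approximation of the double integral of p * log2(p / (f g))
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        for j in range(n):
            y = lo + (j + 0.5) * h
            p = biv_normal_pdf(x, y, rho)
            if p > 0:  # skip underflowed tail cells
                total += p * math.log2(p / (std_normal_pdf(x) * std_normal_pdf(y))) * h * h
    return total

rho = 0.6
exact = -0.5 * math.log2(1 - rho ** 2)  # known closed form for bivariate Gaussians
print(mi_numeric(rho), exact)
```

With this grid the numerical value agrees with the closed form to well under a thousandth of a bit.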

Mutual information is a measure of independence in the following sense: I(X, Y) = 0 if and only if X and Y are independent random variables. One direction is easy to see: if X and Y are independent, then p(x, y) = f(x) × g(y), and therefore each term of the sum vanishes:

[itex] \log \frac{p(x,y)}{f(x)\,g(y)} = \log 1 = 0. \![itex]

Moreover, mutual information is nonnegative (i.e. I(X,Y) ≥ 0; see below) and symmetric (i.e. I(X,Y) = I(Y,X)).
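Both properties can be verified numerically. The sketch below (the `mi` helper is illustrative) shows that an independent joint distribution yields zero mutual information, and that swapping the roles of X and Y leaves the value unchanged:

```python
import math

def mi(joint):
    """Mutual information in bits of a joint pmf {(x, y): probability}."""
    f, g = {}, {}
    for (x, y), p in joint.items():
        f[x] = f.get(x, 0.0) + p
        g[y] = g.get(y, 0.0) + p
    return sum(p * math.log2(p / (f[x] * g[y]))
               for (x, y), p in joint.items() if p > 0)

# Two independent fair bits: p(x, y) = f(x) g(y), so every log term is log 1 = 0
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mi(independent))  # 0.0

# Symmetry: I(X, Y) = I(Y, X)
swapped = {(y, x): p for (x, y), p in independent.items()}
print(mi(independent) == mi(swapped))  # True
```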

Several generalizations of mutual information to more than two random variables have been proposed, but a widely agreed on definition has not yet emerged.

## Relation to other quantities

Mutual information can be equivalently expressed as

[itex] I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) \![itex]

where H(X) and H(Y) are the entropies of X and Y, H(X|Y) and H(Y|X) are conditional entropies, and H(X,Y) is the joint entropy of X and Y.

Since H(X) ≥ H(X|Y) (conditioning on Y can never increase the entropy of X), this characterization is consistent with the nonnegativity property stated above.
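The entropy identity can be checked on a small example. In the sketch below, a dependent joint distribution over two bits (invented for illustration) gives the same value whether computed from the definition or from H(X) + H(Y) − H(X, Y):

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A dependent joint distribution over {0, 1} x {0, 1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
fx = {0: 0.5, 1: 0.5}  # marginal of X
gy = {0: 0.5, 1: 0.5}  # marginal of Y

# Directly from the definition
I = sum(p * math.log2(p / (fx[x] * gy[y])) for (x, y), p in joint.items())

# Via entropies: I(X, Y) = H(X) + H(Y) - H(X, Y)
I_alt = entropy(fx) + entropy(gy) - entropy(joint)

print(I, I_alt)  # the two values agree
```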

Mutual information can also be expressed in terms of the Kullback-Leibler divergence between the joint distribution of two random variables X and Y and the product of their marginal distributions. Let q(x, y) = f(x) × g(y); then

[itex] I(X,Y) = \mathit{KL}(p, q). [itex]

Furthermore, let h_y(x) = p(x, y) / g(y) be the conditional probability mass function of X given Y = y. Then

[itex] I(X,Y) = \sum_y g(y) \sum_x h_y(x) \times \log_2 \frac{h_y(x)}{f(x)} \![itex]
[itex] = \sum_y g(y) \; \mathit{KL}(h_y,f) \![itex]
[itex] = \mathrm{E}_Y[\mathit{KL}(h_y,f)]. \![itex]

Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence between the conditional distribution h_y of X given Y and the marginal distribution f of X: the more the distributions h_y differ from f on average, the greater the information gain.
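This decomposition can be verified directly: the g(y)-weighted average of KL(h_y, f) reproduces the mutual information. The small distribution below is illustrative:

```python
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
f = {0: 0.5, 1: 0.5}  # marginal pmf of X
g = {0: 0.5, 1: 0.5}  # marginal pmf of Y

def kl(p, q):
    """Kullback-Leibler divergence in bits between pmfs over the same outcomes."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

# I(X, Y) = E_Y[ KL(h_y, f) ] where h_y(x) = p(x, y) / g(y)
total = 0.0
for y, gy in g.items():
    h_y = {x: joint[(x, y)] / gy for x in f}  # conditional pmf of X given Y = y
    total += gy * kl(h_y, f)

print(total)  # equals I(X, Y) for this joint distribution
```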

## Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependence), which is often equivalent to minimizing conditional entropy. Examples include:

• Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
• Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan) and a second image that needs to be put into the same coordinate system as the reference image, the second image is deformed until the mutual information between it and the reference image is maximized.
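As a toy sketch of the registration idea (real registration works on 2D or 3D intensity histograms; the 1D signals and the `mi_bits` helper below are purely illustrative), one can search for the shift that maximizes a histogram estimate of mutual information between paired samples:

```python
import math

def mi_bits(xs, ys):
    """Histogram (plug-in) estimate of I(X, Y) in bits from paired samples."""
    n = len(xs)
    joint, fx, gy = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        fx[x] = fx.get(x, 0) + 1
        gy[y] = gy.get(y, 0) + 1
    # (c/n) / ((fx/n)(gy/n)) simplifies to c * n / (fx * gy)
    return sum((c / n) * math.log2(c * n / (fx[x] * gy[y]))
               for (x, y), c in joint.items())

# Toy "registration": find the shift that best aligns two discretized signals
reference = [0, 0, 1, 2, 2, 1, 0, 0, 0, 0]
moving    = [0, 0, 0, 0, 1, 2, 2, 1, 0, 0]  # reference shifted right by 2

best = max(range(-2, 3),
           key=lambda s: mi_bits(reference[2:8], moving[2 + s:8 + s]))
print(best)  # 2
```

Here the mutual information peaks at the true shift of 2, where the overlapping windows line up exactly.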

## References

• Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
• Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989. http://www.aclweb.org/anthology/P89-1010

Copyright © 2000-2008 Slider.com. All rights reserved. Content is distributed under the GNU Free Documentation License.