
Correlation Coefficient

The correlation coefficient is a quantity that measures the quality of a Least Squares Fitting to the original data. To define the correlation coefficient, first consider the sums of squares ${\rm ss}_{xx}$ and ${\rm ss}_{yy}$ and the sum of products ${\rm ss}_{xy}$ of a set of $n$ data points $(x_i, y_i)$ about their respective means,

\begin{displaymath}
{\rm ss}_{xx} \equiv \sum (x_i-\bar x)^2 = \sum x^2-2\bar x\sum x+\sum {\bar x}^2 = \sum x^2-2n{\bar x}^2+n{\bar x}^2 = \sum x^2-n{\bar x}^2
\end{displaymath} (1)

\begin{displaymath}
{\rm ss}_{yy} \equiv \sum (y_i-\bar y)^2 = \sum y^2-2\bar y\sum y+\sum {\bar y}^2 = \sum y^2-2n{\bar y}^2+n{\bar y}^2 = \sum y^2-n{\bar y}^2
\end{displaymath} (2)

\begin{displaymath}
{\rm ss}_{xy} \equiv \sum (x_i-\bar x)(y_i-\bar y) = \sum (x_iy_i-\bar xy_i-x_i\bar y+\bar x\bar y) = \sum xy-n\bar x\bar y-n\bar x\bar y+n\bar x\bar y = \sum xy-n\bar x\bar y.
\end{displaymath} (3)
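For concreteness, here is a minimal Python sketch evaluating equations (1)-(3) for a small made-up data set (the five points are purely illustrative):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n                       # mean of x: 3.0
ybar = sum(y) / n                       # mean of y: 4.0

ss_xx = sum(xi * xi for xi in x) - n * xbar**2                     # eq. (1): 10.0
ss_yy = sum(yi * yi for yi in y) - n * ybar**2                     # eq. (2): 6.0
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar     # eq. (3): 6.0

The same five points are reused in the sketches below so that the intermediate quantities can be checked against one another.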

For linear Least Squares Fitting, the Coefficient $b$ in
\begin{displaymath}
y=a+bx
\end{displaymath} (4)

is given by
\begin{displaymath}
b={n\sum xy-\sum x\sum y\over n\sum x^2-\left({\sum x}\right)^2} = {{\rm ss}_{xy}\over {\rm ss}_{xx}},
\end{displaymath} (5)

and the Coefficient $b'$ in
\begin{displaymath}
x=a'+b'y
\end{displaymath} (6)

is given by
\begin{displaymath}
b'={n\sum xy-\sum x\sum y\over n\sum y^2-\left({\sum y}\right)^2}.
\end{displaymath} (7)
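Both slopes can be checked numerically; a minimal sketch using the same made-up points as above (the last line confirms that the two expressions for $b$ in equation (5) agree):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

b  = (n * sxy - sx * sy) / (n * sxx - sx**2)   # eq. (5): 0.6
bp = (n * sxy - sx * sy) / (n * syy - sy**2)   # eq. (7): 1.0 (the slope b')

xbar, ybar = sx / n, sy / n
assert abs(b - (sxy - n * xbar * ybar) / (sxx - n * xbar**2)) < 1e-12  # b = ss_xy / ss_xx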


\begin{figure}\begin{center}\BoxedEPSF{CorrelationCoefficient.epsf}\end{center}\end{figure}

The correlation coefficient $r$ (sometimes also denoted $R$) is then defined by

\begin{displaymath}
r\equiv \sqrt{bb'} = {n\sum xy-\sum x\sum y\over \sqrt{\left[{n\sum x^2-\left({\sum x}\right)^2}\right]\left[{n\sum y^2-\left({\sum y}\right)^2}\right]}},
\end{displaymath} (8)

which can be written more simply as
\begin{displaymath}
r^2={{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}{\rm ss}_{yy}}.
\end{displaymath} (9)
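As a sanity check on equations (8) and (9), here is a short sketch (same illustrative data) confirming that $r^2$ computed from the sums of squares equals the square of $\sqrt{bb'}$:

import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

b, bp = ss_xy / ss_xx, ss_xy / ss_yy   # the two regression slopes
r = math.sqrt(b * bp)                  # eq. (8): ~0.7746 (r takes the sign of ss_xy)
r2 = ss_xy**2 / (ss_xx * ss_yy)        # eq. (9): 0.6
assert abs(r * r - r2) < 1e-12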

The correlation coefficient is also known as the Product-Moment Coefficient of Correlation or Pearson's Correlation. The correlation coefficients for linear fits to increasingly noisy data are shown above.


The correlation coefficient has an important physical interpretation. To see this, define

\begin{displaymath}
A\equiv \left({\sum x^2-n{\bar x}^2}\right)^{-1}
\end{displaymath} (10)

and denote the ``expected'' value for $y_i$ as $\hat y_i$. Sums of $\hat y_i$ are then
\begin{displaymath}
\hat y_i = a+bx_i = \bar y-b\bar x+bx_i = \bar y+b(x_i-\bar x)
= A(\bar y\sum x^2-\bar x\sum xy+x_i\sum xy-n\bar x\bar yx_i)
= A[\bar y\sum x^2+(x_i-\bar x)\sum xy-n\bar x\bar yx_i]
\end{displaymath} (11)

\begin{displaymath}
\sum \hat y_i = A(n\bar y\sum x^2-n^2{\bar x}^2\bar y) = n\bar y
\end{displaymath} (12)

\begin{displaymath}
\sum {\hat y_i}^2 = A^2[n{\bar y}^2(\sum x^2)^2-n^2{\bar x}^2{\bar y}^2\sum x^2
-2n\bar x\bar y(\sum xy)(\sum x^2)+2n^2{\bar x}^3\bar y\sum xy
+(\sum x^2)(\sum xy)^2-n{\bar x}^2(\sum xy)^2]
\end{displaymath} (13)

\begin{displaymath}
\sum y_i{\hat y}_i = A\sum [y_i\bar y\sum x^2+y_i(x_i-\bar x)\sum xy-n\bar x\bar yx_iy_i]
= A[n{\bar y}^2\sum x^2+(\sum xy)^2-2n\bar x\bar y\sum xy].
\end{displaymath} (14)
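Equation (12) says the fitted values $\hat y_i$ have the same sum, and hence the same mean, as the observed $y_i$; a quick check with the same made-up data:

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b = ss_xy / ss_xx
yhat = [ybar + b * (xi - xbar) for xi in x]   # fitted values, eq. (11)
assert abs(sum(yhat) - n * ybar) < 1e-12      # eq. (12): sum of yhat is n*ybar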

The sum of squared residuals is then
$\displaystyle {\rm SSR}$ $\textstyle \equiv$ $\displaystyle \Sigma({\hat y}_i-\bar y)^2=\Sigma({\hat y_i}^2-2\bar y{\hat y_i}+\bar y^2)$  
  $\textstyle =$ $\displaystyle A^2(\Sigma xy-n\bar x\bar y)^2(\Sigma x^2-n\bar x^2) = {(\Sigma xy-n\bar x\bar y)^2\over \Sigma x^2-n\bar x^2}$  
  $\textstyle =$ $\displaystyle b\,{\rm ss}_{xy} = {{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}}
= {\rm ss}_{yy}r^2=b^2{\rm ss}_{xx},$ (15)
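The chain of identities in equation (15) can be verified directly (same illustrative data; ${\rm SSR}$ comes out to 3.6 by every route):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b = ss_xy / ss_xx
r2 = ss_xy**2 / (ss_xx * ss_yy)
yhat = [ybar + b * (xi - xbar) for xi in x]

ssr = sum((yh - ybar)**2 for yh in yhat)                     # direct definition: 3.6
for alt in (b * ss_xy, ss_xy**2 / ss_xx, ss_yy * r2, b**2 * ss_xx):
    assert abs(ssr - alt) < 1e-12                            # eq. (15)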

The sum of squared errors is

\begin{displaymath}
{\rm SSE} \equiv \sum (y_i-\hat y_i)^2 = \sum (y_i-\bar y+b\bar x-bx_i)^2 = \sum [y_i-\bar y-b(x_i-\bar x)]^2
= \sum (y_i-\bar y)^2+b^2\sum (x_i-\bar x)^2-2b\sum (x_i-\bar x)(y_i-\bar y)
= {\rm ss}_{yy}+b^2{\rm ss}_{xx}-2b\,{\rm ss}_{xy}.
\end{displaymath} (16)
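Likewise for equation (16): the residual sum of squares computed point by point matches the closed form in the sums of squares (2.4 with the illustrative data):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b = ss_xy / ss_xx
yhat = [ybar + b * (xi - xbar) for xi in x]

sse = sum((yi - yh)**2 for yi, yh in zip(y, yhat))                 # direct: 2.4
assert abs(sse - (ss_yy + b**2 * ss_xx - 2 * b * ss_xy)) < 1e-12   # eq. (16)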

But
\begin{displaymath}
b = {{\rm ss}_{xy}\over {\rm ss}_{xx}}
\end{displaymath} (17)

\begin{displaymath}
r^2 = {{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}{\rm ss}_{yy}},
\end{displaymath} (18)

so
\begin{displaymath}
{\rm SSE} = {\rm ss}_{yy}+{{{\rm ss}_{xy}}^2\over {{\rm ss}_{xx}}^2}\,{\rm ss}_{xx}-2\,{{\rm ss}_{xy}\over {\rm ss}_{xx}}\,{\rm ss}_{xy}
= {\rm ss}_{yy}-{{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}}
\end{displaymath} (19)

\begin{displaymath}
{\rm SSE} = {\rm ss}_{yy}\left({1-{{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}{\rm ss}_{yy}}}\right) = {\rm ss}_{yy}(1-r^2) = {s_y}^2-{s_{\hat y}}^2,
\end{displaymath} (20)

and
\begin{displaymath}
{\rm SSE}+{\rm SSR}={\rm ss}_{yy}(1-r^2)+{\rm ss}_{yy}r^2={\rm ss}_{yy}.
\end{displaymath} (21)
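Equation (21) is the partition of the total sum of squares; with the illustrative data, $2.4 + 3.6 = 6.0 = {\rm ss}_{yy}$. A short sketch checking both (20) and (21):

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b = ss_xy / ss_xx
r2 = ss_xy**2 / (ss_xx * ss_yy)
yhat = [ybar + b * (xi - xbar) for xi in x]

sse = sum((yi - yh)**2 for yi, yh in zip(y, yhat))
ssr = sum((yh - ybar)**2 for yh in yhat)
assert abs(sse - ss_yy * (1 - r2)) < 1e-12     # eq. (20)
assert abs(sse + ssr - ss_yy) < 1e-12          # eq. (21)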


The square of the correlation coefficient $r^2$ is therefore given by

\begin{displaymath}
r^2\equiv {{\rm SSR}\over {\rm ss}_{yy}} = {{{\rm ss}_{xy}}^2\over {\rm ss}_{xx}{\rm ss}_{yy}} = {(\sum xy-n\bar x\bar y)^2\over (\sum x^2-n{\bar x}^2)(\sum y^2-n{\bar y}^2)}.
\end{displaymath} (22)

In other words, $r^2$ is the proportion of ${\rm ss}_{yy}$ which is accounted for by the regression.
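As an independent check, the standard library's statistics.correlation (available in Python 3.10 and later) returns Pearson's $r$ directly, and its square matches equation (22) on the illustrative data (where $r^2 = 3.6/6.0 = 0.6$, i.e. the fit accounts for 60% of the variation in $y$):

import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

r2 = ss_xy**2 / (ss_xx * ss_yy)                              # eq. (22): 0.6
assert abs(statistics.correlation(x, y)**2 - r2) < 1e-12     # library agrees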


If there is complete correlation, then the lines obtained by solving for best-fit $(a, b)$ and $(a', b')$ coincide (since all data points lie on them), so solving (6) for $y$ and equating to (4) gives

\begin{displaymath}
y=-{a'\over b'}+{x\over b'} = a+bx.
\end{displaymath} (23)

Therefore, $a=-a'/b'$ and $b=1/b'$, giving
\begin{displaymath}
r^2=bb'=1.
\end{displaymath} (24)
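A quick check of equation (24) with perfectly correlated made-up data (every point placed on the hypothetical line $y = 2x + 1$):

x = [1, 2, 3, 4, 5]
y = [2 * xi + 1 for xi in x]                 # all points lie on one line
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum(xi * xi for xi in x) - n * xbar**2
ss_yy = sum(yi * yi for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

b, bp = ss_xy / ss_xx, ss_xy / ss_yy         # b = 2.0, b' = 0.5
assert abs(b * bp - 1.0) < 1e-12             # eq. (24): r^2 = b b' = 1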


The correlation coefficient is independent of both origin and scale, so

\begin{displaymath}
r(u,v) = r(x,y),
\end{displaymath} (25)

where
\begin{displaymath}
u \equiv {x-x_0\over h}
\end{displaymath} (26)

\begin{displaymath}
v \equiv {y-y_0\over k}
\end{displaymath} (27)

for any constants $x_0$, $y_0$ and any nonzero scales $h$, $k$.
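Equation (25) can be confirmed numerically; in the sketch below the shifts $x_0$, $y_0$ and scales $h$, $k$ are arbitrary made-up values:

def r2(x, y):
    # r^2 from equation (9)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xx = sum(xi * xi for xi in x) - n * xbar**2
    ss_yy = sum(yi * yi for yi in y) - n * ybar**2
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    return ss_xy**2 / (ss_xx * ss_yy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x0, y0, h, k = 10.0, -3.0, 2.5, 0.4          # arbitrary origin shifts and scales
u = [(xi - x0) / h for xi in x]
v = [(yi - y0) / k for yi in y]
assert abs(r2(u, v) - r2(x, y)) < 1e-12      # eq. (25)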


See also Correlation Index, Correlation Coefficient--Gaussian Bivariate Distribution, Correlation Ratio, Least Squares Fitting, Regression Coefficient


References

Acton, F. S. Analysis of Straight-Line Data. New York: Dover, 1966.

Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, 1993.

Kenney, J. F. and Keeping, E. S. ``Linear Regression and Correlation.'' Ch. 15 in Mathematics of Statistics, Pt. 1, 3rd ed. Princeton, NJ: Van Nostrand, pp. 252-285, 1962.

Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; and Vetterling, W. T. ``Linear Correlation.'' §14.5 in Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd ed. Cambridge, England: Cambridge University Press, pp. 630-633, 1992.




© 1996-9 Eric W. Weisstein
1999-05-25