Why we need the intercept

This is the first article in the series on non-hierarchical regression models. In this article we discuss the pitfalls of performing a regression without the intercept term.

A common question is whether the intercept term may be removed from a regression analysis if it is not significant. Most of the time the answer to this question should be "No!", and I totally agree with that answer. But of course we would like to understand why!

We will use the following terminology:

A full model is a model containing the intercept and main effects: $y = b_{0} + b_{1} \cdot x + ϵ$
A no-intercept-model reduces the full model by its intercept: $y = b_{1} \cdot x + ϵ$
The intercept-only-model (often: baseline model) uses only the intercept to explain the data: $y = b_{0} + ϵ$
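To make the three models concrete, here is a minimal sketch of their closed-form least-squares fits for a single predictor. The function names and the small data set are invented for illustration; they are not taken from the article's example.

```python
# Closed-form least-squares fits for the three models (one predictor).
# All names and data here are illustrative assumptions.

def fit_full(x, y):
    """Return (b0, b1) for the full model y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def fit_no_intercept(x, y):
    """Return b1 for the no-intercept model y = b1*x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def fit_intercept_only(y):
    """Return b0 for the intercept-only model; this is the mean of y."""
    return sum(y) / len(y)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # generated exactly as y = 1 + 2*x, no noise

b0_full, b1_full = fit_full(x, y)
b1_origin = fit_no_intercept(x, y)
b0_only = fit_intercept_only(y)

print(f"full model:           b0 = {b0_full:.3f}, b1 = {b1_full:.3f}")
print(f"no-intercept model:   b1 = {b1_origin:.3f}")
print(f"intercept-only model: b0 = {b0_only:.3f}")
```

Even for noise-free data generated with intercept 1 and slope 2, the no-intercept fit returns a slope of about 2.33, because the line is forced through the origin.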

Problems of no-intercept-models

$R^{2}$ is not useful any more.
The slope estimators might be biased.

Explanation

The first commonly mentioned problem (see [1.]) is that the $R^{2}$ statistic is no longer useful if the intercept is not included in a linear regression model. Usually $R^{2}$ is interpreted as the proportion of the variation in the response that is explained by the model.

R^{2} = \frac{\text{Model SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}

More precisely: the full model is compared to a reduced model, which in this case is the intercept-only-model. So what do we do for a no-intercept-model? We cannot compare it to the intercept-only-model: neither model is nested in the other, so there is no sense in comparing them. Instead most software packages (R, JMP, SAS, DX, SPSS, ...) compare the model without intercept to a reference model of lower order, that is, a model with no intercept and no other effects.

Noise-Model : y = ϵ

One might call it a noise-model. Of course this is no real model (it does not explain anything), and any comparison with it is not very useful. In fact there is no real reference model with which we could compare our no-intercept-model. So there is no interpretable $R^{2}$ for models without intercept.

Nonetheless, statistics software will present an $R^{2}$ for no-intercept models. But as the interpretation of $R^{2}$ is lost, we cannot use it to evaluate the model quality. One can even show that the $R^{2}$ of a no-intercept model will often be higher than the $R^{2}$ of a full model (see the mathematical details for that).

For the example presented in the graph below, the $R^{2}$ values (calculated with R) are:

Model	$R^{2}$
Full Model	0.7846
No-Intercept-Model	0.9114

It is obvious that this is not a reasonable result. The red line is clearly not the better model!
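The same effect can be reproduced with a few lines of code. The sketch below uses made-up data (not the data behind the graph) and assumes that the no-intercept $R^{2}$ is computed against the uncentered sum of squares of the response, as described above.

```python
# Made-up data with a large mean response relative to its spread.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 11.0, 13.0, 14.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Full model y = b0 + b1*x, compared against the intercept-only model.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
rss_full = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r2_full = 1 - rss_full / sum((yi - y_bar) ** 2 for yi in y)

# No-intercept model y = b*x, compared against the noise model y = eps.
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
rss_origin = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
r2_origin = 1 - rss_origin / sum(yi ** 2 for yi in y)

print(f"full model:         RSS = {rss_full:.2f}, R^2 = {r2_full:.4f}")
print(f"no-intercept model: RSS = {rss_origin:.2f}, R^2 = {r2_origin:.4f}")
```

For these numbers the no-intercept model has a far larger residual sum of squares (about 80.5 versus 1.9) and still reports the larger $R^{2}$ (about 0.89 versus 0.81).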

The second problem is that the least squares estimators for the slopes in a no-intercept model are biased (systematically shifted towards larger or smaller values) whenever the true intercept is not zero.

Figure: usual regression (teal) and regression without intercept (red)

By removing the intercept from the model we impose a restriction: the regression line must go through the origin $(0; 0)$. The graph shows what happens to the regression line. The teal line is the usual regression line, the red line is the no-intercept regression line. It is heavily pulled down because it has to go through $(0; 0)$.

Mathematical Details

$R^{2}$ does not work for no-intercept-models

So what exactly happens to $R^{2}$ ? As noted above the typical $R^{2}$ is defined as: $R^{2} = \frac{\sum_{i} ({\hat{y}}_{i} - \bar{y})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}} = 1 - \frac{\sum_{i} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}}$

Formulate the $R_{0}^{2}$ for a no-intercept-model $y = b \cdot x$ :

R_{0}^{2} = \frac{\sum_{i} {\hat{y}}_{i}^{2}}{\sum_{i} y_{i}^{2}} = 1 - \frac{\sum_{i} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i} y_{i}^{2}}

At the end of the last equation you see that we compare the Residual-Sum-of-Squares with the sum-of-squares of the actual observations. So this is a ratio of the variation of the data around the model (Residual sum of squares) and the magnitude of the response.
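The two forms of $R_{0}^{2}$ agree because the residuals of a no-intercept fit are orthogonal to the fitted values. A small numerical check, using invented data, might look like this. It also shows that the usual centered formulas stop agreeing with each other once the intercept is dropped.

```python
# Invented data; the no-intercept model y = b*x is fitted by least squares.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 11.0, 13.0, 14.0]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
fitted = [b * xi for xi in x]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Uncentered version: both forms of R_0^2 coincide.
ss_y = sum(yi ** 2 for yi in y)
form_a = sum(fi ** 2 for fi in fitted) / ss_y
form_b = 1 - rss / ss_y

# Orthogonality of fitted values and residuals (zero up to rounding).
cross = sum(fi * (yi - fi) for yi, fi in zip(y, fitted))

# Centered version applied to the same no-intercept fit: the two forms differ.
y_bar = sum(y) / len(y)
tss = sum((yi - y_bar) ** 2 for yi in y)
centered_a = sum((fi - y_bar) ** 2 for fi in fitted) / tss
centered_b = 1 - rss / tss

print(f"uncentered: {form_a:.4f} vs {form_b:.4f}, cross term = {cross:.2e}")
print(f"centered:   {centered_a:.4f} vs {centered_b:.4f}")
```

With these data the centered forms are not even inside the interval from 0 to 1, which is one way to see why software does not report them for no-intercept models.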

Apart from not really being interpretable, $R_{0}^{2}$ is often larger than $R^{2}$. This often leads to the wrong (!) conclusion that the model without intercept explains the data better than the full model. So how can $R_{0}^{2}$ be greater than $R^{2}$? Usually we expect $R^{2}$ to increase whenever we add more parameters to the model. Now we reduce the model and $R^{2}$ rises?

Use ${\tilde{y}}_{i}$ for the fitted values of the no-intercept model and ${\hat{y}}_{i}$ for the fitted values of the full model. Then $R_{0}^{2}$ is greater than $R^{2}$ whenever:

R_{0}^{2} = 1 - \frac{\sum_{i} (y_{i} - {\tilde{y}}_{i})^{2}}{\sum_{i} y_{i}^{2}} > R^{2} = 1 - \frac{\sum_{i} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}}

\Rightarrow \frac{\sum_{i} (y_{i} - {\tilde{y}}_{i})^{2}}{\sum_{i} y_{i}^{2}} < \frac{\sum_{i} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}}

\Rightarrow \frac{\lVert y - \tilde{y} \rVert_{2}^{2}}{\lVert y - \hat{y} \rVert_{2}^{2}} < \frac{\lVert y \rVert_{2}^{2}}{\lVert y - \bar{y} \rVert_{2}^{2}}

Now use $\lVert y \rVert_{2}^{2} = \lVert y - \bar{y} + \bar{y} \rVert_{2}^{2} = \lVert y - \bar{y} \rVert_{2}^{2} + n {\bar{y}}^{2}$ (the cross term vanishes because $\sum_{i} (y_{i} - \bar{y}) = 0$). Then:

\frac{\lVert y - \tilde{y} \rVert_{2}^{2}}{\lVert y - \hat{y} \rVert_{2}^{2}} < \frac{\lVert y - \bar{y} \rVert_{2}^{2} + n {\bar{y}}^{2}}{\lVert y - \bar{y} \rVert_{2}^{2}}

\frac{\lVert y - \tilde{y} \rVert_{2}^{2}}{\lVert y - \hat{y} \rVert_{2}^{2}} < 1 + \frac{{\bar{y}}^{2}}{\frac{1}{n} \lVert y - \bar{y} \rVert_{2}^{2}}

The left-hand side is always at least 1, because the residual sum of squares of the no-intercept model $\tilde{y}$ can never be smaller than that of the full model $\hat{y}$ (the no-intercept model is the full model restricted to $b_{0} = 0$). The last term of the right-hand side is large if the squared mean response is much greater than the variance of the response. So $R_{0}^{2}$ tends to be larger than $R^{2}$ whenever the mean of the response $\bar{y}$ is much larger in magnitude than the standard deviation of the response $\sqrt{\frac{1}{n} \sum_{i} (y_{i} - \bar{y})^{2}}$ (forget about the $\frac{1}{n - 1}$ for being unbiased :-)).
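The condition can be checked numerically. The following sketch evaluates both sides of the last inequality for two invented responses: one with a mean far from zero, and the same response shifted so that its mean is exactly zero. The helper `compare` is a hypothetical name introduced only for this illustration.

```python
def compare(x, y):
    """Return (lhs, rhs, r2_full, r2_origin) for a single predictor x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss_full = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    rss_origin = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

    tss = sum((yi - y_bar) ** 2 for yi in y)
    lhs = rss_origin / rss_full            # ratio of residual sums of squares
    rhs = 1 + y_bar ** 2 / (tss / n)       # 1 + squared mean / variance
    r2_full = 1 - rss_full / tss
    r2_origin = 1 - rss_origin / sum(yi ** 2 for yi in y)
    return lhs, rhs, r2_full, r2_origin

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_far = [10.0, 12.0, 11.0, 13.0, 14.0]   # mean 12, far from zero
y_near = [yi - 12.0 for yi in y_far]     # same shape, mean exactly 0

for label, y in (("mean far from 0", y_far), ("mean equal to 0", y_near)):
    lhs, rhs, r2_full, r2_origin = compare(x, y)
    print(f"{label}: lhs = {lhs:.2f}, rhs = {rhs:.2f}, "
          f"R^2 = {r2_full:.4f}, R0^2 = {r2_origin:.4f}")
```

In the first case the left-hand side stays below the right-hand side and $R_{0}^{2}$ exceeds $R^{2}$. After shifting the response the right-hand side drops to 1 and the ordering flips, although the full model fits equally well in both cases.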

Proof that LS-estimator is biased in no-intercept models

Say the true data-generating process is $y_{i} = b_{0} + b \cdot x_{i} + ϵ_{i}$, but our estimated model is $y_{i} = b \cdot x_{i} + ϵ_{i}$. Writing $\mathbf{1}$ for the vector of ones, we get the expectation of the least squares estimator $\hat{b}$:

E [\hat{b}] = E [(X^{T} X)^{- 1} X^{T} y]

= E [(X^{T} X)^{- 1} X^{T} (\mathbf{1} b_{0} + X b + ϵ)]

= E [(X^{T} X)^{- 1} X^{T} \mathbf{1} b_{0} + \underbrace{(X^{T} X)^{- 1} X^{T} X}_{= I} b + (X^{T} X)^{- 1} X^{T} ϵ]

If $E [ϵ] = 0$ as assumed in linear regression:

= E [(X^{T} X)^{- 1} X^{T} \mathbf{1} b_{0}] + b + 0

So $\hat{b}$ is biased unless $(X^{T} X)^{- 1} X^{T} \mathbf{1}$ or $b_{0}$ is equal to 0. With a single regressor the bias term is $b_{0} \sum_{i} x_{i} / \sum_{i} x_{i}^{2}$, which vanishes only if $b_{0} = 0$ or the $x_{i}$ sum to zero. Note that this applies even when the estimated $b_{0}$ is not significantly different from 0!
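The bias can also be seen in a small simulation. The sketch below assumes a fixed design $x = 1, \dots, 10$, normally distributed noise, and arbitrarily chosen true values $b_{0} = 5$ and $b = 2$. None of these settings come from the article. It compares the average no-intercept slope with the value predicted by the derivation for a single regressor, $b + b_{0} \sum_{i} x_{i} / \sum_{i} x_{i}^{2}$.

```python
import random

def slope_no_intercept(x, y):
    """Least-squares slope of y = b*x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def slope_full(x, y):
    """Least-squares slope of y = b0 + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(0)                  # fixed seed for reproducibility
b0_true, b_true, sigma = 5.0, 2.0, 1.0  # assumed true parameters
x = [float(i) for i in range(1, 11)]    # fixed design
n_sim = 5000

origin_slopes, full_slopes = [], []
for _ in range(n_sim):
    y = [b0_true + b_true * xi + rng.gauss(0.0, sigma) for xi in x]
    origin_slopes.append(slope_no_intercept(x, y))
    full_slopes.append(slope_full(x, y))

mean_origin = sum(origin_slopes) / n_sim
mean_full = sum(full_slopes) / n_sim
predicted = b_true + b0_true * sum(x) / sum(xi * xi for xi in x)

print(f"true slope:                     {b_true:.3f}")
print(f"mean slope, full model:         {mean_full:.3f}")
print(f"mean slope, no-intercept model: {mean_origin:.3f}")
print(f"predicted by the bias formula:  {predicted:.3f}")
```

Under these assumptions the full-model slope averages close to the true value 2, while the no-intercept slope averages close to the predicted value of about 2.71.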

When to use a no-intercept regression

Basically there is only one reason to perform a regression without using the intercept: Whenever your model is used to describe a process which is known to have a zero-intercept. Examples will be presented in the last article of this series.

So stay tuned!

Literature

[1.] $R^{2}$ -problem on CrossValidated: Link.

[2.] William Greene: Econometrics (Link; bias of the OLS estimator for omitted variables, Section 4.3.2)

Author: Sebastian Hoffmeister

STATCON GmbH
