What is the AIC?

The AIC, or Akaike Information Criterion, is an estimator of prediction error, or in other words, of the relative quality of a statistical model. It is calculated using the following formula:

\[ AIC = 2k - 2\textrm{ln}(\hat{L}) \]

where \(k\) is the number of parameters in the model (e.g. the number of variables you have chosen to include in your model as predictors) and \(\hat{L}\) is the maximum value of the likelihood function for the model. We won’t go into too much detail regarding \(\hat{L}\); all we need to know here is that \(\hat{L}\) is a measure of goodness of fit for the model, and that when comparing models we aim for the model with the largest maximum likelihood.
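To make the formula concrete, here is a minimal sketch in Python (the function name `aic` and its arguments are my own, chosen for illustration, not part of any particular library):

```python
import math

def aic(k: int, L_hat: float) -> float:
    """AIC = 2k - 2 ln(L-hat) for a model with k parameters and
    maximum likelihood value L_hat."""
    return 2 * k - 2 * math.log(L_hat)
```

In practice you rarely compute \(\hat{L}\) by hand: most statistical software reports the maximised log-likelihood (and usually the AIC itself) as part of a fitted model’s output.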

Interpreting the AIC

Let’s break down the equation for the AIC: the \(2k\) term grows as we add parameters to the model, while the \(-2\textrm{ln}(\hat{L})\) term shrinks as the maximum likelihood increases.

Note: The likelihood function can take any value strictly greater than 0, so the log likelihood always exists. The log likelihood can be negative, which makes \(-2\textrm{ln}(\hat{L})\) positive. It is also possible for \(-2\textrm{ln}(\hat{L})\) to be negative (when \(\hat{L} > 1\)) with magnitude larger than \(2k\), so in some cases the AIC itself can be negative.

Thus the AIC penalises models with more parameters (discouraging overfitting) but rewards goodness of fit. For example, consider the following scenarios:

Model 1: \(k = 5, \hat{L} = 3 \implies AIC = 7.8\)
Model 2: \(k = 2, \hat{L} = 3 \implies AIC = 1.8\)

Here, if we choose the model which minimises AIC, we choose Model 2, which has fewer parameters.

Model 1: \(k = 5, \hat{L} = 3 \implies AIC = 7.8\)
Model 2: \(k = 5, \hat{L} = 5 \implies AIC = 6.8\)

Here, if we choose the model which minimises AIC, we choose Model 2, which has the larger of the two maximum likelihoods.
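Both scenarios can be reproduced numerically with the illustrative `aic` helper from above (redefined here so the snippet runs on its own):

```python
import math

def aic(k, L_hat):
    # Illustrative helper from above: AIC = 2k - 2 ln(L-hat)
    return 2 * k - 2 * math.log(L_hat)

# Scenario 1: equal fit, different complexity
print(f"Model 1: {aic(5, 3):.1f}")  # Model 1: 7.8
print(f"Model 2: {aic(2, 3):.1f}")  # Model 2: 1.8 (fewer parameters wins)

# Scenario 2: equal complexity, different fit
print(f"Model 1: {aic(5, 3):.1f}")  # Model 1: 7.8
print(f"Model 2: {aic(5, 5):.1f}")  # Model 2: 6.8 (better fit wins)
```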

Using AIC for Model Selection

It is true that the general rule of thumb for using AIC for model selection is “the smaller the better”. But how much smaller is meaningfully smaller? What if there is a much simpler model with a slightly larger AIC than the “best” model? How can we judge how much worse that model is from the AIC alone?

This paper by Burnham and Anderson (2004):

http://faculty.washington.edu/skalski/classes/QERM597/papers_xtra/Burnham and Anderson.pdf

suggests considering the strength of the other models based on the difference in AIC between each model and the “best” (lowest-AIC) model. We define this difference as

\[ \Delta_i = AIC_i - AIC_\textrm{min} \]

where \(AIC_i\) is the AIC for the \(i\)th model and \(AIC_\textrm{min}\) is the lowest AIC across all models.
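The differences themselves are trivial to compute; here is a short sketch (the name `aic_deltas` is my own), applied to the three AIC values from the scenarios above:

```python
def aic_deltas(aics):
    """Difference between each model's AIC and the smallest AIC."""
    aic_min = min(aics)
    return [a - aic_min for a in aics]

# AIC values from the earlier scenarios: 7.8, 1.8 and 6.8
print(aic_deltas([7.8, 1.8, 6.8]))  # [6.0, 0.0, 5.0]
```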

Burnham and Anderson then suggest the following rule of thumb:

\(\Delta_i \leq 2\): there is substantial support for model \(i\),
\(4 \leq \Delta_i \leq 7\): there is considerably less support for model \(i\),
\(\Delta_i > 10\): there is essentially no support for model \(i\).

Thus, for example, if Model 2 is the “best” model but Model 1 has an AIC only 1 larger than the smallest AIC (so \(\Delta_1 = 1 \leq 2\)), then there is reasonable evidence to suggest that Model 1 would perform just as well.
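The rule of thumb can be encoded as a simple lookup. Note that the published bands leave gaps (between 2 and 4, and between 7 and 10); the cut-offs chosen below for those gaps are my own assumption, not Burnham and Anderson’s:

```python
def support_level(delta):
    """Rough support categories following Burnham & Anderson (2004).
    Treating 2-4 as 'considerably less' and 7-10 as 'essentially no'
    support is an assumption; the paper leaves those gaps open."""
    if delta <= 2:
        return "substantial support"
    elif delta <= 7:
        return "considerably less support"
    else:
        return "essentially no support"

for delta in [0.0, 1.0, 5.0, 12.0]:
    print(f"delta = {delta}: {support_level(delta)}")
```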