What is the AIC?

The AIC, or Akaike Information Criterion, is an estimator of prediction error, or in other words, of the relative quality of a statistical model. It is calculated using the following formula:

\[ AIC = 2k - 2\textrm{ln}(\hat{L}) \]

where \(k\) is the number of parameters in the model (e.g. the number of variables you have chosen to include in your model as predictors) and \(\hat{L}\) is the maximum value of the likelihood function for the model. We won’t go into too much detail regarding \(\hat{L}\); all we need to know here is that \(\hat{L}\) is a measure of goodness of fit for the model, and that when comparing models we aim for the model with the largest maximum likelihood.
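To make the formula concrete, here is a minimal sketch in Python (the function name `aic` and its arguments are my own, chosen for illustration, not part of any particular library):

```python
import math

def aic(k: int, L_hat: float) -> float:
    """AIC = 2k - 2 ln(L-hat) for a model with k parameters and
    maximum likelihood value L_hat."""
    return 2 * k - 2 * math.log(L_hat)
```

In practice you rarely compute \(\hat{L}\) by hand: most statistical software reports the maximised log-likelihood (and usually the AIC itself) as part of a fitted model’s output.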

Interpreting the AIC

Let’s break down the equation for the AIC: the \(2k\) term grows as we add parameters to the model, while the \(-2\textrm{ln}(\hat{L})\) term shrinks as the maximum likelihood increases.

Note: The likelihood function can take any value strictly greater than 0, so the log likelihood always exists. The log likelihood can be negative, which makes \(-2\textrm{ln}(\hat{L})\) positive. It is also possible for \(-2\textrm{ln}(\hat{L})\) to be negative (when \(\hat{L} > 1\)) with magnitude larger than \(2k\), so in some cases the AIC itself can be negative.

Thus the AIC penalises models with more parameters (discouraging overfitting) but rewards goodness of fit. For example, consider the following scenarios:

Model 1: \(k = 5, \hat{L} = 3 \implies AIC = 7.8\)
Model 2: \(k = 2, \hat{L} = 3 \implies AIC = 1.8\)

Here, if we choose the model which minimises AIC, we choose Model 2, which has fewer parameters.

Model 1: \(k = 5, \hat{L} = 3 \implies AIC = 7.8\)
Model 2: \(k = 5, \hat{L} = 5 \implies AIC = 6.8\)

Here, if we choose the model which minimises AIC, we choose Model 2, which has the larger of the two maximum likelihoods.
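Both scenarios can be reproduced numerically with the illustrative `aic` helper from above (redefined here so the snippet runs on its own):

```python
import math

def aic(k, L_hat):
    # Illustrative helper from above: AIC = 2k - 2 ln(L-hat)
    return 2 * k - 2 * math.log(L_hat)

# Scenario 1: equal fit, different complexity
print(f"Model 1: {aic(5, 3):.1f}")  # Model 1: 7.8
print(f"Model 2: {aic(2, 3):.1f}")  # Model 2: 1.8 (fewer parameters wins)

# Scenario 2: equal complexity, different fit
print(f"Model 1: {aic(5, 3):.1f}")  # Model 1: 7.8
print(f"Model 2: {aic(5, 5):.1f}")  # Model 2: 6.8 (better fit wins)
```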

Using AIC for Model Selection

It is true that the general rule of thumb for using AIC for model selection is “the smaller the better”. But how much smaller is meaningfully smaller? What if there is a much simpler model with a slightly larger AIC than the “best” model? How can we judge how much worse that model is from the AIC alone?

This paper by Burnham and Anderson (2004):

http://faculty.washington.edu/skalski/classes/QERM597/papers_xtra/Burnham and Anderson.pdf

suggests considering the strength of the other models based on the difference in AIC between each model and the “best” (lowest-AIC) model. We define this difference as

\[ \Delta_i = AIC_i - AIC_\textrm{min} \]

where \(AIC_i\) is the AIC for the \(i\)th model and \(AIC_\textrm{min}\) is the lowest AIC across all models.
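The differences themselves are trivial to compute; here is a short sketch (the name `aic_deltas` is my own), applied to the three AIC values from the scenarios above:

```python
def aic_deltas(aics):
    """Difference between each model's AIC and the smallest AIC."""
    aic_min = min(aics)
    return [a - aic_min for a in aics]

# AIC values from the earlier scenarios: 7.8, 1.8 and 6.8
print(aic_deltas([7.8, 1.8, 6.8]))  # [6.0, 0.0, 5.0]
```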

Burnham and Anderson then suggest the following rule of thumb:

\(\Delta_i \leq 2\): there is substantial support for model \(i\),
\(4 \leq \Delta_i \leq 7\): there is considerably less support for model \(i\),
\(\Delta_i > 10\): there is essentially no support for model \(i\).

Thus, for example, if Model 2 is the “best” model but Model 1 has an AIC only 1 larger than the smallest AIC (so \(\Delta_1 = 1 \leq 2\)), then there is reasonable evidence to suggest that Model 1 would perform just as well.
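The rule of thumb can be encoded as a simple lookup. Note that the published bands leave gaps (between 2 and 4, and between 7 and 10); the cut-offs chosen below for those gaps are my own assumption, not Burnham and Anderson’s:

```python
def support_level(delta):
    """Rough support categories following Burnham & Anderson (2004).
    Treating 2-4 as 'considerably less' and 7-10 as 'essentially no'
    support is an assumption; the paper leaves those gaps open."""
    if delta <= 2:
        return "substantial support"
    elif delta <= 7:
        return "considerably less support"
    else:
        return "essentially no support"

for delta in [0.0, 1.0, 5.0, 12.0]:
    print(f"delta = {delta}: {support_level(delta)}")
```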