This was motivated by an example from Norm Matloff’s book on regression, “Statistical Regression and Classification”, cited in this Tweet. That example showed that a linear model can sometimes predict count data better than a Poisson model. What I want to show here is that this is partly due to an implicit assumption about functional shape: under a log link, a positive linear relationship between a predictor and the mean implies that the mean increases exponentially with that predictor. However, I argue that it’s better to relax the linearity assumption inside the Poisson model than to play with link functions as an indirect way of changing the assumed functional shape.
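To make that implicit-shape point concrete, here is a minimal base-R sketch (the coefficients are illustrative, not fitted to any data):

```r
# Under a log link, the mean is mu = exp(b0 + b1 * x), so a positive
# linear coefficient b1 implies multiplicative (exponential) growth in mu.
x   <- 0:5
eta <- 0.1 + 0.5 * x   # linear predictor with illustrative coefficients
mu  <- exp(eta)        # implied mean under a log link

# Successive means grow by a constant factor exp(0.5) ~ 1.65:
round(mu[-1] / mu[-length(mu)], 2)
## [1] 1.65 1.65 1.65 1.65 1.65
```

A one-unit increase in the predictor always multiplies the mean by the same factor, which is exactly the exponential shape assumption being discussed.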

Getting set up

Loading packages. The pima data is in the pdp package, and mgcv provides GAMs with nonlinear functional relationships.

library(mgcv)
library(pdp)

Also, I’ll set the number of digits to be printed to a reasonable value for comparing \(R^2\) values.

options(digits=2)

Note: I am using the pima data set only for comparison with the example from “Statistical Regression and Classification”. This data set has been withdrawn from the UCI Machine Learning Repository because the people it describes did not consent to having their personal medical histories made available for testing machine learning approaches (see this article for context). I cannot remove the data from the multiple R packages it appears in, but I do not recommend using it for other example analyses.

View the data

I’ll take a look at the pima data. Note that “insulin”, the 2-hour insulin serum levels, has many NAs, so I’ll exclude it as a predictor from the linear models.

head(pima)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA   34     0.63  50      pos
## 2        1      85       66      29      NA   27     0.35  31      neg
## 3        8     183       64      NA      NA   23     0.67  32      pos
## 4        1      89       66      23      94   28     0.17  21      neg
## 5        0     137       40      35     168   43     2.29  33      pos
## 6        5     116       74      NA      NA   26     0.20  30      neg

Note that this differs from the data set used in Norm’s example: in the data set used in his book, insulin values of NA were set equal to zero. I’ll stick with treating them as NAs here.

Fitting linear models

I’ll fit two models, to be consistent with those fit in Norm Matloff’s book. The first will be a Poisson regression, and the second will be a model with a linear (identity) link that handles heteroskedasticity through a quasi family (assuming the variance is proportional to the square of the mean).

Poisson model:

This is the basic Poisson glm with a log-link function, predicting pregnancy number as a function of the predictor variables. Note that I’m using the gam function here, but since all of the terms are fixed linear effects, this is really just fitting a standard glm. The benefit of gam, though, is that it reports an \(R^2\) as part of its summary (glm does not).

preg_pois <- gam(pregnant~age+glucose+pressure+triceps+mass+pedigree+diabetes,
                 data=pima, 
                 family= poisson,
                 na.action = na.exclude)

Quasi model:

preg_quasi <- gam(pregnant~age+glucose+pressure+triceps+mass+pedigree+diabetes,
                  data=pima, 
                  family= quasi(variance = "mu^2"),
                  na.action = na.exclude)

Comparing predictions

These models end up with different estimated effects and patterns of significance: the Poisson model found that glucose, age, and diabetes state were significant predictors of number of pregnancies, whereas the quasi model found that triceps (skin fold thickness), mass, and age were all significant predictors. However, what we are interested in is the predictive power of the two models. Let’s then compare their \(R^2\) values:

print(summary(preg_pois)$r.sq)
## [1] 0.29
print(summary(preg_quasi)$r.sq)
## [1] 0.36

As in the example from the book, the \(R^2\) for the quasi model is higher than the Poisson model.
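As a cross-check on mgcv’s adjusted \(R^2\), we can also compute the plain squared correlation between the observed counts and the fitted means on the response scale. This is my own sanity check, not from the book, and it is an unadjusted yardstick:

```r
# Squared correlation between observed and fitted values, response scale.
# model$y and model$fitted.values are aligned on the rows mgcv actually used.
r2_fit <- function(model) cor(model$y, model$fitted.values)^2

r2_fit(preg_pois)
r2_fit(preg_quasi)
```

The exact values will differ slightly from the adjusted \(R^2\) above, but the ordering of the two models should be comparable.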

Nonlinear models

Both models found age to be a significant positive predictor of the number of pregnancies, which should be unsurprising for very obvious reasons. However, biologically it does not make sense for the mean number of pregnancies to increase either linearly or exponentially with age; instead, we would expect a saturating function: at some point, people become too old to have more children. This is a natural application for a GAM. We’ll refit both models with a smooth term for age and compare:

preg_pois_nonlin <- gam(pregnant~s(age)+glucose+pressure+triceps+mass+pedigree+diabetes, 
                        data=pima, 
                        family= poisson,
                        na.action = na.exclude)

preg_quasi_nonlin <- gam(pregnant~s(age)+glucose+pressure+triceps+mass+pedigree+diabetes, 
                        data=pima, 
                        family= quasi(variance = "mu^2"),
                        na.action = na.exclude)

We can plot the nonlinear functional response from both of these models (note that for the Poisson model, the curve shows the fitted relationship on the link scale).

par(mfrow=c(1,2))

plot(preg_pois_nonlin, main = "Poisson model")
plot(preg_quasi_nonlin, main = "Quasi model")

Interestingly, the Poisson model estimates a saturating functional response, whereas the quasi model still estimates an essentially linear function. I’m not sure why that happens in this case (although the quasi model produces a number of warnings during fitting).
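One way to probe this is to look at the estimated degrees of freedom (edf) of the age smooth in each model: an edf near 1 means the smooth has been penalized down to (nearly) a straight line, while larger values indicate more wiggliness. A quick check (values not shown here, since they depend on the fits above):

```r
# edf of the s(age) term in each model; ~1 indicates an effectively
# linear fit, larger values indicate a more flexible curve
summary(preg_pois_nonlin)$edf
summary(preg_quasi_nonlin)$edf
```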

We can compare the \(R^2\) of both models easily. Let’s see which does better:

print(summary(preg_pois_nonlin)$r.sq)
## [1] 0.5
print(summary(preg_quasi_nonlin)$r.sq)
## [1] 0.38

The Poisson model has a substantially higher \(R^2\) than either of the linear models or the nonlinear quasi model. Here we can see that adding an appropriate nonlinear term, when biologically justified, resulted in a substantial increase in (in-sample) predictive ability.
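Since all of these comparisons are in-sample, a fairer test would use held-out data. Here is a hedged sketch of 5-fold cross-validation on the complete cases, comparing the linear and nonlinear Poisson models; this is my own addition, not from the book, and the fold count and seed are arbitrary:

```r
set.seed(1)
vars <- c("pregnant", "age", "glucose", "pressure",
          "triceps", "mass", "pedigree", "diabetes")
dat   <- na.omit(pima[, vars])                      # complete cases only
folds <- sample(rep(1:5, length.out = nrow(dat)))   # random fold labels

# Out-of-sample squared correlation between observed and predicted counts
cv_r2 <- function(form) {
  preds <- rep(NA_real_, nrow(dat))
  for (k in 1:5) {
    fit <- gam(form, data = dat[folds != k, ], family = poisson)
    preds[folds == k] <- predict(fit, dat[folds == k, ], type = "response")
  }
  cor(dat$pregnant, preds)^2
}

cv_r2(pregnant ~ age + glucose + pressure + triceps + mass + pedigree + diabetes)
cv_r2(pregnant ~ s(age) + glucose + pressure + triceps + mass + pedigree + diabetes)
```

If the nonlinear model’s advantage holds up out of sample, that would strengthen the conclusion that the saturating age effect is real structure rather than overfitting.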