# Logistic Regression in R with Single-Trial Data

The examples in this post are taken from a class I took at the University of Missouri. They were originally provided in SAS, and I translated them to R.

The relationship between countries’ credit ratings and the volatility of the countries’ stock markets was examined in the *Journal of Portfolio Management* (Spring 1996). Our interest lies in whether volatility (standard deviation of stock returns) and credit rating (in percent) can be used to classify a country into one of the two market types (developed or emerging).

## Descriptive analysis

The data are read into a `data.frame` with the response as a factor. It is customary, and preferable, to store a classification variable as a factor and let R handle the rest. When a factor is created from character data, as in this example, the default order of its levels is alphabetical, but the order can be changed explicitly. `glm()` always models the second level of the response factor as the event, so here we reorder the levels so that `glm()` models the probability of a country being developed.
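As a minimal illustration of the level ordering (on a toy vector, not the data below):

```r
# Toy illustration: levels of a factor built from character data default
# to alphabetical order, and can be reordered explicitly.
m <- factor(c("emerge", "develop", "emerge"))
levels(m)                                        # "develop" "emerge"
m <- factor(m, levels = c("emerge", "develop"))
levels(m)                                        # "emerge" "develop"
```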

```
volatile <- read.table("datasets/volatile.dat")
names(volatile) <- c('country/region', 'volatile', 'credit', 'market')
volatile$market <- factor(volatile$market, levels = c('emerge', 'develop'))
kable(volatile)
```

country/region | volatile | credit | market |
---|---|---|---|
Afghanistan | 55.7 | 8.3 | emerge |
Australia | 23.9 | 71.2 | develop |
China | 27.2 | 57.0 | emerge |
Cuba | 55.0 | 8.7 | emerge |
Germany | 20.3 | 90.9 | develop |
France | 20.6 | 89.1 | develop |
India | 30.3 | 46.1 | emerge |
Belgium | 22.3 | 79.2 | develop |
Canada | 22.1 | 80.3 | develop |
Ethiopia | 47.9 | 14.1 | emerge |
Haiti | 54.9 | 8.8 | emerge |
Japan | 20.2 | 91.6 | develop |
Libya | 36.7 | 30.0 | emerge |
Malaysia | 24.3 | 69.1 | emerge |
Mexico | 31.8 | 41.8 | emerge |
NewZealand | 24.3 | 69.4 | develop |
Nigeria | 46.2 | 15.8 | emerge |
Oman | 28.6 | 51.8 | develop |
Panama | 38.6 | 26.4 | emerge |
Spain | 23.4 | 73.7 | develop |
Sudan | 60.5 | 6.0 | emerge |
Taiwan | 22.2 | 79.9 | develop |
Norway | 21.4 | 84.6 | develop |
Sweden | 23.3 | 74.1 | develop |
Togo | 45.1 | 17.0 | emerge |
Ukraine | 46.3 | 15.7 | emerge |
UnitedKingdom | 20.8 | 87.8 | develop |
UnitedStates | 20.3 | 90.7 | develop |
Vietnam | 36.9 | 29.5 | emerge |
Zimbabwe | 36.2 | 31.0 | emerge |

## Logistic regression

In R, a logistic regression model is fitted with `glm()` using the binomial distribution family. The default link function for the binomial family is the logit link.

```
fit <- glm(market ~ volatile + credit, family = "binomial", data = volatile)
fit
```

```
##
## Call: glm(formula = market ~ volatile + credit, family = "binomial",
## data = volatile)
##
## Coefficients:
## (Intercept) volatile credit
## -11.56273 0.04501 0.17385
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 41.46
## Residual Deviance: 9.254 AIC: 15.25
```
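To make the logit link concrete, here is a sketch on a built-in toy dataset (`mtcars`, not the volatility data) showing that the fitted probabilities are just the inverse logit of the linear predictor:

```r
# Toy fit on built-in data: with the logit link, the fitted probabilities
# equal plogis() applied to the linear predictor.
toy <- glm(am ~ mpg, family = "binomial", data = mtcars)
eta <- coef(toy)[1] + coef(toy)[2] * mtcars$mpg      # linear predictor
all.equal(unname(plogis(eta)), unname(fitted(toy)))  # TRUE
```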

### Model fit results

```
# response profile
summary(volatile$market)
```

```
## emerge develop
## 16 14
```

In order to compute the model fit statistics for an intercept-only model, we can simply fit that model explicitly and use the generic functions.

```
fit2 <- glm(market ~ 1, family = "binomial", data = volatile)
AIC(fit2, fit)
```

```
## df AIC
## fit2 1 43.45540
## fit 3 15.25401
```
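`AIC()` here is just \(-2\log L + 2\,\mathrm{df}\); a quick sketch verifying this on a toy fit with built-in data:

```r
# AIC by hand: -2*logLik plus twice the number of estimated parameters.
toy <- glm(am ~ mpg, family = "binomial", data = mtcars)
manual_aic <- -2 * as.numeric(logLik(toy)) + 2 * attr(logLik(toy), "df")
all.equal(manual_aic, AIC(toy))  # TRUE
```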

```
BIC(fit2, fit)
```

```
## df BIC
## fit2 1 44.85660
## fit 3 19.45761
```

```
-2 * logLik(fit2)
```

```
## 'log Lik.' 41.4554 (df=1)
```

```
-2 * logLik(fit)
```

```
## 'log Lik.' 9.254014 (df=3)
```

In SAS, the same model is fitted using `PROC GENMOD`, which produces some additional output assessing goodness of fit. Here we compute some of those statistics.

```
# Deviance
fit$deviance
```

```
## [1] 9.254014
```

```
fit$df.residual
```

```
## [1] 27
```

```
fit$deviance / fit$df.residual
```

```
## [1] 0.3427413
```

```
# Pearson's Chi-Square
sum(residuals(fit, "pearson")^2)
```

```
## [1] 9.81283
```

```
sum(residuals(fit, "pearson")^2) / fit$df.residual
```

```
## [1] 0.3634381
```

Both ratios are well below 1, so these all show evidence of under-dispersion.
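One way to carry such a dispersion estimate into the standard errors is a quasi-binomial fit; a sketch on a toy fit with built-in data (the dispersion reported by `summary()` is exactly the Pearson chi-square divided by the residual degrees of freedom):

```r
# Toy fit: the quasibinomial dispersion estimate equals the Pearson
# chi-square statistic divided by the residual degrees of freedom.
toy  <- glm(am ~ mpg, family = "binomial", data = mtcars)
phi  <- sum(residuals(toy, "pearson")^2) / toy$df.residual
qtoy <- glm(am ~ mpg, family = "quasibinomial", data = mtcars)
all.equal(phi, summary(qtoy)$dispersion)  # TRUE
```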

### Hypothesis testing

The likelihood ratio statistic is just \(-2\log\Lambda\), so we can use the information above to calculate it.

```
-2 * as.numeric(logLik(fit2) - logLik(fit))
```

```
## [1] 32.20138
```

```
1 - pchisq(-2 * as.numeric(logLik(fit2) - logLik(fit)), df = 2) # p-value
```

```
## [1] 1.017556e-07
```
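The same likelihood ratio test is available directly from `anova()` with `test = "Chisq"` (for the models above, `anova(fit2, fit, test = "Chisq")`); a self-contained sketch on toy nested fits:

```r
# Toy nested fits: the Deviance column of anova() is exactly
# -2 * (logLik(smaller) - logLik(larger)).
toy0 <- glm(am ~ 1, family = "binomial", data = mtcars)
toy1 <- glm(am ~ mpg + hp, family = "binomial", data = mtcars)
a <- anova(toy0, toy1, test = "Chisq")
all.equal(a$Deviance[2], -2 * as.numeric(logLik(toy0) - logLik(toy1)))  # TRUE
```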

For a Wald test of \(H_0: L\theta = 0\), we need to specify \(L\) to compute the test statistic. For example, to test \(H_0: \beta_1 = \beta_2 = 0\), we can set \(L\) in the following way:

```
L <- rbind(c(0, 1, 0), c(0, 0, 1))
L
```

```
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 0 0 1
```

There is a function `wald.test` in the *aod* package. Note that a Wald test requires the variance of the parameter estimates, which is \(a(\phi)(\hat{F}\hat{V}\hat{F})^{-1}\). This matrix can be obtained with the `vcov()` function. It is computed via a QR decomposition, so the result differs slightly from SAS.

```
library(aod)
wald.test(vcov(fit), b = coef(fit), L = L)
```

```
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.0, df = 2, P(> X2) = 0.049
```

In SAS, the analysis of maximum likelihood estimates is done with a Wald test and a Chi-square distribution on 1 degree of freedom. This is the same as a Z-test with a standard normal distribution: the Z statistic is the square root of the Wald Chi-square statistic, and the p-values are identical.
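A quick sketch of that equivalence on a toy fit with built-in data:

```r
# Toy fit: squaring the z value gives the 1-df Wald chi-square, and the
# two-sided normal p-value equals the chi-square upper-tail p-value.
toy <- glm(am ~ mpg, family = "binomial", data = mtcars)
z <- summary(toy)$coefficients["mpg", "z value"]
all.equal(2 * pnorm(-abs(z)), pchisq(z^2, df = 1, lower.tail = FALSE))  # TRUE
```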

### Concordance

Computing the concordant and discordant percentages requires the fitted values. Note that we are modeling the probability of `develop`, because it is the second level of the variable `market`.

```
i <- volatile$market == 'develop'
j <- volatile$market == 'emerge'
pairs <- sum(i) * sum(j)
# expand pairs into grid
grid <- with(volatile,
             cbind(expand.grid(yi = market[i], yj = market[j]),
                   expand.grid(pi = fitted(fit)[i], pj = fitted(fit)[j])))
# Percent Concordant
sum(grid$pi > grid$pj) / pairs
```

```
## [1] 0.9910714
```

```
# Percent Discordant
sum(grid$pi < grid$pj) / pairs
```

```
## [1] 0.008928571
```

```
# Somers' D
nc <- sum(grid$pi > grid$pj)
(nc - (pairs - nc)) / pairs
```

```
## [1] 0.9821429
```

```
# Tau-a
2 * (nc - (pairs - nc)) / (nobs(fit) * (nobs(fit) - 1))
```

```
## [1] 0.5057471
```
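The same calculations can be wrapped in a small helper; a sketch applied to a toy fit on built-in data (the function name is ours, not from the post, and this version counts ties as neither concordant nor discordant):

```r
# Hypothetical helper: concordance summaries for a binary classifier.
# p: fitted probabilities; y: observed response; event: the event level.
concordance_stats <- function(p, y, event) {
  p1 <- p[y == event]            # fitted probabilities for events
  p0 <- p[y != event]            # fitted probabilities for non-events
  pairs <- length(p1) * length(p0)
  nc <- sum(outer(p1, p0, ">"))  # concordant pairs
  nd <- sum(outer(p1, p0, "<"))  # discordant pairs
  c(concordant = nc / pairs, discordant = nd / pairs,
    somers_d = (nc - nd) / pairs)
}
toy <- glm(am ~ mpg, family = "binomial", data = mtcars)
concordance_stats(fitted(toy), mtcars$am, 1)
```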

### Parameter estimates and confidence intervals

```
summary(fit)$coefficients
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.56273217 40.9868601 -0.2821083 0.7778605
## volatile 0.04500712 0.9463449 0.0475589 0.9620678
## credit 0.17384805 0.2632646 0.6603547 0.5090262
```

The odds ratio estimates are obtained by applying the exponential function to the parameter estimates and their confidence intervals.

```
# profile likelihood confidence interval
confint(fit)
```

```
## Waiting for profiling to be done...
```

```
## 2.5 % 97.5 %
## (Intercept) -57.2306037 NA
## volatile NA 0.8767002
## credit -0.8975829 0.5591083
```

```
exp(cbind(OR = coef(fit)[-1], confint(fit)[-1, ])) # odds ratio
```

```
## Waiting for profiling to be done...
```

```
## OR 2.5 % 97.5 %
## volatile 1.046035 NA 2.402957
## credit 1.189875 0.4075536 1.749112
```

```
# Wald confidence intervals
confint.default(fit)
```

```
## 2.5 % 97.5 %
## (Intercept) -91.8955018 68.7700374
## volatile -1.8097948 1.8998090
## credit -0.3421412 0.6898373
```

```
exp(coef(fit)[-1]) # odds ratio
```

```
## volatile credit
## 1.046035 1.189875
```

```
exp(confint.default(fit)[-1, ])
```

```
## 2.5 % 97.5 %
## volatile 0.1636877 6.684618
## credit 0.7102479 1.993391
```
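To connect these numbers to their interpretation, a sketch on a toy fit showing that `exp(coef)` is the multiplicative change in the odds per one-unit increase in a predictor:

```r
# Toy fit: the odds ratio between x and x+1 equals exp(beta) exactly
# under the logit link.
toy  <- glm(am ~ mpg, family = "binomial", data = mtcars)
odds <- function(p) p / (1 - p)
p20  <- predict(toy, data.frame(mpg = 20), type = "response")
p21  <- predict(toy, data.frame(mpg = 21), type = "response")
all.equal(unname(odds(p21) / odds(p20)), unname(exp(coef(toy)["mpg"])))  # TRUE
```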

### Classification table

The SAS output includes a nice classification table that shows the classification results under different cutoffs. Thanks to this answer, R can produce a similar table as well, though there are small differences due to precision (the maximum absolute difference between the two sets of predicted probabilities is 1.890711e-08).

```
getMisclass <- function(cutoff, p, event) {
  pred <- factor(1 * (p > cutoff), labels = levels(event))
  t <- table(pred, event)
  cat("cutoff ", cutoff, ":\n")
  print(t)
  cat("correct :", round(sum(t[c(1, 4)]) / sum(t), 3), "\n")
  cat("Specificity:", round(t[1] / sum(t[, 1]), 3), "\n")
  cat("Sensitivity:", round(t[4] / sum(t[, 2]), 3), "\n\n")
  invisible(t)
}
cutoffs <- seq(0.1, .9, by = .1)
getMisclass(.5, p = fitted(fit), event = volatile$market)
```

```
## cutoff 0.5 :
## event
## pred emerge develop
## emerge 15 1
## develop 1 13
## correct : 0.933
## Specificity: 0.938
## Sensitivity: 0.929
```

You can apply this function across all the cutoffs to get a more thorough output, like SAS.

```
sapply(cutoffs, getMisclass, p = fitted(fit), event = volatile$market)
```
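For a tidier summary than the printed tables, the sweep could also return a data frame; a sketch with our own helper name, shown on a toy fit with built-in data:

```r
# Hypothetical helper: misclassification summary over a grid of cutoffs,
# returning one row per cutoff. y must be coded 0/1 with 1 as the event.
sweep_cutoffs <- function(p, y, cutoffs) {
  do.call(rbind, lapply(cutoffs, function(k) {
    pred <- as.integer(p > k)
    data.frame(cutoff      = k,
               correct     = mean(pred == y),
               sensitivity = mean(pred[y == 1] == 1),
               specificity = mean(pred[y == 0] == 0))
  }))
}
toy <- glm(am ~ mpg, family = "binomial", data = mtcars)
sweep_cutoffs(fitted(toy), mtcars$am, seq(0.1, 0.9, by = 0.1))
```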