# Logistic Regression in R with Single-Trial Data

The examples in this post are taken from a class I took at the University of Missouri. They were originally provided in SAS, and I translated them into R.

The relationship between countries’ credit ratings and the volatility of their stock markets was examined in the Journal of Portfolio Management (Spring 1996). Our interest lies in whether volatility (the standard deviation of stock returns) and credit rating (in percent) can be used to classify a country into one of two market types, developed or emerging.

## Descriptive analysis

The data are read into a data.frame with the response as a factor. It is customary and preferable to store a classification variable as a factor and let R handle the coding. When a factor is created from character data, as in this example, its levels are ordered alphabetically by default, but the order can be changed so that a chosen level comes last. For a two-level response, glm() always models the second level as the event. Therefore we reorder the levels so that glm() models the probability of a country being a developed market.
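If the factor were left with the default alphabetical order, the same coding could also be obtained afterwards with relevel(). This is just an equivalent alternative to the explicit levels = argument used below:

# equivalent to factor(volatile$market, levels = c('emerge', 'develop')) used below
volatile$market <- relevel(volatile$market, ref = "emerge")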

volatile <- read.table("datasets/volatile.dat")
names(volatile) <- c('country/region', 'volatile', 'credit', 'market')
volatile$market <- factor(volatile$market, levels = c('emerge', 'develop'))
kable(volatile)
| country/region | volatile | credit | market  |
|:---------------|---------:|-------:|:--------|
| Afghanistan    |     55.7 |    8.3 | emerge  |
| Australia      |     23.9 |   71.2 | develop |
| China          |     27.2 |   57.0 | emerge  |
| Cuba           |     55.0 |    8.7 | emerge  |
| Germany        |     20.3 |   90.9 | develop |
| France         |     20.6 |   89.1 | develop |
| India          |     30.3 |   46.1 | emerge  |
| Belgium        |     22.3 |   79.2 | develop |
| Canada         |     22.1 |   80.3 | develop |
| Ethiopia       |     47.9 |   14.1 | emerge  |
| Haiti          |     54.9 |    8.8 | emerge  |
| Japan          |     20.2 |   91.6 | develop |
| Libya          |     36.7 |   30.0 | emerge  |
| Malaysia       |     24.3 |   69.1 | emerge  |
| Mexico         |     31.8 |   41.8 | emerge  |
| NewZealand     |     24.3 |   69.4 | develop |
| Nigeria        |     46.2 |   15.8 | emerge  |
| Oman           |     28.6 |   51.8 | develop |
| Panama         |     38.6 |   26.4 | emerge  |
| Spain          |     23.4 |   73.7 | develop |
| Sudan          |     60.5 |    6.0 | emerge  |
| Taiwan         |     22.2 |   79.9 | develop |
| Norway         |     21.4 |   84.6 | develop |
| Sweden         |     23.3 |   74.1 | develop |
| Togo           |     45.1 |   17.0 | emerge  |
| Ukraine        |     46.3 |   15.7 | emerge  |
| UnitedKingdom  |     20.8 |   87.8 | develop |
| UnitedStates   |     20.3 |   90.7 | develop |
| Vietnam        |     36.9 |   29.5 | emerge  |
| Zimbabwe       |     36.2 |   31.0 | emerge  |
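As a quick descriptive check (an addition to the original example), we can confirm which level glm() will treat as the event and compare the group means of the two predictors:

levels(volatile$market)  # 'develop' is the second level, so glm() will model P(develop)
aggregate(cbind(volatile, credit) ~ market, data = volatile, FUN = mean)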

## Logistic regression

In R, a logistic regression model is fitted with glm() using the binomial family. The default link function for the binomial family is the logit link.

fit <- glm(market ~ volatile + credit, family = "binomial", data = volatile)
fit
##
## Call:  glm(formula = market ~ volatile + credit, family = "binomial",
##     data = volatile)
##
## Coefficients:
## (Intercept)     volatile       credit
##   -11.56273      0.04501      0.17385
##
## Degrees of Freedom: 29 Total (i.e. Null);  27 Residual
## Null Deviance:       41.46
## Residual Deviance: 9.254     AIC: 15.25
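To see how the logit link turns the linear predictor into a probability, we can plug in a hypothetical country (volatility 25 and credit rating 80; these values are made up for illustration) and invert the link with plogis(), which agrees with predict(type = "response"):

eta <- sum(coef(fit) * c(1, 25, 80))  # linear predictor for the hypothetical values
plogis(eta)                           # inverse logit, i.e. P(market = 'develop')
predict(fit, newdata = data.frame(volatile = 25, credit = 80), type = "response")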

### Model fit results

# response profile
summary(volatile$market)
##  emerge develop 
##      16      14

In order to compute the model fit statistics for an intercept-only model, we can simply fit that model explicitly and use the generic functions.

fit2 <- glm(market ~ 1, family = "binomial", data = volatile)
AIC(fit2, fit)
##      df      AIC
## fit2  1 43.45540
## fit   3 15.25401
BIC(fit2, fit)
##      df      BIC
## fit2  1 44.85660
## fit   3 19.45761
-2 * logLik(fit2)
## 'log Lik.' 41.4554 (df=1)
-2 * logLik(fit)
## 'log Lik.' 9.254014 (df=3)

In the SAS output, the same model is fitted using PROC GENMOD, which provides some additional output assessing goodness of fit. Here we compute some of those statistics.

# Deviance
fit$deviance
## [1] 9.254014
fit$df.residual
## [1] 27
fit$deviance / fit$df.residual
## [1] 0.3427413

# Pearson's Chi-Square
sum(residuals(fit, "pearson")^2)
## [1] 9.81283
sum(residuals(fit, "pearson")^2) / fit$df.residual
## [1] 0.3634381

Both ratios are well below one, which shows evidence of under-dispersion.
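As a rough cross-check (this refit is an addition, not part of the original example), the dispersion parameter estimated by a quasi-binomial fit is essentially the Pearson chi-square divided by the residual degrees of freedom, and it is likewise well below one:

fit_quasi <- glm(market ~ volatile + credit, family = "quasibinomial", data = volatile)
summary(fit_quasi)$dispersion  # estimated dispersion, Pearson chi-square / residual df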

### Hypothesis testing

Since the likelihood ratio test statistic is just $$-2\log\Lambda$$, we can use the information above to calculate it.

-2 * as.numeric(logLik(fit2) - logLik(fit))
## [1] 32.20138
1 - pchisq(-2 * as.numeric(logLik(fit2) - logLik(fit)), df = 2)  # p-value
## [1] 1.017556e-07
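The same likelihood ratio test can also be obtained by comparing the two nested fits with anova(); this call is not in the original SAS translation, but it gives the identical statistic and p-value:

anova(fit2, fit, test = "Chisq")  # LR test of the two slope coefficients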

For a Wald test of $$H_0: L\theta = 0$$, we need to specify $$L$$ to compute the test statistic. For example, to test $$H_0: \beta_1 = \beta_2 = 0$$, we can set $$L$$ as follows:

L <- rbind(c(0, 1, 0), c(0, 0, 1))
L
##      [,1] [,2] [,3]
## [1,]    0    1    0
## [2,]    0    0    1

The aod package provides a wald.test() function. Note that a Wald test requires the variance of the parameter estimates, $$a(\phi)(\hat{F}\hat{V}\hat{F})^{-1}$$, which can be obtained with the vcov() function. R computes it via a QR decomposition, so the result differs slightly from SAS.
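One quick way to verify where this matrix comes from (a check added here, not part of the original post) is that vcov() on a glm object simply returns the scaled covariance matrix stored in the model summary:

all.equal(vcov(fit), summary(fit)$cov.scaled)  # should be TRUE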

library(aod)
wald.test(vcov(fit), b = coef(fit), L = L)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.0, df = 2, P(> X2) = 0.049
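To make explicit what wald.test() computes, here is a manual sketch of the statistic $$(L\hat{\theta})'(L\hat{V}L')^{-1}(L\hat{\theta})$$ using the same L and vcov(fit); it should reproduce the chi-square value above:

b <- coef(fit)
V <- vcov(fit)
w <- as.numeric(t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b))  # Wald chi-square
c(statistic = w, p.value = pchisq(w, df = nrow(L), lower.tail = FALSE))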

In SAS, the analysis of maximum likelihood estimates is done with Wald tests referred to a chi-square distribution with one degree of freedom. This is equivalent to a Z-test against the standard normal distribution: the Z statistic is the square root of the Wald chi-square statistic, and the p-values are identical.
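As a quick check (not shown in the original post), squaring the z values reported by summary() reproduces the per-parameter Wald chi-square statistics, and the chi-square p-values on one degree of freedom match the two-sided normal p-values:

z <- summary(fit)$coefficients[, "z value"]
cbind(wald.chisq = z^2,
      p.chisq    = pchisq(z^2, df = 1, lower.tail = FALSE),
      p.normal   = 2 * pnorm(-abs(z)))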

### Concordance

The percent concordant and percent discordant both require the fitted values. Note that we are modeling the probability of develop because it is the second level of the variable market.

i <- volatile$market == 'develop'
j <- volatile$market == 'emerge'
pairs <- sum(i) * sum(j)

# expand pairs into grid
grid <- with(volatile,
             cbind(expand.grid(yi = market[i], yj = market[j]),
                   expand.grid(pi = fitted(fit)[i], pj = fitted(fit)[j])))

# Percent Concordant
sum(grid$pi > grid$pj) / pairs
## [1] 0.9910714
# Percent Discordant
sum(grid$pi < grid$pj) / pairs
## [1] 0.008928571
# Somers' D
nc <- sum(grid$pi > grid$pj)
(nc - (pairs - nc)) / pairs
## [1] 0.9821429
# Tau-a
2 * (nc - (pairs - nc)) / (nobs(fit) * (nobs(fit) - 1))
## [1] 0.5057471
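The same pair counts also give the c statistic (the area under the ROC curve) as the concordant fraction plus half of the tied fraction; this extra check is an addition, not part of the original SAS output:

nd   <- sum(grid$pi < grid$pj)  # discordant pairs
ties <- pairs - nc - nd         # tied pairs
(nc + 0.5 * ties) / pairs       # c statistic (AUC)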

summary(fit)$coefficients
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -11.56273217 40.9868601 -0.2821083 0.7778605
## volatile      0.04500712  0.9463449  0.0475589 0.9620678
## credit        0.17384805  0.2632646  0.6603547 0.5090262

The estimates of the odds ratios are obtained by applying the exponential function to the parameter estimates and their confidence intervals.

# profile likelihood confidence interval
confint(fit)
## Waiting for profiling to be done...
##                   2.5 %    97.5 %
## (Intercept) -57.2306037        NA
## volatile             NA 0.8767002
## credit       -0.8975829 0.5591083
exp(cbind(OR = coef(fit)[-1], confint(fit)[-1, ]))  # odds ratio
## Waiting for profiling to be done...
##                OR     2.5 %   97.5 %
## volatile 1.046035        NA 2.402957
## credit   1.189875 0.4075536 1.749112

# Wald confidence intervals
confint.default(fit)
##                   2.5 %     97.5 %
## (Intercept) -91.8955018 68.7700374
## volatile     -1.8097948  1.8998090
## credit       -0.3421412  0.6898373
exp(coef(fit)[-1])  # odds ratio
## volatile   credit 
## 1.046035 1.189875
exp(confint.default(fit)[-1, ])
##              2.5 %   97.5 %
## volatile 0.1636877 6.684618
## credit   0.7102479 1.993391

### Classification table

The SAS output has a nice classification table that shows the classification results under different cutoffs. Thanks to this answer, R can produce a similar table as well, though there is a small difference due to precision (the maximum absolute difference between the two sets of predicted probabilities is 1.890711e-08).

getMisclass <- function(cutoff, p, event) {
  pred <- factor(1 * (p > cutoff), labels = levels(event))
  t <- table(pred, event)
  cat("cutoff ", cutoff, ":\n")
  print(t)
  cat("correct    :", round(sum(t[c(1, 4)]) / sum(t), 3), "\n")
  cat("Specificity:", round(t[1] / sum(t[, 1]), 3), "\n")
  cat("Sensitivity:", round(t[4] / sum(t[, 2]), 3), "\n\n")
  invisible(t)
}
cutoffs <- seq(0.1, 0.9, by = 0.1)
getMisclass(.5, p = fitted(fit), event = volatile$market)
## cutoff  0.5 :
##          event
## pred      emerge develop
##   emerge      15       1
##   develop      1      13
## correct    : 0.933
## Specificity: 0.938
## Sensitivity: 0.929

You can apply this function over all the cutoffs to get a more thorough output, similar to SAS.

sapply(cutoffs, getMisclass, p = fitted(fit), event = volatile$market)
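If a single compact table is preferred over the printed blocks, a small variant of the same idea (a sketch, not taken from the original post) collects the rates for every cutoff:

rates <- sapply(cutoffs, function(cutoff) {
  pred <- factor(1 * (fitted(fit) > cutoff), levels = c(0, 1),
                 labels = levels(volatile$market))
  t <- table(pred, volatile$market)
  c(correct     = sum(diag(t)) / sum(t),
    specificity = t[1, 1] / sum(t[, 1]),
    sensitivity = t[2, 2] / sum(t[, 2]))
})
round(rbind(cutoff = cutoffs, rates), 3)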