(Updated 2019/03) A Mistake Caused by Ifelse and Factors

Updated 2019/03: Just as I expected at the end of this post two years ago, a new package vctrs has come out to handle type stability. You can watch Hadley’s talk at the RStudio conference 2019, in which he talked about base ifelse being too free and dplyr::if_else being too strict.


I made a mistake in one of the data processing scripts I wrote in R a while ago and figured I should share the story with everybody.

The mistake was caught recently when I had to redo a similar request. I had already discarded the script, so I started from scratch and found that some of the numbers didn’t match the old ones in one of the tables I had sent. Without the old script, it was quite hard to figure out what had happened, and it turned out that a single letter caused the problem.

Here’s a simplified example. There are two variables in this survey data set, and they work together as a time measurement. One is called something like ABC_DEFG123HIJ_UN, which stands for the unit of time, and the other is called ABC_DEFG123HIJ_NN, which stands for the actual number. Here’s a snippet of what the data looks like. In reality I can’t view or print out the data because it’s too big.

##    CASEID ABC_DEFG123HIJ_UN ABC_DEFG123HIJ_NN
## 1      31     (2) 2 = Hours                 2
## 2      21   (1) 1 = Minutes                 5
## 3      70     (2) 2 = Hours                 1
## 4      22     (2) 2 = Hours                NA
## 5      47   (1) 1 = Minutes                30
## 6      61     (2) 2 = Hours                 2
## 7      25   (1) 1 = Minutes                 5
## 8      43     (2) 2 = Hours                 3
## 9      54   (1) 1 = Minutes                NA
## 10     10     (2) 2 = Hours                 1
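
To follow along, the snippet can be rebuilt as a small data frame. This is only a reconstruction from the printout, assuming the unit column is a factor with “(1) 1 = Minutes” as its first level and the number column is numeric.

```r
# Reconstruction of the printed snippet; the real data set is much larger.
sample_data <- data.frame(
  CASEID = c(31L, 21L, 70L, 22L, 47L, 61L, 25L, 43L, 54L, 10L),
  ABC_DEFG123HIJ_UN = factor(
    c("(2) 2 = Hours", "(1) 1 = Minutes", "(2) 2 = Hours", "(2) 2 = Hours",
      "(1) 1 = Minutes", "(2) 2 = Hours", "(1) 1 = Minutes", "(2) 2 = Hours",
      "(1) 1 = Minutes", "(2) 2 = Hours"),
    # Explicit level order (assumed): Minutes is code 1, Hours is code 2
    levels = c("(1) 1 = Minutes", "(2) 2 = Hours")
  ),
  ABC_DEFG123HIJ_NN = c(2, 5, 1, NA, 30, 2, 5, 3, NA, 1)
)
```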

To convert everything into minutes and calculate the average (and also the base size), this is all I needed to do.

library(dplyr)
sample_data %>% 
  mutate(time = ifelse(ABC_DEFG123HIJ_UN == "(1) 1 = Minutes", 
                       ABC_DEFG123HIJ_NN, ABC_DEFG123HIJ_NN * 60)) %>% 
  summarise(avg_time = mean(time, na.rm = TRUE), 
            n = sum(!is.na(time)))
## # A tibble: 1 x 2
##   avg_time     n
##      <dbl> <int>
## 1     72.5     8

However, this is not what I had reported! I had an average of \(60.4\) and \(9\) non-missing observations. You have probably noticed that \(9\) non-missing values is already an error, but without actually seeing the data I couldn’t know that. The average of \(72.5\) is close, but since n was also different, I started to suspect that I might have used the wrong variable. Since I was sure I was using the same data, logically it could only have come from a different variable with \(9\) non-missing values, right?

After going through all 2000 variables, I couldn’t find any other variable that had exactly \(9\) non-missing values and produced an average of \(60.4\).

After struggling for 2 days, I found that the following script gave me the exact same result as before. Note that I misspelled _NN as _UN, probably because of auto-completion.

sample_data %>% 
  mutate(time = ifelse(ABC_DEFG123HIJ_UN == "(1) 1 = Minutes", 
                       ABC_DEFG123HIJ_UN,  # A typo here
                       ABC_DEFG123HIJ_NN * 60)) %>% 
  summarise(avg_time = mean(time, na.rm = TRUE), 
            n = sum(!is.na(time)))
## # A tibble: 1 x 2
##   avg_time     n
##      <dbl> <int>
## 1     60.4     9

Here is a closer look at what happened. I accidentally put a factor and a numeric into one ifelse call. ifelse silently dropped the factor’s labels and returned its underlying integer codes, so the result was a plain numeric column. Not only did it change the values silently, it also happened to replace a missing value with a valid-looking number.

Consider a simpler example.

set.seed(123)
factor(LETTERS[1:5])
## [1] A B C D E
## Levels: A B C D E
# returns numeric vector instead of letters and numbers as character
ifelse(rnorm(5) > 0, factor(LETTERS[1:5]), 6:10)  
## [1] 6 7 3 4 5

Note that the 3, 4, 5 came from the factor of letters. A factor usually looks like a character vector when printed, but it is actually a vector of integers with labels. For more details on factors, see this paragraph from Advanced R.
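
A quick way to see the integers behind a factor is to inspect it with base R. This is a minimal sketch; the variable name f is only for illustration.

```r
f <- factor(LETTERS[1:5])

typeof(f)      # storage type of a factor is "integer"
as.integer(f)  # the underlying codes, here 1 to 5
levels(f)      # the labels those codes point to

# ifelse() does not carry the levels over to its result,
# so only the integer codes of the selected elements come through.
ifelse(c(TRUE, FALSE), f[1:2], 0L)
```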

Compare this with the case where a factor and a character vector go into one ifelse call. The result is a character vector, but the factor contributes its integer codes as strings, not its labels.

ifelse(rnorm(5) > 0, factor(LETTERS[1:5]), LETTERS[6:10])
## [1] "1" "2" "H" "I" "J"
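
If the labels are what you want, one option is to convert the factor with as.character() before it reaches ifelse. The sketch below uses a fixed logical vector (an assumption chosen to mirror the TRUE/FALSE pattern in the output above) instead of rnorm, so that the result does not depend on the random seed.

```r
f <- factor(LETTERS[1:5])
cond <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# as.character() maps the codes back to labels before ifelse() sees them
res <- ifelse(cond, as.character(f), LETTERS[6:10])
res
## [1] "A" "B" "H" "I" "J"
```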

In my data, the missing values associated with level "(2) 2 = Hours" were not affected, because they fall in the FALSE branch of the ifelse. However, the missing values under "(1) 1 = Minutes" fall in the TRUE branch, where, because of the misspelling, they were replaced by the integer \(1\) along with every other value. Look at all the time values that come from "(1) 1 = Minutes".

sample_data %>% 
  mutate(time = ifelse(ABC_DEFG123HIJ_UN == "(1) 1 = Minutes", 
                       ABC_DEFG123HIJ_UN,  # A typo here
                       ABC_DEFG123HIJ_NN * 60))
## # A tibble: 10 x 4
##    CASEID ABC_DEFG123HIJ_UN ABC_DEFG123HIJ_NN  time
##     <int> <fct>                         <dbl> <dbl>
##  1     31 (2) 2 = Hours                     2   120
##  2     21 (1) 1 = Minutes                   5     1
##  3     70 (2) 2 = Hours                     1    60
##  4     22 (2) 2 = Hours                    NA    NA
##  5     47 (1) 1 = Minutes                  30     1
##  6     61 (2) 2 = Hours                     2   120
##  7     25 (1) 1 = Minutes                   5     1
##  8     43 (2) 2 = Hours                     3   180
##  9     54 (1) 1 = Minutes                  NA     1
## 10     10 (2) 2 = Hours                     1    60
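
Averaging that time column by hand reproduces the numbers I had reported. The vector below is simply copied from the time column printed above.

```r
# Values of `time` from the table above, in row order
time_typo <- c(120, 1, 60, NA, 1, 120, 1, 180, 1, 60)

n_typo <- sum(!is.na(time_typo))                    # 9 non-missing values
avg_typo <- round(mean(time_typo, na.rm = TRUE), 1) # 544 / 9, rounds to 60.4
c(avg_time = avg_typo, n = n_typo)
```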

The typo created a valid-looking new column, part factor codes and part numbers, without any warning or error message. It was perfectly disguised by the missing values carried over from “Hours”, which led me to look for variables with 9 non-missing values and overlook the real problem.

This is really annoying: even a function as simple as ifelse is not that straightforward or safe to use. The mistake would have been more obvious if I hadn’t been using two variable names that close, but the unexpected behavior when mixing two variables of different classes still shouldn’t be ignored.

Then Hadley came to the rescue. Use dplyr::if_else instead of ifelse to improve error and missing value handling.

# Give warnings when converting integer to factors
if_else(rnorm(5) > 0, factor(LETTERS[1:5]), 6:10)  
## Warning in `[<-.factor`(`*tmp*`, i, value = 10L): invalid factor level, NA
## generated
## [1] A    B    C    D    <NA>
## Levels: A B C D E
# Don't convert factors
if_else(rnorm(5) > 0, factor(LETTERS[1:5]), LETTERS[6:10])
## Error: `false` has type 'character' not 'integer'
# Handling missing in condition
if_else(c(TRUE, FALSE, NA), 1, 0, missing = -1)
## [1]  1  0 -1

Therefore I recommend using dplyr::if_else whenever possible and taking care when dealing with factors. There might be some new package coming out soon to handle factors appropriately.
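
Where dplyr is not an option, a small guard around base ifelse would also have caught my typo. The helper below is a hypothetical sketch, not part of base R or dplyr: the name ifelse_strict and its checks are my own assumptions about what “being careful” could look like.

```r
# Hypothetical helper: refuses inputs that ifelse() would coerce silently
ifelse_strict <- function(test, yes, no) {
  # Factors would come back as their integer codes, so stop early
  if (is.factor(yes) || is.factor(no)) {
    stop("`yes` and `no` must not be factors; convert them explicitly first")
  }
  # Integer and double both have mode "numeric"; character does not
  if (mode(yes) != mode(no)) {
    stop("`yes` has mode '", mode(yes), "' but `no` has mode '", mode(no), "'")
  }
  ifelse(test, yes, no)
}

# Numeric branches pass through unchanged
ifelse_strict(c(TRUE, FALSE), c(5, 30), c(2, 1) * 60)

# A factor in either branch now fails loudly instead of returning codes
try(ifelse_strict(TRUE, factor("(1) 1 = Minutes"), 60))
```

The check is deliberately blunt: it rejects every factor rather than guessing whether labels or codes were intended.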