Recoding

Ways to Recode Variables in R

BB

Overview

There are a lot of ways to recode variables in R. In fact, so many that this overview can't possibly cover them all. However, this guide will attempt to cover most of the options available in base R, as well as give a brief overview of dplyr.

Topics include:

  1. ifelse
  2. match
  3. `[<-` (e.g., named vector look-ups)
  4. gsub
  5. dplyr::case_when
  6. dplyr::recode
  7. Interactive Recoding Function

library(dplyr)
library(nycflights13)

We'll use nycflights13::flights (from Hadley Wickham), given that it has a good mix of character and numeric variables and, while not a small data set, is also not so large as to make experimenting cumbersome. We'll also use the auxiliary nycflights13::airlines data set.

df <- flights
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
airlines <- airlines
glimpse(airlines)
## Observations: 16
## Variables: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", ...
## $ name    <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska...

`ifelse`

What are the options? Most people start off along the conditionals route. Let's recode departure delay to a categorical variable that indicates whether there was a delay or not.

df$dep_delay_cat <- ifelse(df$dep_delay>0,'delay','no delay')

table(df$dep_delay_cat)
##
##    delay no delay
##   128432   200089

You can nest ifelse for multiple conditions, but it gets confusing very quickly (a fuller treatment is beyond the scope of this page)...

df$dep_delay_cat <- ifelse(df$dep_delay>0,'late',
                           ifelse(df$dep_delay<0,'early','on-time'))

table(df$dep_delay_cat)
##
##   early    late on-time
##  183575  128432   16514

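The code below won't work, as `if` is not vectorized. Older versions of R warn and use only the first element, as shown here; R 4.2.0 and later stop with an error instead...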

df$dep_delay_cat2 <- `if`(df$dep_delay>0,'delay','no delay')
## Warning in if (df$dep_delay > 0) "delay" else "no delay": the condition has
## length > 1 and only the first element will be used
head(df$dep_delay_cat2,5)
## [1] "delay" "delay" "delay" "delay" "delay"

If you want to do something along the lines of the above, use Vectorize (see ?Vectorize for more), but this is ugly, so it is better to use ifelse...

recode_func <- Vectorize(function(x) {
    if (is.na(x)) {x <- NA} # Need to account for NA!
    else if (x>0) {x <- 'delay'}
    else {x <- 'no delay'}
  })

df$dep_delay_cat2 <- recode_func(df$dep_delay)

head(df$dep_delay_cat2,5)
## [1] "delay"    "delay"    "delay"    "no delay" "no delay"

Subsetting

So far, while helpful, the covered functions are all less than graceful when it comes to recoding a larger number of values. This is where R's built-in vectorized subsetting operations come in extremely handy. There are a number of different ways to do this.

`match`

First, despite the somewhat perplexing documentation (see ?match), match will reliably allow you to recode a single column or vector based on a key-value pair from another data structure.

df$carrier_full <- airlines$name[match(df$carrier,airlines$carrier)]

head(df$carrier_full,5)
## [1] "United Air Lines Inc."  "United Air Lines Inc."
## [3] "American Airlines Inc." "JetBlue Airways"
## [5] "Delta Air Lines Inc."
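Since the documentation can be perplexing, here is a small sketch of what match actually returns, using a made-up key table rather than the real airlines data: for each value in the first argument, it gives the position of the first match in the second argument (or NA if there is none). Indexing the name column by those positions does the recoding.

```r
# Made-up key table and codes for illustration; 'ZZ' has no entry in the key
key <- data.frame(carrier = c('AA', 'B6', 'UA'),
                  name    = c('American', 'JetBlue', 'United'),
                  stringsAsFactors = FALSE)
codes <- c('UA', 'AA', 'B6', 'ZZ')

# Position of each code within key$carrier (NA when there is no match)
pos <- match(codes, key$carrier)
pos
# 3 1 2 NA

# Use the positions to pull the corresponding names
key$name[pos]
# "United" "American" "JetBlue" NA
```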

Named Look-up Table/Subset Method

Using a named vector as a look-up table is equally powerful (and my preferred method).

airline_lookup <- setNames(airlines$name,airlines$carrier)

df$carrier_full <- airline_lookup[df$carrier]    # See also '?replace'

head(df$carrier_full,5)
## [1] "United Air Lines Inc."  "United Air Lines Inc."
## [3] "American Airlines Inc." "JetBlue Airways"
## [5] "Delta Air Lines Inc."

Note that the look-up table needs an entry for every unique value you want to recode; any value without one comes back as a missing value.

head(airline_lookup[-2][df$carrier],5)
##                      UA                      UA                    <NA>
## "United Air Lines Inc." "United Air Lines Inc."                      NA
##                      B6                      DL
##       "JetBlue Airways"  "Delta Air Lines Inc."
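If you would rather keep the original code than get NA for values missing from the look-up table, one possible approach is to fall back to the input wherever the look-up fails. The sketch below uses a deliberately incomplete, made-up look-up vector; the names `lookup`, `carrier`, `recoded` and `filled` are illustrative only.

```r
# Incomplete look-up table: there is deliberately no entry for 'AA'
lookup  <- c(UA = 'United Air Lines Inc.', B6 = 'JetBlue Airways')
carrier <- c('UA', 'AA', 'B6')

# Plain look-up: the unmatched value becomes NA
recoded <- unname(lookup[carrier])
recoded
# "United Air Lines Inc." NA "JetBlue Airways"

# Fall back to the original code where the look-up found nothing
filled <- ifelse(is.na(recoded), carrier, recoded)
filled
# "United Air Lines Inc." "AA" "JetBlue Airways"
```

Note that this also turns values that were already NA in the input back into NA, which is usually what you want.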

On the other hand, it's perfectly acceptable to have more values/names than are in whatever object you want to recode. In fact, you could conceivably recode every character vector in a data frame using this, should you want to. I also regularly use it within an interactive function to quickly recode objects with a limited number of values. Scroll to the end of the page to see examples of both...

String Substitution (`gsub`)

Although it doesn't save much time, you can use R's native functions for working with strings (or stringr, stringi, etc.) for quick replacements as well. Note that `gsub` is but one of a number of string functions...

head(gsub('LGA','LaGuardia',df$origin))  # According to many people, 'POS' might be a better substitution...
## [1] "EWR"       "LaGuardia" "JFK"       "JFK"       "LaGuardia" "EWR"
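One thing to keep in mind is that gsub treats its first argument as a regular expression and replaces every match, including matches inside longer strings. The sketch below uses a made-up vector (the code 'LGAX' is invented) to show the difference between an unanchored pattern and one anchored with ^ and $ so that only whole values are replaced.

```r
# Made-up origin codes; 'LGAX' is invented to show partial matching
origin <- c('EWR', 'LGA', 'JFK', 'LGAX')

# Unanchored pattern: also rewrites the 'LGA' inside 'LGAX'
loose <- gsub('LGA', 'LaGuardia', origin)
loose
# "EWR" "LaGuardia" "JFK" "LaGuardiaX"

# Anchored pattern: only values that are exactly 'LGA' are replaced
strict <- gsub('^LGA$', 'LaGuardia', origin)
strict
# "EWR" "LaGuardia" "JFK" "LGAX"
```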

`dplyr` Functions

There are a number of useful functions from third-party packages as well. While a number of other recoding tutorials mention car::recode from the 'car' package, I choose not to, as dplyr has both dplyr::recode and dplyr::case_when, which I'll briefly discuss.

`dplyr::case_when`

'case_when' is a good choice when you want to recode based on multiple logical conditions. Although the formula interface is straightforward, the function is not trivial. As the function documentation notes,

'Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general'

The following not only doesn't follow that rule, but also has a few other problems...

# No No

df <- df %>% mutate(angry = case_when(
    dep_delay == 0 ~ 'on-time',
    dep_delay > 60 ~ 'wayyy too ^$%#ing late',
    dep_delay > 20 ~ 'wayyy late',
    dep_delay > 0 ~ 'late',
    dep_delay < 20 ~ 'wayyy early',
    dep_delay < 0 ~ 'early'))

table(df$angry)
##
##                   late                on-time            wayyy early
##                  66799                  16514                 183575
##             wayyy late wayyy too ^$%#ing late
##                  35052                  26581

Instead, do this, as it will capture each unique condition. Note that it may take some trial and error (at least it did for me, but that may be because I'm an idiot...)

df<- df %>% mutate(angry = case_when(
    dep_delay == 0 ~ 'on-time',
    dep_delay < -20 ~ 'wayyy early',
    dep_delay < 0  ~ 'early',
    dep_delay > 20 &   dep_delay < 60  ~ 'wayyy late',
    dep_delay >= 60 ~ 'wayyy too ^$%#ing late',
    dep_delay > 0 ~ 'late'))

table(df$angry)
##
##                  early                   late                on-time
##                 183534                  66799                  16514
##            wayyy early             wayyy late wayyy too ^$%#ing late
##                     41                  34574                  27059
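Because conditions are evaluated in order, another way to arrive at the same categories is to write them from one end of the scale to the other and finish with a catch-all. The sketch below is one possible arrangement on a made-up vector, not the version used above; the labels are tamer placeholders, and the explicit is.na() branch is an assumption about how you might want missing delays handled (without it they stay NA).

```r
library(dplyr)

# Made-up delays (in minutes) covering each branch, plus a missing value
delay <- c(-30, -5, 0, 10, 45, 90, NA)

delay_label <- case_when(
  is.na(delay) ~ 'unknown',         # handle NA first, otherwise it stays NA
  delay < -20  ~ 'wayyy early',
  delay < 0    ~ 'early',           # only reached when delay is -20 or more
  delay == 0   ~ 'on-time',
  delay <= 20  ~ 'late',            # only reached when delay is above 0
  delay < 60   ~ 'wayyy late',
  TRUE         ~ 'wayyy too late'   # catch-all: everything left (60 and up)
)

delay_label
```

Each condition only needs to state its upper bound, since earlier branches have already claimed everything below it.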

`dplyr::recode`

Now let's pretend someone tells you to make your coding scheme more professional and instead of telling them to go $%^# themselves you decide to oblige... Because, whatever...

df$dep_delay_cat2 <- recode(df$angry,
    `on-time` = 'Departed On-Schedule',
    `wayyy early` = 'More than 20 Minutes Ahead of Schedule',
    `early` = 'Ahead of Schedule',
    `wayyy too ^$%#ing late` = 'More than 60 Minutes Behind Schedule',
    `wayyy late` = 'Between 20 and 60 Minutes Behind Schedule',
    .default = 'Behind Schedule')

# Note the '.default' option (as well as other options...)
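The named look-up vector idea from earlier also works with recode: the dplyr documentation shows splicing a named vector into the call with `!!!`. The sketch below assumes a small made-up key and input vector; values without an entry in the key get `.default`, and missing values stay missing unless `.missing` is supplied.

```r
library(dplyr)

# Made-up input and key; names of the key are old values, contents are new ones
status <- c('early', 'late', 'on-time', 'wayyy late', NA)
key <- c(early = 'Ahead of Schedule', `on-time` = 'Departed On-Schedule')

# Splice the key into recode(); anything not in the key falls back to .default
status_clean <- recode(status, !!!key, .default = 'Behind Schedule')

status_clean
```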

Recoding En Masse

To demonstrate this, let's upset some nerds and recode every character variable in the dplyr::starwars data set.

# Full data set also includes double and list columns..

  sw <- dplyr::starwars[sapply(starwars,is.character)]

  # Note, that I purposefully stay with base R here...

  # Stored as 'unique_vals' so that base::unique isn't masked

  unique_vals <- unlist(sapply(sw,unique))

  # Multiple NA so we'll just drop them to keep them as is

  unique_vals <- unique_vals[!is.na(unique_vals)]

  # Look-up table: original values are the names, made-up codes the contents

  wrong_names <- setNames(names(unique_vals),unique_vals)

  wrong_sw <- as.data.frame(lapply(sw, function(x) {

      wrong_names[x]

  }),stringsAsFactors = FALSE)

  # Now...

  knitr::kable(head(wrong_sw))
  
|               | name  | hair_color  | skin_color  | eye_color    | gender  | homeworld  | species  |
|---------------|-------|-------------|-------------|--------------|---------|------------|----------|
| Luke Skywalker | name1 | hair_color1 | skin_color1 | skin_color21 | gender1 | homeworld1 | species1 |
| C-3PO          | name2 | NA          | skin_color2 | skin_color23 | NA      | homeworld1 | species2 |
| R2-D2          | name3 | NA          | skin_color3 | skin_color20 | NA      | homeworld2 | species2 |
| Darth Vader    | name4 | hair_color3 | hair_color9 | skin_color23 | gender1 | homeworld1 | species1 |
| Leia Organa    | name5 | hair_color4 | skin_color5 | hair_color4  | gender3 | homeworld3 | species1 |
| Owen Lars      | name6 | hair_color5 | skin_color5 | skin_color21 | gender1 | homeworld1 | species1 |

# And previously..

  knitr::kable(head(sw))
  
| name           | hair_color  | skin_color  | eye_color | gender | homeworld | species |
|----------------|-------------|-------------|-----------|--------|-----------|---------|
| Luke Skywalker | blond       | fair        | blue      | male   | Tatooine  | Human   |
| C-3PO          | NA          | gold        | yellow    | NA     | Tatooine  | Droid   |
| R2-D2          | NA          | white, blue | red       | NA     | Naboo     | Droid   |
| Darth Vader    | none        | white       | yellow    | male   | Tatooine  | Human   |
| Leia Organa    | brown       | light       | brown     | female | Alderaan  | Human   |
| Owen Lars      | brown, grey | light       | blue      | male   | Tatooine  | Human   |

# This is a lot more useful if you have actual values in mind that you want to recode to...
  

Interactive Recoding Function

In terms of the function I mentioned above, I use (a variant of) the following to quickly change either a few column values or column names. As the day wears on and I get increasingly tired, elegant subsetting operations become more difficult, so it's nice to fall back upon something I wrote when I was feeling more fresh. Let's call it i_recode, short for interactive recode (VERY ORIGINAL).

i_recode <- function(x) {

    vec <- as.character(x)
    # select.list() expects character choices, so offer the converted values
    temp <- unique(vec)
    old_vector <- select.list(temp, multiple = TRUE)
    new_vector <- character(length(old_vector))
    for (i in seq_along(old_vector)) {
        cat("\nOld Value:", shQuote(old_vector[[i]]), "\n\n")
        new_vector[[i]] <- readline(prompt = "Please Enter New Value: ")
    }

    # Named look-up: old values are the names, new values the contents
    new_vector <- setNames(new_vector, old_vector)

    # Replace only the selected values; everything else is left untouched
    hit <- vec %in% old_vector
    vec[hit] <- new_vector[vec[hit]]
    vec

  }
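Since i_recode only makes sense in an interactive session, the replacement step is easier to try out in a non-interactive form. The helper below, `recode_values`, is a hypothetical name and not part of the original function; it is a sketch that takes the old and new values as arguments instead of prompting for them, and otherwise applies the same named look-up idea.

```r
# Hypothetical non-interactive counterpart to i_recode():
# 'old' and 'new' are parallel vectors of values to replace and replacements
recode_values <- function(x, old, new) {
  stopifnot(length(old) == length(new))
  vec <- as.character(x)
  # Named look-up: old values are the names, new values the contents
  lookup <- setNames(as.character(new), old)
  # Replace only the values listed in 'old'; leave the rest untouched
  hit <- vec %in% old
  vec[hit] <- lookup[vec[hit]]
  vec
}

# Usage with made-up replacements
result <- recode_values(c('EWR', 'LGA', 'JFK', 'LGA'),
                        old = c('LGA', 'JFK'),
                        new = c('LaGuardia', 'Kennedy'))
result
# "EWR" "LaGuardia" "Kennedy" "LaGuardia"
```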
  
