Data Frames

R Data Structures

BB

If you are comfortable with lists, then working with data frames will be easy: the latter are basically lists with certain additional attributes and constraints (e.g., equal-length columns). A data frame is also similar to a rectangular matrix, except that it can hold mixed data types and even other structures (e.g., lists, matrices, other data frames, etc.). One thing that sets R apart as a statistical programming language is that data frames are a built-in structure, unlike in other languages (e.g., Python).

Accordingly, data frames behave much like both lists and matrices.

iris <- iris

  str(iris)
  
## 'data.frame':	150 obs. of  5 variables:
  ##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  ##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  ##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  ##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  
# Not run for space reasons...
  # attributes(iris)
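
As a quick check of that dual nature, here is a minimal sketch on a small made-up data frame. The name `toy` and its columns are illustrative assumptions, not part of the original example; only base R functions are used.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# List-like behavior: each column is one element of a list
is.list(toy)   # TRUE
typeof(toy)    # "list"
length(toy)    # 2 (the number of columns, not rows)

# Matrix-like behavior: it also has two dimensions
dim(toy)       # 3 2
nrow(toy)      # 3
ncol(toy)      # 2
```

Note that `length()` counts columns, which is the list view of the object, while `dim()` gives the matrix view.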
  

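Underneath, a data frame is just a list whose class attribute is set to "data.frame": strip the class and you are left with a plain list, restore it and the data frame is back.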

attributes(iris)$class <- NULL
  attributes(iris)$class <- "data.frame"

  # Use 'head' or 'tail' to view the top/bottom part of a dataframe...

  head(iris,5)
  
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  ## 1          5.1         3.5          1.4         0.2  setosa
  ## 2          4.9         3.0          1.4         0.2  setosa
  ## 3          4.7         3.2          1.3         0.2  setosa
  ## 4          4.6         3.1          1.5         0.2  setosa
  ## 5          5.0         3.6          1.4         0.2  setosa
  

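You can also convert rectangular objects to data frames using as.data.frame, but be wary of R's default behavior here. In versions of R before 4.0.0, data frames default to factors as opposed to character (the output below reflects that older behavior); from R 4.0.0 onward the default is stringsAsFactors = FALSE. If you create a data frame on an older version, just set the option stringsAsFactors = FALSE to avoid this, or coerce factor columns to character (see ?as.character).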

foo <- list(letters=letters,LETTERS=LETTERS)
  str(foo)
  
## List of 2
  ##  $ letters: chr [1:26] "a" "b" "c" "d" ...
  ##  $ LETTERS: chr [1:26] "A" "B" "C" "D" ...
  
foo <- as.data.frame(foo)
  str(foo)
  
## 'data.frame':	26 obs. of  2 variables:
  ##  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
  ##  $ LETTERS: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
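
To make the two workarounds concrete, here is a small sketch. The names `chars`, `bar`, and `baz` are made up for this example; `stringsAsFactors` is passed explicitly so that the result does not depend on the default of the R version in use.

```r
chars <- list(letters = letters, LETTERS = LETTERS)

# Option 1: ask for character columns explicitly at creation time
bar <- as.data.frame(chars, stringsAsFactors = FALSE)
class(bar$letters)   # "character"

# Option 2: convert an existing factor column after the fact
baz <- as.data.frame(chars, stringsAsFactors = TRUE)
class(baz$letters)   # "factor"
baz$letters <- as.character(baz$letters)
class(baz$letters)   # "character"
```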
  

They are also indexed similar to matrices and lists.

# By numeric index...

  iris[1:10,1:2]
  
##    Sepal.Length Sepal.Width
  ## 1           5.1         3.5
  ## 2           4.9         3.0
  ## 3           4.7         3.2
  ## 4           4.6         3.1
  ## 5           5.0         3.6
  ## 6           5.4         3.9
  ## 7           4.6         3.4
  ## 8           5.0         3.4
  ## 9           4.4         2.9
  ## 10          4.9         3.1
  
# By name and row number...

  iris$Species[5]
  
## [1] setosa
  ## Levels: setosa versicolor virginica
  
# By multiple names...

  iris[1:5,c('Sepal.Length','Species')]
  
##   Sepal.Length Species
  ## 1          5.1  setosa
  ## 2          4.9  setosa
  ## 3          4.7  setosa
  ## 4          4.6  setosa
  ## 5          5.0  setosa
  
# Or by negative indexing...

  iris[1:5,-c(2:4)]
  
##   Sepal.Length Species
  ## 1          5.1  setosa
  ## 2          4.9  setosa
  ## 3          4.7  setosa
  ## 4          4.6  setosa
  ## 5          5.0  setosa
  
# By row values...

  head(iris[iris$Species=='setosa',],5)
  
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  ## 1          5.1         3.5          1.4         0.2  setosa
  ## 2          4.9         3.0          1.4         0.2  setosa
  ## 3          4.7         3.2          1.3         0.2  setosa
  ## 4          4.6         3.1          1.5         0.2  setosa
  ## 5          5.0         3.6          1.4         0.2  setosa
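
One detail worth knowing is that the list-style and matrix-style forms do not always return the same kind of object. The sketch below uses a hypothetical data frame called `toy` (not part of the iris example) to show when you get a vector back and when you get a data frame.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))

toy[["x"]]                 # list-style: the column as a plain vector
toy["x"]                   # list-style: a one-column data frame
toy[, "x"]                 # matrix-style: simplified to a vector
toy[, "x", drop = FALSE]   # matrix-style: stays a data frame
```

Passing `drop = FALSE` is a common defensive habit when later code expects a data frame regardless of how many columns were selected.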
  

You will frequently come across NA values (see ?NA) in your data frames, as data is often messy. Just know that you need to handle these differently than other values (the same goes for lists).

iris$Petal.Width[(1:50)*3] <- NA # Every third value becomes NA

  length(iris$Petal.Width[iris$Petal.Width==NA]) # This won't work: comparing anything with NA gives NA...
  
## [1] 150
  
# But this will...

  iris$Petal.Width[is.na(iris$Petal.Width)]
  
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
  ## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
  ## [47] NA NA NA NA
  

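Be careful when counting rows based on columns with NA: rows where the condition evaluates to NA are returned too (as all-NA rows) and get counted unless you exclude them explicitly...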

NROW(iris[iris$Petal.Width > .2,])
  
## [1] 126
  
NROW(iris[iris$Petal.Width > .2 & !is.na(iris$Petal.Width),])
  
## [1] 76
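
The same effect is easier to see on a tiny example. This is a sketch on a hypothetical five-row data frame named `widths`; it also shows three base R ways to count matching rows that treat NA as "no match".

```r
# Hypothetical data: two of the five values are missing
widths <- data.frame(w = c(0.1, NA, 0.5, NA, 0.3))

# Logical indexing keeps the NA rows, so the count is inflated
nrow(widths[widths$w > 0.2, , drop = FALSE])          # 4

# Three alternatives that ignore the NA rows
sum(widths$w > 0.2, na.rm = TRUE)                     # 2
nrow(widths[which(widths$w > 0.2), , drop = FALSE])   # 2
nrow(subset(widths, w > 0.2))                         # 2
```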
  

Have you ever wondered if it's possible to create a data frame with unequal length columns? I have, but the answer (as far as I know) is 'no', so please don't try. Or, if you must, save your work first...
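
For what it's worth, calling data.frame() with columns whose lengths cannot be reconciled simply raises an error. The sketch below captures that error and shows two possible workarounds; the names `msg`, `padded`, and `ragged` are made up for illustration.

```r
# data.frame() refuses columns whose lengths cannot be reconciled
msg <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
msg   # the error message, returned as a string

# Workaround 1: pad the shorter vector with NA first
b <- 1:2
length(b) <- 3   # b is now 1, 2, NA
padded <- data.frame(a = 1:3, b = b)

# Workaround 2: keep ragged data in a plain list instead
ragged <- list(a = 1:3, b = 1:2)
```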

You can also shove other objects (lists, for instance) into a data frame column, using I() to keep them AsIs (see ?I), if you want...

iris$Lists <- I(list(c(1,2,3)))

  head(iris$Lists,5)
  
## [[1]]
  ## [1] 1 2 3
  ##
  ## [[2]]
  ## [1] 1 2 3
  ##
  ## [[3]]
  ## [1] 1 2 3
  ##
  ## [[4]]
  ## [1] 1 2 3
  ##
  ## [[5]]
  ## [1] 1 2 3
  

Columns can be created in multiple ways. If you assign a constant to a new column, it will be replicated to the length of the data frame, as will any vector whose length divides the number of rows evenly. Assignments of any other length, which often arise from subsetting operations, will fail.

iris$Height <- rep(c(1,2,3)) # Works: 150 is a multiple of 3, so the values are recycled

  # No No: 150 is not a multiple of 100...

  iris$Height <- rep(c(1,2,3),length.out=100)
  
## Error in `$<-.data.frame`(`*tmp*`, Height, value = c(1, 2, 3, 1, 2, 3, : replacement has 100 rows, data has 150
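
The recycling behavior can be sketched on a four-row data frame. The name `grid` and its columns are hypothetical; the failing assignment is wrapped in tryCatch so the snippet runs to the end.

```r
# Hypothetical four-row data frame
grid <- data.frame(id = 1:4)

grid$const <- 0            # length 1: repeated four times
grid$pair <- c("a", "b")   # length 2: repeated twice

# Length 3 does not fit into 4 rows, so the assignment errors out
ok <- tryCatch({
  grid$bad <- 1:3
  TRUE
}, error = function(e) FALSE)

ok            # FALSE
names(grid)   # "id" "const" "pair"
```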
  

If you are using an IDE like RStudio, be wary of visually identifying row numbers. Row positions within brackets will always be correct (e.g., df[100:150, ]), but row names will not match them, or at least will no longer be consecutive, if you have performed subsetting operations.

# Reverting back to the unedited data frame...

  iris <- datasets::iris

  iris <- iris[iris$Species!='versicolor',]

  # Note how the row names are no longer consecutive...

  row.names(iris)[46:55]
  
##  [1] "46"  "47"  "48"  "49"  "50"  "101" "102" "103" "104" "105"
  
# They can be reset like this...

  row.names(iris) <- NULL
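
The difference between a row's position and its name is easy to check on a small case. This is a sketch on a hypothetical data frame `vals`; a number inside the brackets refers to position, while a quoted string refers to the row name.

```r
# Hypothetical four-row data frame
vals <- data.frame(x = c(10, 20, 30, 40))

# Subsetting keeps the original row names
kept <- vals[vals$x > 15, , drop = FALSE]
names_before <- row.names(kept)   # "2" "3" "4"

kept[1, "x"]     # 20: first row by position
kept["2", "x"]   # 20: the row named "2"

# Resetting makes names and positions agree again
row.names(kept) <- NULL
names_after <- row.names(kept)    # "1" "2" "3"
```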
  

Finally, realize that this is just a very brief overview of base R's data frame class. Packages like tibble and data.table have their own takes on data frames, which are most definitely worth exploring.
