1 Prérequis supposés

1.1 Prerequis

  • Savoir lire l’anglais
  • Bases de R
    • Créer des objets
    • Opérations de base
  • Objets de base en R
    • Vecteurs
    • Listes
    • Data.frame
  • Fonctions de bases
  • Import de données

1.2 Références utiles

Les deux premiers sont disponible en ligne

2 Tidyverse

2.1 What is tidyverse?

  • Set of packages for tidy and unified data processing, visualization and modelling.

  • Consists in a set of method having unified structure.

  • Involves slight modifications in classical R grammar.

  • This grammar becomes dominant, and highly documented.

2.2 Loading tidyverse

library(tidyverse)
  • Load a lot of packages
  • Some of them are often updated! Watch for updates!

3 The tibble format

3.1 tibble, a nicer data.frame

  • The tibble is a “modern” version of the data.frame
iris_tibble <- as_tibble(iris) # Natural conversion
  • The printing is naturally cropped, giving types, number of rows, cols
iris_tibble
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

3.2 Creating a tibble

The function tibble creates naturally a tibble.

my_tibble <- tibble(Nom = c("Alice", "Bob", "Claire"),
                    Age = c(10, 25, 30))

It can be neatly printed with knitr::kable, like a data.frame

my_tibble
# A tibble: 3 x 2
  Nom      Age
  <chr>  <dbl>
1 Alice     10
2 Bob       25
3 Claire    30

3.3 Extracting in tibble

Exactly as in data.frame

my_tibble[1] # Returns a tibble
my_tibble[, "Nom"]# Returns a tibble. Equivalent to my_tibble["Nom"]
# For a data.frame, this last line would return a vector, not a data.frame

To extract columns, the pull function is introduced (also works with data.frame).

dplyr::pull(my_tibble, Nom)# The pull function extracts a column to a vector
[1] "Alice"  "Bob"    "Claire"
my_tibble$Nom # But the old fashion way still works
[1] "Alice"  "Bob"    "Claire"

3.4 Row names in tibble

For programming reasons, tibble can’t have row names

head(swiss[, 1:3], n = 2)
           Fertility Agriculture Examination
Courtelary      80.2        17.0          15
Delemont        83.1        45.1           6
print(as_tibble(swiss[, 1:3]), n = 2)
# A tibble: 47 x 3
  Fertility Agriculture Examination
      <dbl>       <dbl>       <int>
1      80.2        17            15
2      83.1        45.1           6
# … with 45 more rows
print(as_tibble(swiss[, 1:3], rownames = "Province"), n = 2) # Creating a new column
# A tibble: 47 x 4
  Province   Fertility Agriculture Examination
  <chr>          <dbl>       <dbl>       <int>
1 Courtelary      80.2        17            15
2 Delemont        83.1        45.1           6
# … with 45 more rows

4 %>% For sequential data processing

4.1 diamonds data set

First, we load a toy data set

data(diamonds) # Load the diamond data set
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

4.2 Example of sequential processing

  • Imagine we want to compute the mean (omitting NAs) price of diamonds having a Good cut and a color of type D.

There is the one-line way, rather cumbersome

mean(diamonds[diamonds$cut == "Good" & diamonds$color == "D", ]$price, na.rm = T)
[1] 3405.382

There is the way in multiple lines, creating intermediary objects

condition <- diamonds$cut == "Good" & diamonds$color == "D"
sub_prices <- diamonds[condition, ]$price # could have use pull
mean(sub_prices, na.rm = T)

4.3 Tidier sequential processing using %>%

The “pipe” instruction %>% allows to write this sequential instruction in an easy to read way, without creating intermediary objects.

  • x %>% f() is equivalent to f(x)

  • x %>% f(y) is equivalent to f(x, y)

  • When you read code, %>% is pronounced “then”

# Same example as before
diamonds %>% # We take the data, then
  dplyr::filter(cut == "Good", color == "D") %>% # Subsetting (using dplyr)
  dplyr::pull(price) %>% # extract the price
  mean(na.rm = T) # compute the mean, omitting the NAs
[1] 3405.382
  • The result of the previous treatment is set as the first default argument of the next function
  • No redundancy of the diamonds using $
  • The keyboard shortcut in Rstudio for %>% is Ctrl + Maj + M
  • Note that the indenting naturally allows commenting

4.4 More about the %>%

If you need to specify specifically which argument you want your input to go to , you can use the dot “.”

# Adjusting a linear regression on diamonds price w.r.t. carat
diamonds %>%
  lm(data = ., formula = price ~ carat) # The "." refers to entering argument

Some advices (from the bible of tidyverse)

  • Use %>% to emphasize a sequence of actions, rather than the object that the actions are being performed on.

  • Avoid using the pipe when:

    • You need to manipulate more than one object at a time. Reserve pipes for a sequence of steps applied to one primary object.
    • There are meaningful intermediate objects that could be given informative names.

5 Manipulating data

5.1 The dplyr package

library(dplyr)

dplyr is a package (part of tidyverse) which allows you to solve the vast majority of your data-manipulation challenge:

  • create variables

  • pick variables

  • reorder observations

  • pick observations

  • create summaries

Functions in this package are verbs and have consistent structures.

5.2 Extracting lines on number: slice()

  • You can extract lines (or slicing the data) using slice()
  • The results remains a tibble (or a data.frame)
diamonds %>%
  slice(c(3, 8)) # Select lines 3 and 8
# A tibble: 2 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
2  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53

5.3 Conditional extraction: filter()

  • dplyr::filter allows to return rows with matching conditions
data(diamonds, package = "ggplot2") # data set use
diamonds %>%
  dplyr::filter(color == "D", clarity == "SI1")# the , is equivalent to an "&"
# A tibble: 2,083 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
 2  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
 3  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
 4  0.75 Very Good D     SI1      63.2    56  2760  5.8   5.75  3.65
 5  0.71 Very Good D     SI1      63.6    58  2764  5.64  5.68  3.6 
 6  0.71 Ideal     D     SI1      61.9    59  2764  5.69  5.72  3.53
 7  0.73 Very Good D     SI1      60.2    56  2768  5.83  5.87  3.52
 8  0.7  Very Good D     SI1      61.1    58  2768  5.66  5.73  3.48
 9  0.72 Ideal     D     SI1      60.8    57  2782  5.76  5.75  3.5 
10  0.72 Premium   D     SI1      62.7    59  2782  5.73  5.69  3.58
# … with 2,073 more rows
# Equivalent to the one before
diamonds %>%
  dplyr::filter(color == "D" & clarity == "SI1")
# A tibble: 2,083 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
 2  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
 3  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
 4  0.75 Very Good D     SI1      63.2    56  2760  5.8   5.75  3.65
 5  0.71 Very Good D     SI1      63.6    58  2764  5.64  5.68  3.6 
 6  0.71 Ideal     D     SI1      61.9    59  2764  5.69  5.72  3.53
 7  0.73 Very Good D     SI1      60.2    56  2768  5.83  5.87  3.52
 8  0.7  Very Good D     SI1      61.1    58  2768  5.66  5.73  3.48
 9  0.72 Ideal     D     SI1      60.8    57  2782  5.76  5.75  3.5 
10  0.72 Premium   D     SI1      62.7    59  2782  5.73  5.69  3.58
# … with 2,073 more rows
# One with a "or" (using the in)
diamonds %>%
  dplyr::filter(color %in% c("D", "E"), clarity == "SI1")
# A tibble: 4,509 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
 4  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
 5  0.7  Ideal     E     SI1      62.5    57  2757  5.7   5.72  3.57
 6  0.73 Very Good E     SI1      61.6    59  2760  5.77  5.78  3.56
 7  0.75 Very Good D     SI1      63.2    56  2760  5.8   5.75  3.65
 8  0.75 Premium   E     SI1      59.9    54  2760  6     5.96  3.58
 9  0.74 Ideal     E     SI1      62.3    54  2762  5.8   5.83  3.62
10  0.71 Very Good D     SI1      63.6    58  2764  5.64  5.68  3.6 
# … with 4,499 more rows

5.4 Choosing among columns select()

5.4.1 Selecting on column number

  • The basic select works on column number.
diamonds %>%
  select(1, 4) # Selections des colonnes 1 et 4
# A tibble: 53,940 x 2
   carat clarity
   <dbl> <ord>  
 1  0.23 SI2    
 2  0.21 SI1    
 3  0.23 VS1    
 4  0.29 VS2    
 5  0.31 SI2    
 6  0.24 VVS2   
 7  0.24 VVS1   
 8  0.26 SI1    
 9  0.22 VS2    
10  0.23 VS1    
# … with 53,930 more rows

5.4.2 Selecting on names

  • The basic select works on column names.
diamonds %>%
  select(carat, cut, depth)
# A tibble: 53,940 x 3
   carat cut       depth
   <dbl> <ord>     <dbl>
 1  0.23 Ideal      61.5
 2  0.21 Premium    59.8
 3  0.23 Good       56.9
 4  0.29 Premium    62.4
 5  0.31 Good       63.3
 6  0.24 Very Good  62.8
 7  0.24 Very Good  62.3
 8  0.26 Very Good  61.9
 9  0.22 Fair       65.1
10  0.23 Very Good  59.4
# … with 53,930 more rows

5.4.3 Removing column

One can remove column using select, by adding - in front of the column name:

diamonds %>%
  select(-carat, -cut) # Select all columns but carat and cut
# A tibble: 53,940 x 8
   color clarity depth table price     x     y     z
   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 E     SI2      61.5    55   326  3.95  3.98  2.43
 2 E     SI1      59.8    61   326  3.89  3.84  2.31
 3 E     VS1      56.9    65   327  4.05  4.07  2.31
 4 I     VS2      62.4    58   334  4.2   4.23  2.63
 5 J     SI2      63.3    58   335  4.34  4.35  2.75
 6 J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 H     SI1      61.9    55   337  4.07  4.11  2.53
 9 E     VS2      65.1    61   337  3.87  3.78  2.49
10 H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

5.4.4 Selecting on a condition select_if()

  • One can select on column condition with select_if
  • Put in argument a function on column that returns a TRUE/FALSE
diamonds %>%
  select_if(.predicate = is.factor) # Select only factor columns, the .predicate is optional
# A tibble: 53,940 x 3
   cut       color clarity
   <ord>     <ord> <ord>  
 1 Ideal     E     SI2    
 2 Premium   E     SI1    
 3 Good      E     VS1    
 4 Premium   I     VS2    
 5 Good      J     SI2    
 6 Very Good J     VVS2   
 7 Very Good I     VVS1   
 8 Very Good H     SI1    
 9 Fair      E     VS2    
10 Very Good H     VS1    
# … with 53,930 more rows
diamonds %>%
  select_if(is.numeric) # Select only numeric columns
# A tibble: 53,940 x 7
   carat depth table price     x     y     z
   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23  61.5    55   326  3.95  3.98  2.43
 2  0.21  59.8    61   326  3.89  3.84  2.31
 3  0.23  56.9    65   327  4.05  4.07  2.31
 4  0.29  62.4    58   334  4.2   4.23  2.63
 5  0.31  63.3    58   335  4.34  4.35  2.75
 6  0.24  62.8    57   336  3.94  3.96  2.48
 7  0.24  62.3    57   336  3.95  3.98  2.47
 8  0.26  61.9    55   337  4.07  4.11  2.53
 9  0.22  65.1    61   337  3.87  3.78  2.49
10  0.23  59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

5.4.5 Selecting on a characteristic select_at()

  • One can select on position using select_at. It uses the dplyr function vars that allows to enclose text expressions.
diamonds %>%
  select_at(.vars = vars(starts_with("c"))) # Select columns starting with "c"
# A tibble: 53,940 x 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# … with 53,930 more rows

5.5 Extracting columns with pull()

  • Function to extract a column (by its name or number) as a vector
diamonds %>%
  pull(cut) %>%  # extract the cut variable by name as a vector
  head(n = 5) # only print first 5 values
[1] Ideal   Premium Good    Premium Good   
Levels: Fair < Good < Very Good < Premium < Ideal
diamonds %>%
  pull(2) %>% # extract the cut by number variable as a vector
  head(n = 5)
[1] Ideal   Premium Good    Premium Good   
Levels: Fair < Good < Very Good < Premium < Ideal

5.6 Renaming columns rename()

5.6.1 Basic renaming

  • You might want to rename column(s)
diamonds %>%
  rename(length = x, # Always New = old
         width = y) # You can rename multiple columns at once
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price length width     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>  <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326   3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326   3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327   4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334   4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335   4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336   3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336   3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337   4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337   3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338   4     4.05  2.39
# … with 53,930 more rows

5.6.2 Renaming on condition rename_if()

  • In the same way, a rename_if base on a condition
  • Require as argument a function to apply to each selected column.
# Add the "_fact" suffix to every factor column name
diamonds %>%
  rename_if(.predicate = is.factor, .funs = function(x) paste0(x, "_fact"))
# A tibble: 53,940 x 10
   carat cut_fact  color_fact clarity_fact depth table price     x     y     z
   <dbl> <ord>     <ord>      <ord>        <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E          SI2           61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E          SI1           59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E          VS1           56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I          VS2           62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J          SI2           63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J          VVS2          62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I          VVS1          62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H          SI1           61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E          VS2           65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H          VS1           59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

5.6.3 Renaming on a characteristic rename_at()

  • As well as a rename_at()
  • Require as argument a function to apply to each selected column.
# Putting in upper case columns starting with "c"
diamonds %>%
  rename_at(.vars = vars(starts_with("c")), .funs = toupper)
# A tibble: 53,940 x 10
   CARAT CUT       COLOR CLARITY depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

5.7 Create or modify a column: mutate()

5.7.1 Basic modication

  • You can modify a column
diamonds %>%
  mutate(cut = factor(cut, labels = LETTERS[5:1])) # modifying labels of cut
# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 A     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 B     E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 D     E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 B     I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 D     J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 C     J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 C     I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 C     H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 E     E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 C     H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows
  • The same function also creates columns
diamonds %>%
  mutate(z_square = z^2) # creates a new column as function of an existing one
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z z_square
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43     5.90
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31     5.34
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31     5.34
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63     6.92
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75     7.56
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48     6.15
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47     6.10
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53     6.40
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49     6.20
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39     5.71
# … with 53,930 more rows
  • You can create/modify multiple columns at once
  • This will be done sequentially!
diamonds %>%
  mutate(cut = factor(cut, labels = LETTERS[5:1]), # modifying labels of cut
         y = 2 * y, # doubling the width
         z_square = z^2, # creating a column z_square
         cut = z_square) # This modification overrides the first one
# A tibble: 53,940 x 11
   carat   cut color clarity depth table price     x     y     z z_square
   <dbl> <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23  5.90 E     SI2      61.5    55   326  3.95  7.96  2.43     5.90
 2  0.21  5.34 E     SI1      59.8    61   326  3.89  7.68  2.31     5.34
 3  0.23  5.34 E     VS1      56.9    65   327  4.05  8.14  2.31     5.34
 4  0.29  6.92 I     VS2      62.4    58   334  4.2   8.46  2.63     6.92
 5  0.31  7.56 J     SI2      63.3    58   335  4.34  8.7   2.75     7.56
 6  0.24  6.15 J     VVS2     62.8    57   336  3.94  7.92  2.48     6.15
 7  0.24  6.10 I     VVS1     62.3    57   336  3.95  7.96  2.47     6.10
 8  0.26  6.40 H     SI1      61.9    55   337  4.07  8.22  2.53     6.40
 9  0.22  6.20 E     VS2      65.1    61   337  3.87  7.56  2.49     6.20
10  0.23  5.71 H     VS1      59.4    61   338  4     8.1   2.39     5.71
# … with 53,930 more rows

5.7.2 Modifying columns on a condition mutate_if()

  • Again, you can change on a condition mutate_if
diamonds %>%
  mutate_if(is.factor, as.character) # Transform factor columns to character
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows
  • Check the type of cut color and clarity!

5.7.3 Modifying columns on a characteristic mutate_at()

  • And on position, mutate_at
diamonds %>%
  mutate_at(vars(starts_with("c")),
            .funs = function(x) 0) # Put columns to 0
# A tibble: 53,940 x 10
   carat   cut color clarity depth table price     x     y     z
   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1     0     0     0       0  61.5    55   326  3.95  3.98  2.43
 2     0     0     0       0  59.8    61   326  3.89  3.84  2.31
 3     0     0     0       0  56.9    65   327  4.05  4.07  2.31
 4     0     0     0       0  62.4    58   334  4.2   4.23  2.63
 5     0     0     0       0  63.3    58   335  4.34  4.35  2.75
 6     0     0     0       0  62.8    57   336  3.94  3.96  2.48
 7     0     0     0       0  62.3    57   336  3.95  3.98  2.47
 8     0     0     0       0  61.9    55   337  4.07  4.11  2.53
 9     0     0     0       0  65.1    61   337  3.87  3.78  2.49
10     0     0     0       0  59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

5.8 Sorting data: arrange()

diamonds %>%
  arrange(carat, depth) # Sorting on carat, then on depth
# A tibble: 53,940 x 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1   0.2 Premium E     VS2      59      60   367  3.81  3.78  2.24
 2   0.2 Premium E     VS2      59.7    62   367  3.84  3.8   2.28
 3   0.2 Ideal   E     VS2      59.7    55   367  3.86  3.84  2.3 
 4   0.2 Premium E     VS2      59.8    62   367  3.79  3.77  2.26
 5   0.2 Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 6   0.2 Premium E     VS2      61.1    59   367  3.81  3.78  2.32
 7   0.2 Ideal   D     VS2      61.5    57   367  3.81  3.77  2.33
 8   0.2 Premium D     VS2      61.7    60   367  3.77  3.72  2.31
 9   0.2 Ideal   E     VS2      62.2    57   367  3.76  3.73  2.33
10   0.2 Premium D     VS2      62.3    60   367  3.73  3.68  2.31
# … with 53,930 more rows
# Decreasing sort uses the function desc()
diamonds %>%
  arrange(desc(carat, depth)) # Decreasing on carat, increasing on depth
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72
 3  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43
 4  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
 5  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
 6  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
 7  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13
 8  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38
 9  3.51 Premium   J     VS2      62.5    59 18701  9.66  9.63  6.03
10  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
# … with 53,930 more rows

5.9 Summarising data summarise()

  • Useful to summarize a column to a single number.
  • Returns a tibble.
diamonds %>%
  summarise(mean_carat = mean(carat),
            var_carat = var(carat),
            number_diamonds = n(), # dplyr::n() equivalent to nrow(.)
            number_distinct_carats = n_distinct(carat))# ddplyr::n_distinct()
# A tibble: 1 x 4
  mean_carat var_carat number_diamonds number_distinct_carats
       <dbl>     <dbl>           <int>                  <int>
1      0.798     0.225           53940                    273

5.9.1 Summarising columns on condition summarise_if()

  • Of course, summarise_if() and summarise_at() exists!
# Compute the mean of all numeric variables
diamonds %>%
  summarise_if(is.numeric, mean)
# A tibble: 1 x 7
  carat depth table price     x     y     z
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798  61.7  57.5 3933.  5.73  5.73  3.54
diamonds %>%
  summarise_at(.vars = vars(starts_with("c")), # For columns starting by c 
               # count number of unique elements in the column
               .funs = list(n_unique = function(x) length(unique(x))))
# A tibble: 1 x 4
  carat_n_unique cut_n_unique color_n_unique clarity_n_unique
           <int>        <int>          <int>            <int>
1            273            5              7                8

5.10 Grouping data group_by()

  • Often, you might compute statistics depending on groups.
  • group_by wiill allow to perform the summarise per factor of a group
diamonds %>%
  group_by(cut) %>% # depending on the cut
  summarise(Count = n(),# Number of diamonds per group
            Mean_price = mean(price),# Mean price per group
            Mean_carat = var(carat))
# A tibble: 5 x 4
  cut       Count Mean_price Mean_carat
  <ord>     <int>      <dbl>      <dbl>
1 Fair       1610      4359.      0.267
2 Good       4906      3929.      0.206
3 Very Good 12082      3982.      0.211
4 Premium   13791      4584.      0.265
5 Ideal     21551      3458.      0.187
  • In practice group_by changes the way dplyr functions operate on the data. They will operate groupwise.

5.11 Exercise

In the diamonds data set

  1. Rename the x, y, z variables as length, width, height.
  2. Put their values in cm instead of mm.
  3. Create an object diamonds_modif containing these three columns and the clarity, cut, carat and price, when for the cut is Premium or Ideal.
  4. For each combination cut/color, compute the mean of the three variables length, width, height. Keep this summary in an object dim_summary.

5.12 Exercise (Solution)

  • First we create a diamonds_modif as wanted.
diamonds_modif <- diamonds %>%
  rename(length = x, width = y, height = z) %>%
  mutate_at(.vars = c("length", "width", "height"), # Change this column
            .funs = function(x) x / 10) %>% # Applying them the same function
  filter(cut %in% c("Premium", "Ideal")) %>% # Filtering on the cut
  select(-depth, -color, -table) # Exclude non wanted columns

Then, we use summarise() and a group_by!

dim_summary <- diamonds_modif %>%
  group_by(cut, clarity) %>%
  summarise_at(c("length", "width", "height"), mean)
# A tibble: 16 x 5
# Groups:   cut [2]
   cut     clarity length width height
   <ord>   <ord>    <dbl> <dbl>  <dbl>
 1 Premium I1       0.684 0.679  0.413
 2 Premium SI2      0.657 0.654  0.400
 3 Premium SI1      0.605 0.601  0.369
 4 Premium VS2      0.583 0.581  0.357
 5 Premium VS1      0.574 0.571  0.351
 6 Premium VVS2     0.539 0.537  0.330
 7 Premium VVS1     0.504 0.502  0.308
 8 Premium IF       0.523 0.522  0.319
 9 Ideal   I1       0.675 0.674  0.416
10 Ideal   SI2      0.626 0.627  0.387
11 Ideal   SI1      0.578 0.579  0.357
12 Ideal   VS2      0.543 0.543  0.335
13 Ideal   VS1      0.545 0.547  0.337
14 Ideal   VVS2     0.521 0.523  0.322
15 Ideal   VVS1     0.496 0.498  0.306
16 Ideal   IF       0.483 0.485  0.298

6 Manipulating multiple tables

6.1 Multiple tables

  • Consider two different sources of data
  • Don’t hesitate to run the code to see what they look like!
# If required
# install.packages("nycflights13")
# Data on New York city flights and airports
library(nycflights13) # New York city flights of 2013 data set
Error in library(nycflights13): aucun package nommé 'nycflights13' n'est trouvé
table_1 <- airports %>%
  select(name, lon, lat) %>% # Select those 3 columns
  slice(1:2) # 2 first rows
Error in select(., name, lon, lat): objet 'airports' introuvable
table_2 <- airports %>%
  select(name, lon, lat) %>% # Same columns as table_1, different rows
  slice(6:7)
Error in select(., name, lon, lat): objet 'airports' introuvable
table_3 <- airports %>%
  select(alt, tz) %>%
  slice(1:2) # Same rows as table_1, different columns
Error in select(., alt, tz): objet 'airports' introuvable

6.2 Binding tables bind_rows() and bind_cols()

table_1  %>% # Binding table_2 below table_1
  bind_rows(table_2) # Column names must match!
Error in list2(...): objet 'table_1' introuvable
table_1 %>% # Binding table_3 next to table_1
  bind_cols(table_3) # Number of rows must match!
Error in list2(...): objet 'table_1' introuvable

6.3 Joining tables

6.3.1 Left jointure left_join()

  • Suppose we have the two following tables
drivers <- tibble(name = c("Sue", "Sue", "Marc", "Gunter", "Rayan", "Rayan"),
                    car = c("Clio", "ZX", "AX", "Lada", "Twingo", "Clio"))
vehicles <- tibble(car = c("Twingo", "Ferrari", "Clio", "Lada", "ZX"),
                   speed = c("140", "280", "160", "85", "160"))
drivers
# A tibble: 6 x 2
  name   car   
  <chr>  <chr> 
1 Sue    Clio  
2 Sue    ZX    
3 Marc   AX    
4 Gunter Lada  
5 Rayan  Twingo
6 Rayan  Clio  
vehicles
# A tibble: 5 x 2
  car     speed
  <chr>   <chr>
1 Twingo  140  
2 Ferrari 280  
3 Clio    160  
4 Lada    85   
5 ZX      160  
  • We would like to joint the tables to obtain the speed for each driver
left_join(drivers, vehicles, by = "car")
# A tibble: 6 x 3
  name   car    speed
  <chr>  <chr>  <chr>
1 Sue    Clio   160  
2 Sue    ZX     160  
3 Marc   AX     <NA> 
4 Gunter Lada   85   
5 Rayan  Twingo 140  
6 Rayan  Clio   160  
  • As no record was given for the speed of an AX, an NA is inserted.
  • The Ferrari car is omitted as it was not present in the first table
  • Watch out:left_join() is not symmetric
left_join(vehicles, drivers, by = "car")
# A tibble: 6 x 3
  car     speed name  
  <chr>   <chr> <chr> 
1 Twingo  140   Rayan 
2 Ferrari 280   <NA>  
3 Clio    160   Sue   
4 Clio    160   Rayan 
5 Lada    85    Gunter
6 ZX      160   Sue   
  • left_join(x, y) is equivalent to right_join(y, x)

6.3.2 Inner jointures inner_join()

  • inner_join() only keeps lines where info is present on both tables
inner_jointure <- inner_join(drivers, vehicles, by = "car")
  • Both the Marc and Ferrari lines are missing.

6.3.3 Full jointures full_join()

  • full_join() keeps all lines, putting NA if necessary
full_jointure <- full_join(drivers, vehicles, by = "car")
  • Both the Marcand Ferrari lines are present, with NA.

6.4 Exercise

For the diamonds_modif object, compute, for each combination cut/clarity, the number of observations per group. Include it in the table dim_summary created in the previous exercise (previous section).

6.5 Exercise (Solution)

diamonds_modif %>%
  group_by(cut, clarity) %>%
  summarise(Nb_obs = n()) %>%
  left_join(dim_summary, by = c("cut", "clarity"))
# A tibble: 16 x 6
# Groups:   cut [2]
   cut     clarity Nb_obs length width height
   <ord>   <ord>    <int>  <dbl> <dbl>  <dbl>
 1 Premium I1         205  0.684 0.679  0.413
 2 Premium SI2       2949  0.657 0.654  0.400
 3 Premium SI1       3575  0.605 0.601  0.369
 4 Premium VS2       3357  0.583 0.581  0.357
 5 Premium VS1       1989  0.574 0.571  0.351
 6 Premium VVS2       870  0.539 0.537  0.330
 7 Premium VVS1       616  0.504 0.502  0.308
 8 Premium IF         230  0.523 0.522  0.319
 9 Ideal   I1         146  0.675 0.674  0.416
10 Ideal   SI2       2598  0.626 0.627  0.387
11 Ideal   SI1       4282  0.578 0.579  0.357
12 Ideal   VS2       5071  0.543 0.543  0.335
13 Ideal   VS1       3589  0.545 0.547  0.337
14 Ideal   VVS2      2606  0.521 0.523  0.322
15 Ideal   VVS1      2047  0.496 0.498  0.306
16 Ideal   IF        1212  0.483 0.485  0.298

7 Cleaning and transforming data

When you have a dataset, there are three interrelated rules which make your dataset tidy:

  1. Each variable must have its own column

  2. Each observation must have its own row

  3. Each value must have its own cell

This mean that you must have a clear idea of what is a variable, an observation, or even a value!

7.1 A toy example: grades dataset

grades <- tibble(
  Name = c("Tommy", "Mary", "Gary", "Cathy"),
  Sex_age = c("m_15", "f_15", "m_16", "f_14"),
  Test1 = c(10, 15, 16, 14),
  Test2 = c(11, 13, 10, 12),
  Test3 = c(12, 13, 17, 10)
  )
Name Sex_age Test1 Test2 Test3
Tommy m_15 10 11 12
Mary f_15 15 13 13
Gary m_16 16 10 17
Cathy f_14 14 12 10

One can spot multiple potential problems:

  • The Sex_age column gathers two variables
  • There are three variables tests, which can be what we want, but we might want a variable Grade whose value is a grade (and might depend on which test was performed!)

7.2 Splitting a column into 2: separate()

  • The Sex_age column should be two columns (two variables)
grades_s_a <- # grades with two columns sex age
  grades %>%
  separate(Sex_age, into = c("Sex", "Age"), sep = "_")
print(grades_s_a)
# A tibble: 4 x 6
  Name  Sex   Age   Test1 Test2 Test3
  <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Tommy m     15       10    11    12
2 Mary  f     15       15    13    13
3 Gary  m     16       16    10    17
4 Cathy f     14       14    12    10
  • The sep argument can handle regular expressions.
  • Not trivial but really powerful tool for string manipulation.

7.3 Merging two columns: unite()

  • The inverse of separate is unite()
grades_s_a  %>%
  unite(col = "Sex_age", Sex, Age, sep = "_") %>%
  knitr::kable(format = "html")
Name Sex_age Test1 Test2 Test3
Tommy m_15 10 11 12
Mary f_15 15 13 13
Gary m_16 16 10 17
Cathy f_14 14 12 10

7.4 Gathering columns: pivot_longer()

  • Suppose that the observation of interest is the grade
  • In grade_s_a, the a same variable (the grade) is coded in three columns.
  • We might gather them in one column, and keep test number in one column.
  • The table will be a in long format that might be better suited for our use.
  • We use the function pivot_longer.
grades_long <- grades_s_a %>%
  pivot_longer(starts_with("Test"), # Gathered columns
               names_to = "Test", # Column gathering old column names
               values_to = "Grade") # Column gathering values
grades_long
# A tibble: 12 x 5
   Name  Sex   Age   Test  Grade
   <chr> <chr> <chr> <chr> <dbl>
 1 Tommy m     15    Test1    10
 2 Tommy m     15    Test2    11
 3 Tommy m     15    Test3    12
 4 Mary  f     15    Test1    15
 5 Mary  f     15    Test2    13
 6 Mary  f     15    Test3    13
 7 Gary  m     16    Test1    16
 8 Gary  m     16    Test2    10
 9 Gary  m     16    Test3    17
10 Cathy f     14    Test1    14
11 Cathy f     14    Test2    12
12 Cathy f     14    Test3    10
  • This format is often better for visualisation with ggplot.
  • The dimension of grades_gathered is now \(12\times 5\).
  • Alternative code not enumerating all columns to gather, but excluding non gathered ones
grades_long <- grades_s_a %>%
  pivot_longer(cols = -c("Name", "Age", "Sex"), # non gathered columns
               names_to = "Test", # Column gathering old column names
               values_to = "Grade") # Column gathering values
grades_long
# A tibble: 12 x 5
   Name  Sex   Age   Test  Grade
   <chr> <chr> <chr> <chr> <dbl>
 1 Tommy m     15    Test1    10
 2 Tommy m     15    Test2    11
 3 Tommy m     15    Test3    12
 4 Mary  f     15    Test1    15
 5 Mary  f     15    Test2    13
 6 Mary  f     15    Test3    13
 7 Gary  m     16    Test1    16
 8 Gary  m     16    Test2    10
 9 Gary  m     16    Test3    17
10 Cathy f     14    Test1    14
11 Cathy f     14    Test2    12
12 Cathy f     14    Test3    10

7.4.1 Why longer format?

  • The longer format is (often) more convenient for summarizing.
  • If we want the mean grade per test, for each sex
grades_long %>% # Initial data
  group_by(Sex, Test) %>% # Grouping of interest
  summarise(Mean = mean(Grade)) # Get the summary of interest, per group
# A tibble: 6 x 3
# Groups:   Sex [2]
  Sex   Test   Mean
  <chr> <chr> <dbl>
1 f     Test1  14.5
2 f     Test2  12.5
3 f     Test3  11.5
4 m     Test1  13  
5 m     Test2  10.5
6 m     Test3  14.5

7.5 Get a wider table pivot_wider()

  • The inverse of pivot_longer is pivot_wider()
  • It spreads a column into multiples, according to a key column
# The sep argument is used to give correct names to column
grades_long %>%
  pivot_wider(names_from = "Test", values_from = "Grade")
# A tibble: 4 x 6
  Name  Sex   Age   Test1 Test2 Test3
  <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Tommy m     15       10    11    12
2 Mary  f     15       15    13    13
3 Gary  m     16       16    10    17
4 Cathy f     14       14    12    10

7.6 Input missing data: complete()

  • Sometimes, when a value is missing, it is not recorded
notes <- tibble(Nom = c("Alain", "Alain", "Benoit", "Claire"),
                Discipline = c("Maths", "Francais", "Maths", "Francais"),
                Note = c(16, 9, 17, 11),
                Present = rep("oui", 4))
Nom Discipline Note Present
Alain Maths 16 oui
Alain Francais 9 oui
Benoit Maths 17 oui
Claire Francais 11 oui
  • Benoit has no note in French , Claire has no note in maths.
  • You might want to complete the data such that each Name/Discipline has a note.
notes %>%
  complete(Nom, Discipline)
# A tibble: 6 x 4
  Nom    Discipline  Note Present
  <chr>  <chr>      <dbl> <chr>  
1 Alain  Francais       9 oui    
2 Alain  Maths         16 oui    
3 Benoit Francais      NA <NA>   
4 Benoit Maths         17 oui    
5 Claire Francais      11 oui    
6 Claire Maths         NA <NA>   
  • The default assignment is NA.
  • You can assign any value to missing data
notes_completes <- notes %>%
  complete(Nom, Discipline,
           fill = list(Note = 0, Present = "non")) # Custom the value per column
Nom Discipline Note Present
Alain Francais 9 oui
Alain Maths 16 oui
Benoit Francais 0 non
Benoit Maths 17 oui
Claire Francais 11 oui
Claire Maths 0 non

7.7 Handling missing data: replace_na

Consider the following table which as NA

donnees_na <- tibble(Groupe = rep(c("A", "B"), each = 3),
       Nom = c("Al", "Bob", NA, "Dave", "Elle", "Fanch"),
       Note = c(NA, 8, 7, 4.5, 1, 4))
donnees_na
# A tibble: 6 x 3
  Groupe Nom    Note
  <chr>  <chr> <dbl>
1 A      Al     NA  
2 A      Bob     8  
3 A      <NA>    7  
4 B      Dave    4.5
5 B      Elle    1  
6 B      Fanch   4  
  • Suppose we want to assign values to NA.
  • Use replace_na.
donnees_na %>% 
  replace_na(replace = list(Nom ="Unknown",
                            Note = 0)) # Assign same value for all columns
# A tibble: 6 x 3
  Groupe Nom      Note
  <chr>  <chr>   <dbl>
1 A      Al        0  
2 A      Bob       8  
3 A      Unknown   7  
4 B      Dave      4.5
5 B      Elle      1  
6 B      Fanch     4  
  • This syntax might not be well suited if many columns are present
  • By combining with mutate_if, one can do several things

7.7.1 Replace NA based on a condition

# Replace NA only in numeric column
donnees_na %>% 
  mutate_if(is.numeric, # We change only if it's numeric
            # The treatment is made column wise
            function(colonne) replace_na(colonne, replace = 0))
# A tibble: 6 x 3
  Groupe Nom    Note
  <chr>  <chr> <dbl>
1 A      Al      0  
2 A      Bob     8  
3 A      <NA>    7  
4 B      Dave    4.5
5 B      Elle    1  
6 B      Fanch   4  

7.7.2 Replace NA by mean of the column

  • It is standard to replace by the mean of the column
# Replace NA by the mean of the numeric column
donnees_na %>% 
  mutate_if(is.numeric, # We change only if it's numeric
            # The treatment is made column wise
            function(colonne) replace_na(colonne, 
                                         replace = mean(colonne, na.rm = TRUE)))
# A tibble: 6 x 3
  Groupe Nom    Note
  <chr>  <chr> <dbl>
1 A      Al      4.9
2 A      Bob     8  
3 A      <NA>    7  
4 B      Dave    4.5
5 B      Elle    1  
6 B      Fanch   4  
  • Of course, by combining with group_by, you can replace the NA by the mean of the group (here, the group A)
# Replace NA by the group mean
donnees_na %>% 
  group_by(Groupe) %>% # We first group by the required group
  # And then apply the same treatment
  mutate_if(is.numeric, # We change only if it's numeric
            # The treatment is made column wise
            function(colonne) replace_na(colonne, 
                                         replace = mean(colonne, na.rm = TRUE)))
# A tibble: 6 x 3
# Groups:   Groupe [2]
  Groupe Nom    Note
  <chr>  <chr> <dbl>
1 A      Al      7.5
2 A      Bob     8  
3 A      <NA>    7  
4 B      Dave    4.5
5 B      Elle    1  
6 B      Fanch   4  

7.8 Exercise

From the diamonds_modif table of the previous exercise, create the diamonds_gathered table having the following the following column structure:

# A tibble: 5 x 5
  price cut     clarity Dimension Value
  <int> <ord>   <ord>   <chr>     <dbl>
1 10096 Premium SI1     width     0.731
2   655 Ideal   VVS1    width     0.439
3   593 Premium SI2     height    0.264
4   765 Ideal   VS1     width     0.443
5  3543 Ideal   VS2     length    0.573

7.9 Exercise (solution)

diamonds_gathered <- diamonds_modif %>%
  select(price, cut, clarity, length, width, height) %>%
  gather(key = "Dimension", value = "Value", length, width, height)

8 Dealing with characters

8.1 stringr package

The stringr provides powerful functions to handle characters.

For instance, let’s consider the following data.

For a full cover of all these possibilities, there are are lot of dedicated pages, for instance, this one.

example <- tibble(Nom = c("Mr_Al_Bob", "Mr_Bob_Col"),
                  Age = c(42, 41))

8.2 Replacing character pattern

  • Say we want to replace Mr by a M.
  • Use the str_replace function
example %>% 
  mutate(Nom = str_replace(Nom, pattern = "Mr", replacement = "M."))
# A tibble: 2 x 2
  Nom          Age
  <chr>      <dbl>
1 M._Al_Bob     42
2 M._Bob_Col    41
  • If we want to replace all the _ by a space
example %>% 
  mutate(Nom = str_replace_all(Nom, pattern = c("_"), replacement = c(" ")))
# A tibble: 2 x 2
  Nom          Age
  <chr>      <dbl>
1 Mr Al Bob     42
2 Mr Bob Col    41
  • We can of course remove the Mr in front, with str_remove
example %>% 
  mutate(Nom = str_replace_all(Nom, pattern = c("_"), replacement = c(" ")),
         Nom = str_remove(Nom, pattern = "Mr "))
# A tibble: 2 x 2
  Nom       Age
  <chr>   <dbl>
1 Al Bob     42
2 Bob Col    41

9 Dealing with factors

9.1 Transforming numeric in factor

Let’s consider the following table

athletes <- tibble(Name = c("Alice", "Bob", "Charles", "Dan", "Elsa", "Fanch"),
                   Performance = c(15, 10, 5, 18, 11, 4),
                   Aptitude = c("Strong", "Intermediate", "Weak", 
                             "Strong", "Intermediate", "Weak"),
                   Qualified = c(1, 1, 0, 1, 1, 0))
athletes
# A tibble: 6 x 4
  Name    Performance Aptitude     Qualified
  <chr>         <dbl> <chr>            <dbl>
1 Alice            15 Strong               1
2 Bob              10 Intermediate         1
3 Charles           5 Weak                 0
4 Dan              18 Strong               1
5 Elsa             11 Intermediate         1
6 Fanch             4 Weak                 0

The column Qualified is here considered a a numeric, whereas it’s a qualitative, 1 meaning “yes”, 0 meaning no.

This can pose problem either for graphical representation or, worse, for model fitting in R.

# We want a boxplot, with one box per level of Qualified
ggplot(athletes) +
  aes(x = Qualified, y = Performance) + 
  geom_boxplot() # Do not work, as Qualified is coded as numeric!
Warning: Continuous x aesthetic -- did you forget aes(group=...)?

To change it, we will change (using mutate) the column to factor.

athletes <- athletes %>% 
  mutate(Qualified = factor(Qualified)) # Changed to qualified
ggplot(athletes) +
  aes(x = Qualified, y = Performance) + 
  geom_boxplot() # Now it works

9.2 Changing factor labelling

The factor Qualified now has two levels, which are 0 and 1. In the graph above, it would be way better if 0/1 was understandable by a user. We can change it using the labels argument to relabel the levels. Watch out, the labels must be specified in the same order than the levels

athletes %>% 
  mutate(Qualified = factor(Qualified, 
                            levels = c(0, 1), # Specify the level order
                            labels = c("No", "Yes"))) %>% # And rename in the same order
  ggplot() +
  aes(x = Qualified, y = Performance) + 
  geom_boxplot() # Now it works

9.3 Change characters in factor

For plots or model, there is no absolute need to transform characters in factor, as R understands that it is qualitative attribute.

# A boxplot of performance per Aptitude
ggplot(athletes) +
  aes(x = Aptitude, y = Performance) + 
  geom_boxplot()

Here, the order of the levels is given by alphabetical order, and is not natural. We then can transform Aptitude to factor to specify the right levels order.

athletes %>% 
  mutate(Aptitude = factor(Aptitude,
                           # Now specify the levels order
                           # Caution, the levels given must appear in the data!
                           # f you want to change labels, use labels = ...
                           levels = c("Weak", "Intermediate", "Strong"))) %>% 
# A boxplot of performance per Aptitude
  ggplot() +
  aes(x = Aptitude, y = Performance) + 
  geom_boxplot() # The levels are now in natural order