R
R
R for Data Science (G. Grolemund and H. Wickham)
R Markdown: The definitive guide (Y. Xie, J.J. Allaire and G. Grolemund)
ggplot2: Elegant Graphics for Data Analysis (H. Wickham)
Les deux premiers sont disponible en ligne
Set of packages for tidy and unified data processing, visualization and modelling.
Consists in a set of method having unified structure.
Involves slight modifications in classical R
grammar.
This grammar becomes dominant, and highly documented.
library(tidyverse)
tibble
formattibble
, a nicer data.frame
tibble
is a “modern” version of the data.frame
<- as_tibble(iris) # Natural conversion iris_tibble
iris_tibble
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
The function tibble
creates naturally a tibble.
<- tibble(Nom = c("Alice", "Bob", "Claire"),
my_tibble Age = c(10, 25, 30))
It can be neatly printed with knitr::kable
, like a data.frame
my_tibble
# A tibble: 3 x 2
Nom Age
<chr> <dbl>
1 Alice 10
2 Bob 25
3 Claire 30
Exactly as in data.frame
1] # Returns a tibble
my_tibble["Nom"]# Returns a tibble. Equivalent to my_tibble["Nom"]
my_tibble[, # For a data.frame, this last line would return a vector, not a data.frame
To extract columns, the pull
function is introduced (also works with data.frame
).
::pull(my_tibble, Nom)# The pull function extracts a column to a vector dplyr
[1] "Alice" "Bob" "Claire"
$Nom # But the old fashion way still works my_tibble
[1] "Alice" "Bob" "Claire"
tibble
For programming reasons, tibble
can’t have row names
head(swiss[, 1:3], n = 2)
Fertility Agriculture Examination
Courtelary 80.2 17.0 15
Delemont 83.1 45.1 6
print(as_tibble(swiss[, 1:3]), n = 2)
# A tibble: 47 x 3
Fertility Agriculture Examination
<dbl> <dbl> <int>
1 80.2 17 15
2 83.1 45.1 6
# … with 45 more rows
print(as_tibble(swiss[, 1:3], rownames = "Province"), n = 2) # Creating a new column
# A tibble: 47 x 4
Province Fertility Agriculture Examination
<chr> <dbl> <dbl> <int>
1 Courtelary 80.2 17 15
2 Delemont 83.1 45.1 6
# … with 45 more rows
%>%
For sequential data processingFirst, we load a toy data set
data(diamonds) # Load the diamond data set
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
NA
s) price of diamonds having a Good
cut and a color of type D
.There is the one-line way, rather cumbersome
mean(diamonds[diamonds$cut == "Good" & diamonds$color == "D", ]$price, na.rm = T)
[1] 3405.382
There is the way in multiple lines, creating intermediary objects
<- diamonds$cut == "Good" & diamonds$color == "D"
condition <- diamonds[condition, ]$price # could have use pull
sub_prices mean(sub_prices, na.rm = T)
%>%
The “pipe” instruction %>%
allows to write this sequential instruction in an easy to read way, without creating intermediary objects.
x %>% f()
is equivalent to f(x)
x %>% f(y)
is equivalent to f(x, y)
When you read code, %>%
is pronounced “then”
# Same example as before
%>% # We take the data, then
diamonds ::filter(cut == "Good", color == "D") %>% # Subsetting (using dplyr)
dplyr::pull(price) %>% # extract the price
dplyrmean(na.rm = T) # compute the mean, omitting the NAs
[1] 3405.382
$
Rstudio
for %>%
is Ctrl + Maj + M
%>%
If you need to specify specifically which argument you want your input to go to , you can use the dot “.”
# Adjusting a linear regression on diamonds price w.r.t. carat
%>%
diamonds lm(data = ., formula = price ~ carat) # The "." refers to entering argument
Some advices (from the bible of tidyverse)
Use %>%
to emphasize a sequence of actions, rather than the object that the actions are being performed on.
Avoid using the pipe when:
dplyr
packagelibrary(dplyr)
dplyr
is a package (part of tidyverse
) which allows you to solve the vast majority of your data-manipulation challenge:
create variables
pick variables
reorder observations
pick observations
create summaries
…
Functions in this package are verbs and have consistent structures.
slice()
slice()
tibble
(or a data.frame
)%>%
diamonds slice(c(3, 8)) # Select lines 3 and 8
# A tibble: 2 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
filter()
dplyr::filter
allows to return rows with matching conditionsdata(diamonds, package = "ggplot2") # data set use
%>%
diamonds ::filter(color == "D", clarity == "SI1")# the , is equivalent to an "&" dplyr
# A tibble: 2,083 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
2 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
3 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
4 0.75 Very Good D SI1 63.2 56 2760 5.8 5.75 3.65
5 0.71 Very Good D SI1 63.6 58 2764 5.64 5.68 3.6
6 0.71 Ideal D SI1 61.9 59 2764 5.69 5.72 3.53
7 0.73 Very Good D SI1 60.2 56 2768 5.83 5.87 3.52
8 0.7 Very Good D SI1 61.1 58 2768 5.66 5.73 3.48
9 0.72 Ideal D SI1 60.8 57 2782 5.76 5.75 3.5
10 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
# … with 2,073 more rows
# Equivalent to the one before
%>%
diamonds ::filter(color == "D" & clarity == "SI1") dplyr
# A tibble: 2,083 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
2 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
3 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
4 0.75 Very Good D SI1 63.2 56 2760 5.8 5.75 3.65
5 0.71 Very Good D SI1 63.6 58 2764 5.64 5.68 3.6
6 0.71 Ideal D SI1 61.9 59 2764 5.69 5.72 3.53
7 0.73 Very Good D SI1 60.2 56 2768 5.83 5.87 3.52
8 0.7 Very Good D SI1 61.1 58 2768 5.66 5.73 3.48
9 0.72 Ideal D SI1 60.8 57 2782 5.76 5.75 3.5
10 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
# … with 2,073 more rows
# One with a "or" (using the in)
%>%
diamonds ::filter(color %in% c("D", "E"), clarity == "SI1") dplyr
# A tibble: 4,509 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
4 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
5 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
6 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
7 0.75 Very Good D SI1 63.2 56 2760 5.8 5.75 3.65
8 0.75 Premium E SI1 59.9 54 2760 6 5.96 3.58
9 0.74 Ideal E SI1 62.3 54 2762 5.8 5.83 3.62
10 0.71 Very Good D SI1 63.6 58 2764 5.64 5.68 3.6
# … with 4,499 more rows
select()
select
works on column number.%>%
diamonds select(1, 4) # Selections des colonnes 1 et 4
# A tibble: 53,940 x 2
carat clarity
<dbl> <ord>
1 0.23 SI2
2 0.21 SI1
3 0.23 VS1
4 0.29 VS2
5 0.31 SI2
6 0.24 VVS2
7 0.24 VVS1
8 0.26 SI1
9 0.22 VS2
10 0.23 VS1
# … with 53,930 more rows
select
works on column names.%>%
diamonds select(carat, cut, depth)
# A tibble: 53,940 x 3
carat cut depth
<dbl> <ord> <dbl>
1 0.23 Ideal 61.5
2 0.21 Premium 59.8
3 0.23 Good 56.9
4 0.29 Premium 62.4
5 0.31 Good 63.3
6 0.24 Very Good 62.8
7 0.24 Very Good 62.3
8 0.26 Very Good 61.9
9 0.22 Fair 65.1
10 0.23 Very Good 59.4
# … with 53,930 more rows
One can remove column using select
, by adding -
in front of the column name:
%>%
diamonds select(-carat, -cut) # Select all columns but carat and cut
# A tibble: 53,940 x 8
color clarity depth table price x y z
<ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 E SI2 61.5 55 326 3.95 3.98 2.43
2 E SI1 59.8 61 326 3.89 3.84 2.31
3 E VS1 56.9 65 327 4.05 4.07 2.31
4 I VS2 62.4 58 334 4.2 4.23 2.63
5 J SI2 63.3 58 335 4.34 4.35 2.75
6 J VVS2 62.8 57 336 3.94 3.96 2.48
7 I VVS1 62.3 57 336 3.95 3.98 2.47
8 H SI1 61.9 55 337 4.07 4.11 2.53
9 E VS2 65.1 61 337 3.87 3.78 2.49
10 H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
select_if()
select_if
%>%
diamonds select_if(.predicate = is.factor) # Select only factor columns, the .predicate is optional
# A tibble: 53,940 x 3
cut color clarity
<ord> <ord> <ord>
1 Ideal E SI2
2 Premium E SI1
3 Good E VS1
4 Premium I VS2
5 Good J SI2
6 Very Good J VVS2
7 Very Good I VVS1
8 Very Good H SI1
9 Fair E VS2
10 Very Good H VS1
# … with 53,930 more rows
%>%
diamonds select_if(is.numeric) # Select only numeric columns
# A tibble: 53,940 x 7
carat depth table price x y z
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 61.5 55 326 3.95 3.98 2.43
2 0.21 59.8 61 326 3.89 3.84 2.31
3 0.23 56.9 65 327 4.05 4.07 2.31
4 0.29 62.4 58 334 4.2 4.23 2.63
5 0.31 63.3 58 335 4.34 4.35 2.75
6 0.24 62.8 57 336 3.94 3.96 2.48
7 0.24 62.3 57 336 3.95 3.98 2.47
8 0.26 61.9 55 337 4.07 4.11 2.53
9 0.22 65.1 61 337 3.87 3.78 2.49
10 0.23 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
select_at()
select_at
. It uses the dplyr
function vars
that allows to enclose text expressions.%>%
diamonds select_at(.vars = vars(starts_with("c"))) # Select columns starting with "c"
# A tibble: 53,940 x 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# … with 53,930 more rows
pull()
%>%
diamonds pull(cut) %>% # extract the cut variable by name as a vector
head(n = 5) # only print first 5 values
[1] Ideal Premium Good Premium Good
Levels: Fair < Good < Very Good < Premium < Ideal
%>%
diamonds pull(2) %>% # extract the cut by number variable as a vector
head(n = 5)
[1] Ideal Premium Good Premium Good
Levels: Fair < Good < Very Good < Premium < Ideal
rename()
%>%
diamonds rename(length = x, # Always New = old
width = y) # You can rename multiple columns at once
# A tibble: 53,940 x 10
carat cut color clarity depth table price length width z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
rename_if()
rename_if
base on a condition# Add the "_fact" suffix to every factor column name
%>%
diamonds rename_if(.predicate = is.factor, .funs = function(x) paste0(x, "_fact"))
# A tibble: 53,940 x 10
carat cut_fact color_fact clarity_fact depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
rename_at()
rename_at()
# Putting in upper case columns starting with "c"
%>%
diamonds rename_at(.vars = vars(starts_with("c")), .funs = toupper)
# A tibble: 53,940 x 10
CARAT CUT COLOR CLARITY depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
mutate()
%>%
diamonds mutate(cut = factor(cut, labels = LETTERS[5:1])) # modifying labels of cut
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 A E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 B E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 D E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 B I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 D J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 C J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 C I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 C H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 E E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 C H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
%>%
diamonds mutate(z_square = z^2) # creates a new column as function of an existing one
# A tibble: 53,940 x 11
carat cut color clarity depth table price x y z z_square
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 5.90
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 5.34
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 5.34
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 6.92
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 7.56
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 6.15
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 6.10
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 6.40
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 6.20
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 5.71
# … with 53,930 more rows
%>%
diamonds mutate(cut = factor(cut, labels = LETTERS[5:1]), # modifying labels of cut
y = 2 * y, # doubling the width
z_square = z^2, # creating a column z_square
cut = z_square) # This modification overrides the first one
# A tibble: 53,940 x 11
carat cut color clarity depth table price x y z z_square
<dbl> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 5.90 E SI2 61.5 55 326 3.95 7.96 2.43 5.90
2 0.21 5.34 E SI1 59.8 61 326 3.89 7.68 2.31 5.34
3 0.23 5.34 E VS1 56.9 65 327 4.05 8.14 2.31 5.34
4 0.29 6.92 I VS2 62.4 58 334 4.2 8.46 2.63 6.92
5 0.31 7.56 J SI2 63.3 58 335 4.34 8.7 2.75 7.56
6 0.24 6.15 J VVS2 62.8 57 336 3.94 7.92 2.48 6.15
7 0.24 6.10 I VVS1 62.3 57 336 3.95 7.96 2.47 6.10
8 0.26 6.40 H SI1 61.9 55 337 4.07 8.22 2.53 6.40
9 0.22 6.20 E VS2 65.1 61 337 3.87 7.56 2.49 6.20
10 0.23 5.71 H VS1 59.4 61 338 4 8.1 2.39 5.71
# … with 53,930 more rows
mutate_if()
mutate_if
%>%
diamonds mutate_if(is.factor, as.character) # Transform factor columns to character
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
cut
color
and clarity
!mutate_at()
mutate_at
%>%
diamonds mutate_at(vars(starts_with("c")),
.funs = function(x) 0) # Put columns to 0
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0 0 0 0 61.5 55 326 3.95 3.98 2.43
2 0 0 0 0 59.8 61 326 3.89 3.84 2.31
3 0 0 0 0 56.9 65 327 4.05 4.07 2.31
4 0 0 0 0 62.4 58 334 4.2 4.23 2.63
5 0 0 0 0 63.3 58 335 4.34 4.35 2.75
6 0 0 0 0 62.8 57 336 3.94 3.96 2.48
7 0 0 0 0 62.3 57 336 3.95 3.98 2.47
8 0 0 0 0 61.9 55 337 4.07 4.11 2.53
9 0 0 0 0 65.1 61 337 3.87 3.78 2.49
10 0 0 0 0 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
arrange()
%>%
diamonds arrange(carat, depth) # Sorting on carat, then on depth
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
2 0.2 Premium E VS2 59.7 62 367 3.84 3.8 2.28
3 0.2 Ideal E VS2 59.7 55 367 3.86 3.84 2.3
4 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
5 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
6 0.2 Premium E VS2 61.1 59 367 3.81 3.78 2.32
7 0.2 Ideal D VS2 61.5 57 367 3.81 3.77 2.33
8 0.2 Premium D VS2 61.7 60 367 3.77 3.72 2.31
9 0.2 Ideal E VS2 62.2 57 367 3.76 3.73 2.33
10 0.2 Premium D VS2 62.3 60 367 3.73 3.68 2.31
# … with 53,930 more rows
# Decreasing sort uses the function desc()
%>%
diamonds arrange(desc(carat, depth)) # Decreasing on carat, increasing on depth
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
4 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
5 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
6 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
7 3.67 Premium I I1 62.4 56 16193 9.86 9.81 6.13
8 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
9 3.51 Premium J VS2 62.5 59 18701 9.66 9.63 6.03
10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
# … with 53,930 more rows
summarise()
tibble
.%>%
diamonds summarise(mean_carat = mean(carat),
var_carat = var(carat),
number_diamonds = n(), # dplyr::n() equivalent to nrow(.)
number_distinct_carats = n_distinct(carat))# ddplyr::n_distinct()
# A tibble: 1 x 4
mean_carat var_carat number_diamonds number_distinct_carats
<dbl> <dbl> <int> <int>
1 0.798 0.225 53940 273
summarise_if()
summarise_if()
and summarise_at()
exists!# Compute the mean of all numeric variables
%>%
diamonds summarise_if(is.numeric, mean)
# A tibble: 1 x 7
carat depth table price x y z
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 61.7 57.5 3933. 5.73 5.73 3.54
%>%
diamonds summarise_at(.vars = vars(starts_with("c")), # For columns starting by c
# count number of unique elements in the column
.funs = list(n_unique = function(x) length(unique(x))))
# A tibble: 1 x 4
carat_n_unique cut_n_unique color_n_unique clarity_n_unique
<int> <int> <int> <int>
1 273 5 7 8
group_by()
group_by
wiill allow to perform the summarise per factor of a group%>%
diamonds group_by(cut) %>% # depending on the cut
summarise(Count = n(),# Number of diamonds per group
Mean_price = mean(price),# Mean price per group
Mean_carat = var(carat))
# A tibble: 5 x 4
cut Count Mean_price Mean_carat
<ord> <int> <dbl> <dbl>
1 Fair 1610 4359. 0.267
2 Good 4906 3929. 0.206
3 Very Good 12082 3982. 0.211
4 Premium 13791 4584. 0.265
5 Ideal 21551 3458. 0.187
group_by
changes the way dplyr
functions operate on the data. They will operate groupwise.In the diamonds
data set
x
, y
, z
variables as length
, width
, height
.diamonds_modif
containing these three columns and the clarity
, cut
, carat
and price
, when for the cut is Premium
or Ideal
.cut/color
, compute the mean of the three variables length
, width
, height
. Keep this summary in an object dim_summary
.diamonds_modif
as wanted.<- diamonds %>%
diamonds_modif rename(length = x, width = y, height = z) %>%
mutate_at(.vars = c("length", "width", "height"), # Change this column
.funs = function(x) x / 10) %>% # Applying them the same function
filter(cut %in% c("Premium", "Ideal")) %>% # Filtering on the cut
select(-depth, -color, -table) # Exclude non wanted columns
Then, we use summarise()
and a group_by
!
<- diamonds_modif %>%
dim_summary group_by(cut, clarity) %>%
summarise_at(c("length", "width", "height"), mean)
# A tibble: 16 x 5
# Groups: cut [2]
cut clarity length width height
<ord> <ord> <dbl> <dbl> <dbl>
1 Premium I1 0.684 0.679 0.413
2 Premium SI2 0.657 0.654 0.400
3 Premium SI1 0.605 0.601 0.369
4 Premium VS2 0.583 0.581 0.357
5 Premium VS1 0.574 0.571 0.351
6 Premium VVS2 0.539 0.537 0.330
7 Premium VVS1 0.504 0.502 0.308
8 Premium IF 0.523 0.522 0.319
9 Ideal I1 0.675 0.674 0.416
10 Ideal SI2 0.626 0.627 0.387
11 Ideal SI1 0.578 0.579 0.357
12 Ideal VS2 0.543 0.543 0.335
13 Ideal VS1 0.545 0.547 0.337
14 Ideal VVS2 0.521 0.523 0.322
15 Ideal VVS1 0.496 0.498 0.306
16 Ideal IF 0.483 0.485 0.298
# If required
# install.packages("nycflights13")
# Data on New York city flights and airports
library(nycflights13) # New York city flights of 2013 data set
Error in library(nycflights13): aucun package nommé 'nycflights13' n'est trouvé
<- airports %>%
table_1 select(name, lon, lat) %>% # Select those 3 columns
slice(1:2) # 2 first rows
Error in select(., name, lon, lat): objet 'airports' introuvable
<- airports %>%
table_2 select(name, lon, lat) %>% # Same columns as table_1, different rows
slice(6:7)
Error in select(., name, lon, lat): objet 'airports' introuvable
<- airports %>%
table_3 select(alt, tz) %>%
slice(1:2) # Same rows as table_1, different columns
Error in select(., alt, tz): objet 'airports' introuvable
bind_rows()
and bind_cols()
%>% # Binding table_2 below table_1
table_1 bind_rows(table_2) # Column names must match!
Error in list2(...): objet 'table_1' introuvable
%>% # Binding table_3 next to table_1
table_1 bind_cols(table_3) # Number of rows must match!
Error in list2(...): objet 'table_1' introuvable
left_join()
<- tibble(name = c("Sue", "Sue", "Marc", "Gunter", "Rayan", "Rayan"),
drivers car = c("Clio", "ZX", "AX", "Lada", "Twingo", "Clio"))
<- tibble(car = c("Twingo", "Ferrari", "Clio", "Lada", "ZX"),
vehicles speed = c("140", "280", "160", "85", "160"))
drivers
# A tibble: 6 x 2
name car
<chr> <chr>
1 Sue Clio
2 Sue ZX
3 Marc AX
4 Gunter Lada
5 Rayan Twingo
6 Rayan Clio
vehicles
# A tibble: 5 x 2
car speed
<chr> <chr>
1 Twingo 140
2 Ferrari 280
3 Clio 160
4 Lada 85
5 ZX 160
left_join(drivers, vehicles, by = "car")
# A tibble: 6 x 3
name car speed
<chr> <chr> <chr>
1 Sue Clio 160
2 Sue ZX 160
3 Marc AX <NA>
4 Gunter Lada 85
5 Rayan Twingo 140
6 Rayan Clio 160
NA
is inserted.left_join()
is not symmetricleft_join(vehicles, drivers, by = "car")
# A tibble: 6 x 3
car speed name
<chr> <chr> <chr>
1 Twingo 140 Rayan
2 Ferrari 280 <NA>
3 Clio 160 Sue
4 Clio 160 Rayan
5 Lada 85 Gunter
6 ZX 160 Sue
left_join(x, y)
is equivalent to right_join(y, x)
inner_join()
inner_join()
only keeps lines where info is present on both tables<- inner_join(drivers, vehicles, by = "car") inner_jointure
Marc
and Ferrari
lines are missing.full_join()
full_join()
keeps all lines, putting NA
if necessary<- full_join(drivers, vehicles, by = "car") full_jointure
Marc
and Ferrari
lines are present, with NA
.For the diamonds_modif
object, compute, for each combination cut/clarity
, the number of observations per group. Include it in the table dim_summary
created in the previous exercise (previous section).
%>%
diamonds_modif group_by(cut, clarity) %>%
summarise(Nb_obs = n()) %>%
left_join(dim_summary, by = c("cut", "clarity"))
# A tibble: 16 x 6
# Groups: cut [2]
cut clarity Nb_obs length width height
<ord> <ord> <int> <dbl> <dbl> <dbl>
1 Premium I1 205 0.684 0.679 0.413
2 Premium SI2 2949 0.657 0.654 0.400
3 Premium SI1 3575 0.605 0.601 0.369
4 Premium VS2 3357 0.583 0.581 0.357
5 Premium VS1 1989 0.574 0.571 0.351
6 Premium VVS2 870 0.539 0.537 0.330
7 Premium VVS1 616 0.504 0.502 0.308
8 Premium IF 230 0.523 0.522 0.319
9 Ideal I1 146 0.675 0.674 0.416
10 Ideal SI2 2598 0.626 0.627 0.387
11 Ideal SI1 4282 0.578 0.579 0.357
12 Ideal VS2 5071 0.543 0.543 0.335
13 Ideal VS1 3589 0.545 0.547 0.337
14 Ideal VVS2 2606 0.521 0.523 0.322
15 Ideal VVS1 2047 0.496 0.498 0.306
16 Ideal IF 1212 0.483 0.485 0.298
When you have a dataset, there are three interrelated rules which make your dataset tidy:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
This mean that you must have a clear idea of what is a variable, an observation, or even a value!
<- tibble(
grades Name = c("Tommy", "Mary", "Gary", "Cathy"),
Sex_age = c("m_15", "f_15", "m_16", "f_14"),
Test1 = c(10, 15, 16, 14),
Test2 = c(11, 13, 10, 12),
Test3 = c(12, 13, 17, 10)
)
Name | Sex_age | Test1 | Test2 | Test3 |
---|---|---|---|---|
Tommy | m_15 | 10 | 11 | 12 |
Mary | f_15 | 15 | 13 | 13 |
Gary | m_16 | 16 | 10 | 17 |
Cathy | f_14 | 14 | 12 | 10 |
One can spot multiple potential problems:
Sex_age
column gathers two variablesGrade
whose value is a grade (and might depend on which test was performed!)separate()
Sex_age
column should be two columns (two variables)<- # grades with two columns sex age
grades_s_a %>%
grades separate(Sex_age, into = c("Sex", "Age"), sep = "_")
print(grades_s_a)
# A tibble: 4 x 6
Name Sex Age Test1 Test2 Test3
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Tommy m 15 10 11 12
2 Mary f 15 15 13 13
3 Gary m 16 16 10 17
4 Cathy f 14 14 12 10
sep
argument can handle regular expressions.unite()
unite()
%>%
grades_s_a unite(col = "Sex_age", Sex, Age, sep = "_") %>%
::kable(format = "html") knitr
Name | Sex_age | Test1 | Test2 | Test3 |
---|---|---|---|---|
Tommy | m_15 | 10 | 11 | 12 |
Mary | f_15 | 15 | 13 | 13 |
Gary | m_16 | 16 | 10 | 17 |
Cathy | f_14 | 14 | 12 | 10 |
pivot_longer()
grade_s_a
, the a same variable (the grade) is coded in three columns.pivot_longer
.<- grades_s_a %>%
grades_long pivot_longer(starts_with("Test"), # Gathered columns
names_to = "Test", # Column gathering old column names
values_to = "Grade") # Column gathering values
grades_long
# A tibble: 12 x 5
Name Sex Age Test Grade
<chr> <chr> <chr> <chr> <dbl>
1 Tommy m 15 Test1 10
2 Tommy m 15 Test2 11
3 Tommy m 15 Test3 12
4 Mary f 15 Test1 15
5 Mary f 15 Test2 13
6 Mary f 15 Test3 13
7 Gary m 16 Test1 16
8 Gary m 16 Test2 10
9 Gary m 16 Test3 17
10 Cathy f 14 Test1 14
11 Cathy f 14 Test2 12
12 Cathy f 14 Test3 10
ggplot
.grades_gathered
is now \(12\times 5\).<- grades_s_a %>%
grades_long pivot_longer(cols = -c("Name", "Age", "Sex"), # non gathered columns
names_to = "Test", # Column gathering old column names
values_to = "Grade") # Column gathering values
grades_long
# A tibble: 12 x 5
Name Sex Age Test Grade
<chr> <chr> <chr> <chr> <dbl>
1 Tommy m 15 Test1 10
2 Tommy m 15 Test2 11
3 Tommy m 15 Test3 12
4 Mary f 15 Test1 15
5 Mary f 15 Test2 13
6 Mary f 15 Test3 13
7 Gary m 16 Test1 16
8 Gary m 16 Test2 10
9 Gary m 16 Test3 17
10 Cathy f 14 Test1 14
11 Cathy f 14 Test2 12
12 Cathy f 14 Test3 10
%>% # Initial data
grades_long group_by(Sex, Test) %>% # Grouping of interest
summarise(Mean = mean(Grade)) # Get the summary of interest, per group
# A tibble: 6 x 3
# Groups: Sex [2]
Sex Test Mean
<chr> <chr> <dbl>
1 f Test1 14.5
2 f Test2 12.5
3 f Test3 11.5
4 m Test1 13
5 m Test2 10.5
6 m Test3 14.5
pivot_wider()
pivot_longer
is pivot_wider()
# The sep argument is used to give correct names to column
%>%
grades_long pivot_wider(names_from = "Test", values_from = "Grade")
# A tibble: 4 x 6
Name Sex Age Test1 Test2 Test3
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Tommy m 15 10 11 12
2 Mary f 15 15 13 13
3 Gary m 16 16 10 17
4 Cathy f 14 14 12 10
complete()
<- tibble(Nom = c("Alain", "Alain", "Benoit", "Claire"),
notes Discipline = c("Maths", "Francais", "Maths", "Francais"),
Note = c(16, 9, 17, 11),
Present = rep("oui", 4))
Nom | Discipline | Note | Present |
---|---|---|---|
Alain | Maths | 16 | oui |
Alain | Francais | 9 | oui |
Benoit | Maths | 17 | oui |
Claire | Francais | 11 | oui |
%>%
notes complete(Nom, Discipline)
# A tibble: 6 x 4
Nom Discipline Note Present
<chr> <chr> <dbl> <chr>
1 Alain Francais 9 oui
2 Alain Maths 16 oui
3 Benoit Francais NA <NA>
4 Benoit Maths 17 oui
5 Claire Francais 11 oui
6 Claire Maths NA <NA>
NA
.<- notes %>%
notes_completes complete(Nom, Discipline,
fill = list(Note = 0, Present = "non")) # Custom the value per column
Nom | Discipline | Note | Present |
---|---|---|---|
Alain | Francais | 9 | oui |
Alain | Maths | 16 | oui |
Benoit | Francais | 0 | non |
Benoit | Maths | 17 | oui |
Claire | Francais | 11 | oui |
Claire | Maths | 0 | non |
replace_na
Consider the following table which as NA
<- tibble(Groupe = rep(c("A", "B"), each = 3),
donnees_na Nom = c("Al", "Bob", NA, "Dave", "Elle", "Fanch"),
Note = c(NA, 8, 7, 4.5, 1, 4))
donnees_na
# A tibble: 6 x 3
Groupe Nom Note
<chr> <chr> <dbl>
1 A Al NA
2 A Bob 8
3 A <NA> 7
4 B Dave 4.5
5 B Elle 1
6 B Fanch 4
NA
.replace_na
.%>%
donnees_na replace_na(replace = list(Nom ="Unknown",
Note = 0)) # Assign same value for all columns
# A tibble: 6 x 3
Groupe Nom Note
<chr> <chr> <dbl>
1 A Al 0
2 A Bob 8
3 A Unknown 7
4 B Dave 4.5
5 B Elle 1
6 B Fanch 4
mutate_if
, one can do several things# Replace NA only in numeric column
%>%
donnees_na mutate_if(is.numeric, # We change only if it's numeric
# The treatment is made column wise
function(colonne) replace_na(colonne, replace = 0))
# A tibble: 6 x 3
Groupe Nom Note
<chr> <chr> <dbl>
1 A Al 0
2 A Bob 8
3 A <NA> 7
4 B Dave 4.5
5 B Elle 1
6 B Fanch 4
# Replace NA by the mean of the numeric column
%>%
donnees_na mutate_if(is.numeric, # We change only if it's numeric
# The treatment is made column wise
function(colonne) replace_na(colonne,
replace = mean(colonne, na.rm = TRUE)))
# A tibble: 6 x 3
Groupe Nom Note
<chr> <chr> <dbl>
1 A Al 4.9
2 A Bob 8
3 A <NA> 7
4 B Dave 4.5
5 B Elle 1
6 B Fanch 4
group_by
, you can replace the NA
by the mean of the group (here, the group A)# Replace NA by the group mean
%>%
donnees_na group_by(Groupe) %>% # We first group by the required group
# And then apply the same treatment
mutate_if(is.numeric, # We change only if it's numeric
# The treatment is made column wise
function(colonne) replace_na(colonne,
replace = mean(colonne, na.rm = TRUE)))
# A tibble: 6 x 3
# Groups: Groupe [2]
Groupe Nom Note
<chr> <chr> <dbl>
1 A Al 7.5
2 A Bob 8
3 A <NA> 7
4 B Dave 4.5
5 B Elle 1
6 B Fanch 4
From the diamonds_modif
table of the previous exercise, create the diamonds_gathered
table having the following the following column structure:
# A tibble: 5 x 5
price cut clarity Dimension Value
<int> <ord> <ord> <chr> <dbl>
1 10096 Premium SI1 width 0.731
2 655 Ideal VVS1 width 0.439
3 593 Premium SI2 height 0.264
4 765 Ideal VS1 width 0.443
5 3543 Ideal VS2 length 0.573
<- diamonds_modif %>%
diamonds_gathered select(price, cut, clarity, length, width, height) %>%
gather(key = "Dimension", value = "Value", length, width, height)
stringr
packageThe stringr
provides powerful functions to handle characters.
For instance, let’s consider the following data.
For a full cover of all these possibilities, there are are lot of dedicated pages, for instance, this one.
<- tibble(Nom = c("Mr_Al_Bob", "Mr_Bob_Col"),
example Age = c(42, 41))
Mr
by a M.
str_replace
function%>%
example mutate(Nom = str_replace(Nom, pattern = "Mr", replacement = "M."))
# A tibble: 2 x 2
Nom Age
<chr> <dbl>
1 M._Al_Bob 42
2 M._Bob_Col 41
_
by a space%>%
example mutate(Nom = str_replace_all(Nom, pattern = c("_"), replacement = c(" ")))
# A tibble: 2 x 2
Nom Age
<chr> <dbl>
1 Mr Al Bob 42
2 Mr Bob Col 41
Mr
in front, with str_remove
%>%
example mutate(Nom = str_replace_all(Nom, pattern = c("_"), replacement = c(" ")),
Nom = str_remove(Nom, pattern = "Mr "))
# A tibble: 2 x 2
Nom Age
<chr> <dbl>
1 Al Bob 42
2 Bob Col 41
Let’s consider the following table
<- tibble(Name = c("Alice", "Bob", "Charles", "Dan", "Elsa", "Fanch"),
athletes Performance = c(15, 10, 5, 18, 11, 4),
Aptitude = c("Strong", "Intermediate", "Weak",
"Strong", "Intermediate", "Weak"),
Qualified = c(1, 1, 0, 1, 1, 0))
athletes
# A tibble: 6 x 4
Name Performance Aptitude Qualified
<chr> <dbl> <chr> <dbl>
1 Alice 15 Strong 1
2 Bob 10 Intermediate 1
3 Charles 5 Weak 0
4 Dan 18 Strong 1
5 Elsa 11 Intermediate 1
6 Fanch 4 Weak 0
The column Qualified
is here considered a a numeric, whereas it’s a qualitative, 1 meaning “yes”, 0 meaning no.
This can pose problem either for graphical representation or, worse, for model fitting in R
.
# We want a boxplot, with one box per level of Qualified
ggplot(athletes) +
aes(x = Qualified, y = Performance) +
geom_boxplot() # Do not work, as Qualified is coded as numeric!
Warning: Continuous x aesthetic -- did you forget aes(group=...)?
To change it, we will change (using mutate
) the column to factor.
<- athletes %>%
athletes mutate(Qualified = factor(Qualified)) # Changed to qualified
ggplot(athletes) +
aes(x = Qualified, y = Performance) +
geom_boxplot() # Now it works
The factor Qualified now has two levels, which are 0 and 1. In the graph above, it would be way better if 0/1 was understandable by a user. We can change it using the labels
argument to relabel the levels. Watch out, the labels must be specified in the same order than the levels
%>%
athletes mutate(Qualified = factor(Qualified,
levels = c(0, 1), # Specify the level order
labels = c("No", "Yes"))) %>% # And rename in the same order
ggplot() +
aes(x = Qualified, y = Performance) +
geom_boxplot() # Now it works
For plots or model, there is no absolute need to transform characters in factor, as R
understands that it is qualitative attribute.
# A boxplot of performance per Aptitude
ggplot(athletes) +
aes(x = Aptitude, y = Performance) +
geom_boxplot()
Here, the order of the levels is given by alphabetical order, and is not natural. We then can transform Aptitude to factor to specify the right levels order.
%>%
athletes mutate(Aptitude = factor(Aptitude,
# Now specify the levels order
# Caution, the levels given must appear in the data!
# f you want to change labels, use labels = ...
levels = c("Weak", "Intermediate", "Strong"))) %>%
# A boxplot of performance per Aptitude
ggplot() +
aes(x = Aptitude, y = Performance) +
geom_boxplot() # The levels are now in natural order