Data Wrangling with dplyr and tidyr Cheat Sheet

Tidy Data - A foundation for wrangling in R

In a tidy data set:
  Each variable is saved in its own column
  &
  Each observation is saved in its own row

Tidy data complements R’s vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

[Figure: a data set with columns F, M, A, illustrating the column-wise operation M * A]
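
As a small illustration of how tidy data and vectorized operations fit together, the sketch below computes a column-wise product of the form F = M * A. The data frame name `df` and its values are made up for the example; only base R is assumed.

```r
# Hypothetical tidy data: one row per observation, one column per variable.
df <- data.frame(
  M = c(1, 2, 3),
  A = c(10, 20, 30)
)

# A vectorized operation works on whole columns at once. Element i of the
# result is computed from row i, so each observation stays intact.
df$F <- df$M * df$A

print(df)
```

Because every variable is a column, no loop or index bookkeeping is needed to keep the rows aligned.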

Reshaping Data - Change the layout of a data set

tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.

tidyr::spread(pollution, size, amount)
Spread rows into columns.

tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.

tidyr::unite(data, col, ..., sep)
Unite several columns into one.

dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame (optimized).

dplyr::arrange(mtcars, mpg)
Order rows by values of a column (low to high).

dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column (high to low).

dplyr::rename(tb, y = year)
Rename the columns of a data frame.

Syntax - Helpful conventions for wrangling

dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than data frames. R displays only the data that fits onscreen:

  Source: local data frame [150 x 5]

     Sepal.Length Sepal.Width Petal.Length
  1           5.1         3.5          1.4
  2           4.9         3.0          1.4
  3           4.7         3.2          1.3
  4           4.6         3.1          1.5
  5           5.0         3.6          1.4
  ..          ...         ...          ...
  Variables not shown: Petal.Width (dbl), Species (fctr)
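
The four reshaping verbs listed under Reshaping Data (gather, spread, separate, unite) can be tried on small inputs. This is a minimal sketch, assuming tidyr is installed; the `cases` and `storms` data frames below are made-up stand-ins with invented values, not the EDAWR data sets the sheet refers to.

```r
library(tidyr)

# Hypothetical wide data: one column per year (values are invented).
cases <- data.frame(
  country = c("FR", "DE", "US"),
  `2011`  = c(7, 5, 14),
  `2012`  = c(6, 4, 13),
  `2013`  = c(6, 4, 12),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather columns 2:4 into rows: one row per country-year pair.
long <- gather(cases, "year", "n", 2:4)

# Spread the year/n pair back into one column per year.
wide <- spread(long, year, n)

# Hypothetical data with a date stored as a single string.
storms <- data.frame(
  storm = c("S1", "S2"),
  date  = c("2000-08-03", "2011-04-20"),
  stringsAsFactors = FALSE
)

# Separate one column into several, then unite them again.
parts <- separate(storms, date, c("y", "m", "d"))
whole <- unite(parts, "date", y, m, d, sep = "-")
```

Later tidyr releases also provide pivot_longer() and pivot_wider() for the same two layout changes; the sketch stays with the calls shown on this sheet.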

dplyr::glimpse(iris)
Information dense summary of tbl data.

utils::View(iris)
View data set in spreadsheet-like display (note capital V).

dplyr::%>%
Passes object on left hand side as first argument (or . argument) of function on right-hand side.

  x %>% f(y) is the same as f(x, y)
  y %>% f(x, ., z) is the same as f(x, y, z)

"Piping" with %>% makes code more readable, e.g.

  iris %>%
    group_by(Species) %>%
    summarise(avg = mean(Sepal.Width)) %>%
    arrange(avg)

Subset Observations (Rows)

dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.

dplyr::distinct(iris)
Remove duplicate rows.

dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.

dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.

dplyr::slice(iris, 10:15)
Select rows by position.

dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).

Logic in R - ?Comparison, ?base::Logic

  <    Less than                  !=                  Not equal to
  >    Greater than               %in%                Group membership
  ==   Equal to                   is.na               Is NA
  <=   Less than or equal to      !is.na              Is not NA
  >=   Greater than or equal to   &,|,!,xor,any,all   Boolean operators

Subset Variables (Columns)

dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.

Helper functions for select - ?select

select(iris, contains("."))
Select columns whose name contains a character string.

select(iris, ends_with("Length"))
Select columns whose name ends with a character string.

select(iris, everything())
Select every column.

select(iris, matches(".t."))
Select columns whose name matches a regular expression.

select(iris, num_range("x", 1:5))
Select columns named x1, x2, x3, x4, x5.

select(iris, one_of(c("Species", "Genus")))
Select columns whose names are in a group of names.

select(iris, starts_with("Sepal"))
Select columns whose name starts with a character string.

select(iris, Sepal.Length:Petal.Width)
Select all columns between Sepal.Length and Petal.Width (inclusive).

select(iris, -Species)
Select all columns except Species.
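
The row and column subsetting entries and the %>% convention can be combined in a few lines. This is a sketch, assuming dplyr is installed and using the built-in iris data; the variable names (`long_sepals`, `picked`, `piped`, `nested`) are chosen here for illustration.

```r
library(dplyr)

# Subset observations (rows) with a logical test.
long_sepals <- filter(iris, Sepal.Length > 7)

# Combine comparisons with %in% and a Boolean operator.
picked <- filter(iris, Species %in% c("setosa", "virginica") & Sepal.Width >= 3.5)

# Subset variables (columns) with a helper function or by exclusion.
sepal_cols <- select(iris, starts_with("Sepal"))
no_species <- select(iris, -Species)

# Piped form: each step receives the previous result as its first argument.
piped <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width)) %>%
  arrange(avg)

# Equivalent nested form, read from the inside out.
nested <- arrange(summarise(group_by(iris, Species), avg = mean(Sepal.Width)), avg)
```

Both forms compute the same table; the piped version lists the steps in the order they run.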

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • 844-448-1212 • rstudio.com • dplyr 0.4.0 • tidyr 0.2.0 • Updated: 1/15
devtools::install_github("rstudio/EDAWR") for data sets • Learn more with browseVignettes(package = c("dplyr", "tidyr"))

Summarise Data

dplyr::summarise(iris, avg = mean(Sepal.Length))
Summarise data into single row of values.

dplyr::summarise_each(iris, funs(mean))
Apply summary function to each column.

dplyr::count(iris, Species, wt = Sepal.Length)
Count number of rows with each unique value of variable (with or without weights).

Summarise uses summary functions, functions that take a vector of values and return a single value.

Make New Variables

dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width)
Compute and append one or more new columns.

dplyr::mutate_each(iris, funs(min_rank))
Apply window function to each column.

dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)
Compute one or more new columns. Drop original columns.

Mutate uses window functions, functions that take a vector of values and return another vector of values.

Combine Data Sets

      a              b
  x1  x2         x1  x3
  A   1     +    A   T     =
  B   2          B   F
  C   3          D   T

Mutating Joins

dplyr::left_join(a, b, by = "x1")
Join matching rows from b to a.

  x1  x2  x3
  A   1   T
  B   2   F
  C   3   NA

dplyr::right_join(a, b, by = "x1")
Join matching rows from a to b.

  x1  x3  x2
  A   T   1
  B   F   2
  D   T   NA

dplyr::inner_join(a, b, by = "x1")
Join data. Retain only rows in both sets.

  x1  x2  x3
  A   1   T
  B   2   F

dplyr::full_join(a, b, by = "x1")
Join data. Retain all values, all rows.

  x1  x2  x3
  A   1   T
  B   2   F
  C   3   NA
  D   NA  T
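
The mutating joins can be reproduced with the two small tables a and b shown in the Combine Data Sets diagram. This is a minimal sketch, assuming dplyr is installed; the tables are built with base data.frame(), and the column order of the results may differ between dplyr versions.

```r
library(dplyr)

# The tables a and b from the diagram.
a <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3, stringsAsFactors = FALSE)
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE),
                stringsAsFactors = FALSE)

# Keep every row of a; x3 is NA where b has no match (row C).
left <- left_join(a, b, by = "x1")

# Keep every row of b; x2 is NA where a has no match (row D).
right <- right_join(a, b, by = "x1")

# Keep only rows whose x1 appears in both tables (A and B).
inner <- inner_join(a, b, by = "x1")

# Keep all rows from both tables (A, B, C, D).
full <- full_join(a, b, by = "x1")
```

The four calls differ only in which unmatched rows survive, which is why the row counts here are 3, 3, 2, and 4.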

Filtering Joins

dplyr::semi_join(a, b, by = "x1")
All rows in a that have a match in b.

  x1  x2
  A   1
  B   2

dplyr::anti_join(a, b, by = "x1")
All rows in a that do not have a match in b.

  x1  x2
  C   3

      y              z
  x1  x2         x1  x2
  A   1     +    B   2     =
  B   2          C   3
  C   3          D   4

Set Operations

dplyr::intersect(y, z)
Rows that appear in both y and z.

  x1  x2
  B   2
  C   3

dplyr::union(y, z)
Rows that appear in either or both y and z.

  x1  x2
  A   1
  B   2
  C   3
  D   4

dplyr::setdiff(y, z)
Rows that appear in y but not z.

  x1  x2
  A   1

Binding

dplyr::bind_rows(y, z)
Append z to y as new rows.

  x1  x2
  A   1
  B   2
  C   3
  B   2
  C   3
  D   4

dplyr::bind_cols(y, z)
Append z to y as new columns. Caution: matches rows by position.

  x1  x2  x1  x2
  A   1   B   2
  B   2   C   3
  C   3   D   4

Summary functions

  dplyr::first        First value of a vector.
  dplyr::last         Last value of a vector.
  dplyr::nth          Nth value of a vector.
  dplyr::n            # of values in a vector.
  dplyr::n_distinct   # of distinct values in a vector.
  IQR                 IQR of a vector.
  min                 Minimum value in a vector.
  max                 Maximum value in a vector.
  mean                Mean value of a vector.
  median              Median value of a vector.
  var                 Variance of a vector.
  sd                  Standard deviation of a vector.

Window functions

  dplyr::lead           Copy with values shifted by 1.
  dplyr::lag            Copy with values lagged by 1.
  dplyr::dense_rank     Ranks with no gaps.
  dplyr::min_rank       Ranks. Ties get min rank.
  dplyr::percent_rank   Ranks rescaled to [0, 1].
  dplyr::row_number     Ranks. Ties go to first value.
  dplyr::ntile          Bin vector into n buckets.
  dplyr::between        Are values between a and b?
  dplyr::cume_dist      Cumulative distribution.
  dplyr::cumall         Cumulative all
  dplyr::cumany         Cumulative any
  dplyr::cummean        Cumulative mean
  cumsum                Cumulative sum
  cummax                Cumulative max
  cummin                Cumulative min
  cumprod               Cumulative prod
  pmax                  Element-wise max
  pmin                  Element-wise min

Group Data

dplyr::group_by(iris, Species)
Group data into rows with the same value of Species.

dplyr::ungroup(iris)
Remove grouping information from data frame.

iris %>% group_by(Species) %>% summarise(…)
Compute separate summary row for each group.

iris %>% group_by(Species) %>% mutate(…)
Compute new variables by group.
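
The difference between summary and window functions, and how grouping changes both, can be seen on a short vector and on iris. This is a sketch, assuming dplyr is installed; the vector `x` and the result names are chosen here for illustration.

```r
library(dplyr)

x <- c(3, 1, 2, 1)

# Summary functions collapse a vector to a single value.
n_vals <- n_distinct(x)      # 3 distinct values
avg    <- mean(x)            # 1.75

# Window functions return a vector as long as the input.
lagged <- dplyr::lag(x)      # NA 3 1 2
ranks  <- min_rank(x)        # 4 1 3 1 (ties get the minimum rank)
dense  <- dense_rank(x)      # 3 1 2 1 (no gaps)
totals <- cumsum(x)          # 3 4 6 7

# Grouped summarise: one summary row per Species.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Length), n = n())

# Grouped mutate: the window function restarts within each Species.
ranked <- iris %>%
  group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Length)) %>%
  ungroup()
```

Calling ungroup() at the end returns an ordinary ungrouped table, so later steps are not silently applied per group.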
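
The filtering joins, set operations, and binding entries under Combine Data Sets can be checked against the small tables a, b, y, and z from the diagrams. This is a minimal sketch, assuming dplyr is installed; the tables are rebuilt with base data.frame(), and the result names are chosen here for illustration.

```r
library(dplyr)

# Tables from the diagrams.
a <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3, stringsAsFactors = FALSE)
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE),
                stringsAsFactors = FALSE)
y <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3, stringsAsFactors = FALSE)
z <- data.frame(x1 = c("B", "C", "D"), x2 = 2:4, stringsAsFactors = FALSE)

# Filtering joins keep or drop rows of a; they never add columns from b.
matched   <- semi_join(a, b, by = "x1")   # rows A and B
unmatched <- anti_join(a, b, by = "x1")   # row C

# Set operations compare whole rows of y and z.
both   <- dplyr::intersect(y, z)          # rows B and C
either <- dplyr::union(y, z)              # rows A, B, C, D
only_y <- dplyr::setdiff(y, z)            # row A

# Binding stacks tables without matching on keys.
stacked <- bind_rows(y, z)                # 6 rows, duplicates kept
# bind_cols pairs rows by position; depending on the dplyr version the
# duplicated column names may be repaired automatically.
side <- bind_cols(y, z)                   # 3 rows, 4 columns
```

Because bind_cols() matches rows purely by position, a join is usually the safer choice when the two tables share a key column.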