Make sure you have a version of R newer than 3.1.0, and install the following package:
install.packages("dplyr")
The structure that corresponds most closely to a Stata dataset is a tibble.
library(dplyr)

N <- 100
df <- tibble(
  id = sample(c("id01", "id02", "id03"), N, TRUE),
  v1 = sample(5, N, TRUE),
  v2 = sample(round(runif(N, max = 100), 4), N, TRUE)
)
To select a few columns from a dataset:
Stata | keep id v1 |
dplyr | df %>% select(id, v1) |
In Stata, wildcards let you select multiple variables at once. In dplyr, helper functions give very similar results:
Stata | keep v* |
dplyr | select(df, starts_with("v")) |
This table gives the list of helper functions:
Stata | dplyr |
---|---|
keep v* | select(df, starts_with("v")) |
keep *v | select(df, ends_with("v")) |
keep *v* | select(df, contains("v")) |
keep v? | select(df, matches("^v.$")) |
keep * | select(df, everything()) |
drop v1 | select(df, -v1) |
keep id-v2 | select(df, id:v2) |
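As a quick check of the helpers in the table, here is a minimal sketch that applies each of them to a small made-up tibble and prints the selected column names. The object `toy` and its column `w_v` are hypothetical; they are there only so that `ends_with()` and `contains()` have something to match.

```r
library(dplyr)

# Hypothetical toy data: w_v exists only to exercise ends_with() and contains()
toy <- tibble(id = c("a", "b"), v1 = 1:2, v2 = c(0.5, 1.5), w_v = c(TRUE, FALSE))

names(select(toy, starts_with("v")))  # names beginning with "v"
names(select(toy, ends_with("v")))    # names ending with "v"
names(select(toy, contains("v")))     # names containing "v" anywhere
names(select(toy, matches("^v.$")))   # "v" followed by exactly one character
names(select(toy, -v1))               # every column except v1
names(select(toy, id:v2))             # contiguous range from id to v2
```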
To rename columns:
Stata | rename id id1 |
dplyr | df %>% rename(id1 = id) |
To reorder columns:
Stata | order v1 |
dplyr | df %>% select(v1, everything()) |
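If you are on dplyr 1.0.0 or later (an assumption about your installation), `relocate()` is a more direct way to move columns. The sketch below uses a small made-up tibble called `toy` to compare it with the `select(..., everything())` idiom.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df
toy <- tibble(id = c("id01", "id02"), v1 = c(1, 2), v2 = c(0.1, 0.2))

a <- toy %>% select(v1, everything())  # v1 first, then all remaining columns
b <- toy %>% relocate(v1)              # by default, moves v1 to the front

names(a)
names(b)
identical(names(a), names(b))
```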
To create new columns:
Stata | gen new = 1 |
dplyr | df %>% mutate(new = 1) |
To create a column from a statistic computed over existing columns:
Stata | egen cov = cov(v1, v2) |
dplyr | df %>% mutate(cov = cov(v1, v2)) |
To modify only certain rows of a column:
Stata | replace v1 = 0 if id == "id01" |
dplyr | df %>% mutate(v1 = ifelse(id == "id01", 0, v1)) |
To apply the same function to multiple columns, use across:
Stata | tostring v1 v2, replace force |
dplyr | df %>% mutate(across(c(v1, v2), as.character)) |
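By default `across()` overwrites the columns it is applied to, much like the `replace` option in Stata. If you would rather keep the originals, the `.names` argument of `across()` writes the results to new columns instead. A minimal sketch on a made-up tibble (`toy` and the `_str` suffix are illustrative choices):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = c("id01", "id02"), v1 = c(1L, 2L), v2 = c(0.5, 1.25))

# Convert v1 and v2 to character, storing the results in v1_str and v2_str
out <- toy %>%
  mutate(across(c(v1, v2), as.character, .names = "{.col}_str"))

names(out)
out$v2_str
```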
The syntax for collapsing a dataset is very similar to the syntax for modifying columns: just use summarize instead of mutate.
To return a dataset composed of summary statistics computed over multiple rows:
Stata | collapse (mean) v1 (sd) v2 |
dplyr | df %>% summarize(mean(v1, na.rm = TRUE), sd(v2, na.rm = TRUE)) |
To apply each function to multiple variables:
Stata | collapse (mean) v* (sd) v* |
dplyr | df %>% summarize(across(starts_with("v"), list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))) |
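With an unnamed list of functions, the output columns get numeric suffixes. Naming the list elements gives more readable names of the form `column_function`. A small sketch, using a made-up tibble `toy` and the labels `mean` and `sd` as illustrative choices:

```r
library(dplyr)

# Hypothetical toy data with one missing value in v1
toy <- tibble(v1 = c(1, 2, 3, NA), v2 = c(10, 20, 30, 40))

out <- toy %>%
  summarize(across(
    starts_with("v"),
    list(
      mean = ~ mean(.x, na.rm = TRUE),  # names of the list become column suffixes
      sd   = ~ sd(.x, na.rm = TRUE)
    )
  ))

names(out)
#> [1] "v1_mean" "v1_sd"   "v2_mean" "v2_sd"
```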
Unlike their Stata counterparts, these commands do not overwrite the existing dataset: they return a new one, which you must assign to a name if you want to keep it.
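To make this concrete, the sketch below (on a made-up tibble `toy`) shows that calling `mutate()` leaves the input untouched until the result is assigned back.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = c("id01", "id02", "id03"), v1 = c(1, 2, 3))

toy %>% mutate(new = 1)   # returns a modified copy; toy itself is unchanged
cols_before <- names(toy)

toy <- toy %>% mutate(new = 1)  # assign the result to keep the change
cols_after <- names(toy)

cols_before
cols_after
```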
You can filter rows using logical conditions:
Stata | keep if v1 >= 2 |
dplyr | df %>% filter(v1 >= 2) |
You can also filter rows based on their position:
Stata | keep if _n <= 100 |
dplyr | df %>% filter(row_number() <= 100) |
The equivalent of Stata's inlist is %in%:
Stata | keep if inlist(id, "id01", "id02") |
dplyr | df %>% filter(id %in% c("id01", "id02")) |
The equivalent of Stata's inrange is between:
Stata | keep if inrange(v1, 3, 5) |
dplyr | df %>% filter(between(v1, 3, 5)) |
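`between()` includes both bounds, so it is shorthand for `x >= left & x <= right`. A minimal sketch on a made-up tibble `toy`:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(v1 = 1:6)

kept <- toy %>% filter(between(v1, 3, 5))   # bounds are inclusive
same <- toy %>% filter(v1 >= 3, v1 <= 5)    # equivalent explicit conditions

kept$v1
#> [1] 3 4 5
identical(kept$v1, same$v1)
```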
In Stata, missing values behave like +Inf. In R, missing values are special values that represent epistemic uncertainty. Operations involving NA return NA when the result of the operation cannot be determined. When the result is the same whatever the missing value is, as in TRUE | NA, R returns that result.
NA + 1
#> [1] NA
TRUE | NA
#> [1] TRUE
Use is.na to test for missing values:
1 == NA
#> [1] NA
is.na(NA)
#> [1] TRUE
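The same rules apply element by element to vectors. The sketch below uses a made-up vector `x` to show why `== NA` is never a useful test, and how `is.na()` and the `na.rm` argument are typically used instead.

```r
# Hypothetical vector with one missing value
x <- c(1, NA, 3)

x == NA                 # comparing with NA is always NA, even for the NA element
is.na(x)                # the reliable test for missingness
n_missing <- sum(is.na(x))            # count of missing values
m_all     <- mean(x)                  # NA: the mean depends on the unknown value
m_obs     <- mean(x, na.rm = TRUE)    # mean over the non-missing values

n_missing
m_obs
```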
In Stata, the empty string "" is a missing value. This is not true in R:
is.na("")
#> [1] FALSE
To drop rows where y is missing:
df <- tibble(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
df %>% filter(!is.na(y))
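filter(df, condition) only keeps rows where the condition evaluates to TRUE. In particular, rows for which it evaluates to NA are dropped. Contrast the following behavior with Stata, where a missing value counts as larger than any number, so both conditions would also keep the row with a missing x: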
df <- tibble(x = c(1, 2, NA))
df
#> x
#> 1 1
#> 2 2
#> 3 NA
filter(df, x >= 2)
#> x
#> 1 2
filter(df, !(x == 1))
#> x
#> 1 2
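If you do want to keep the rows with a missing value, the usual approach is to say so explicitly with `is.na()`. A minimal sketch on a made-up tibble `toy`:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(1, 2, NA))

# NA | TRUE is TRUE, so the row with a missing x is kept
kept_ge  <- filter(toy, x >= 2 | is.na(x))
kept_neq <- filter(toy, !(x == 1) | is.na(x))

kept_ge$x
#> [1]  2 NA
```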
To sort rows:
Stata | sort id v1 |
dplyr | arrange(df, id, v1) |
Missing values are sorted last, as in Stata.
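In dplyr, missing values also stay last when sorting in descending order with `desc()`. The sketch below checks this on a made-up tibble `toy`; it makes no claim about the corresponding Stata behavior.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(v1 = c(2, NA, 3, 1))

asc <- arrange(toy, v1)         # ascending: NA at the end
dsc <- arrange(toy, desc(v1))   # descending: NA still at the end

asc$v1
#> [1]  1  2  3 NA
dsc$v1
#> [1]  3  2  1 NA
```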