---
title: "R Users Breakout Session"
author: Matt Gunther - IPUMS PMA Senior Data Analyst
date: 2022-04-06
output:
ioslides_presentation:
widescreen: true
smaller: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
R.options = list(width = 100)
)
```
## Setup
R users: remember to select a **.dat (fixed-width text)** data format
You'll receive a compressed **dat.gz** file - no need to decompress!
Save both of those files in the "data" folder of your working directory.
```{r, echo = FALSE}
knitr::include_graphics("images/download.png", dpi = 150)
```
---
You'll need the [ipumsr](https://tech.popdata.org/ipumsr/index.html) package to load them. If not installed, you can download from CRAN.
```{r, eval = FALSE}
install.packages("ipumsr")
```
Each session, load the `ipumsr` library before you import data.
```{r, results='hide', message=FALSE, warning=FALSE}
library(ipumsr)
# Load data into R with `ipumsr`
dat <- read_ipums_micro(
ddi = "data/pma_00093.xml",
data = "data/pma_00093.dat.gz"
)
```
```{r, echo = FALSE}
knitr::include_graphics("images/logo.png", dpi = 150)
```
---
Other useful packages for IPUMS data:
```{r, results='hide', message=FALSE, warning=FALSE}
# General toolkit
library(tidyverse)
# For label manipulation:
library(labelled)
# For survey analysis:
library(survey)
library(srvyr)
```
```{r, echo = FALSE}
knitr::include_graphics("images/logo-ribbon.png", dpi = 150)
```
# 1 - Analytic Sample
---
PMA uses an **open panel design** - women may enter the panel after Phase 1, and they may be lost to follow-up after any phase.
See [RESULTFQ](https://pma.ipums.org/pma-action/variables/RESULTFQ)
Women who enter the panel at Phase 2 are `NA` for all variables at Phase 1.
```{r}
dat %>% count(RESULTFQ_1)
```
---
Women whose households were not found again after Phase 1 are `NA` for all variables at Phase 2.
```{r}
dat %>% count(RESULTFQ_2)
```
---
We will only include women who were available and completed the Female Questionnaire for *both* Phase 1 and Phase 2.
```{r}
dat <- dat %>% filter(RESULTFQ_1 == 1 & RESULTFQ_2 == 1)
dat %>% count(RESULTFQ_1, RESULTFQ_2)
```
---
Additionally, PMA samples are only valid for the *de facto* population: women who
slept in the household the night before the Household interview.
See [RESIDENT](https://pma.ipums.org/pma-action/variables/RESIDENT)
```{r}
dat %>% count(RESIDENT_1)
```
We'll also drop cases where the woman was not part of the *de facto* population in either Phase 1 or Phase 2.
```{r}
dat <- dat %>% filter(RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22))
```
---
How many cases remain?
```{r}
dat %>% count(COUNTRY)
```
# 2 - Recoding Independent variables
---
PMA surveys contain many **categorical** variables. These are usually represented as **factors** in R.
In an IPUMS data extract, you won't see factors!
Instead, we generate **labelled** numeric variables (note the label in brackets).
```{r}
dat %>% ipums_var_label(CVINCOMELOSS_2)
dat %>% count(CVINCOMELOSS_2)
```
---
The [ipumsr](https://tech.popdata.org/ipumsr/index.html) package contains tools for working with labelled IPUMS data.
Usually, we handle codes like `99 [NIU (not in universe)]` before transforming other missing data to `NA`.
```{r}
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
**Tip:**
Information the code `NIU (not in universe)` can always be found on a variable's [universe tab](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS#universe_section).
```{r, echo = FALSE}
knitr::include_graphics("images/universe.png", dpi = 300)
```
---
For [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS), `99 [NIU (not in universe)]` may indicate that the household experienced *no income loss in the last year*, or it may indicate that [HHINCOMELOSSAMT_2](https://pma.ipums.org/pma-action/variables/HHINCOMELOSSAMT) is `98 [No response or missing]`.
We should treat the `NIU` women from households without *any* income loss as "No" in [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS).
```{r}
dat <- dat %>%
mutate(
CVINCOMELOSS_2 = CVINCOMELOSS_2 %>%
labelled::recode_if(HHINCOMELOSSAMT_2 == 1, 0)
)
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
Next, we'll use `NA` to represent the remaining values above `90`:
* `97 [Don't know] ` and
* remaining cases marked `99 [NIU (not in universe)]`
```{r}
dat <- dat %>%
mutate(
CVINCOMELOSS_2 = CVINCOMELOSS_2 %>%
lbl_na_if(~.val > 90)
)
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
Once you're done with labels, we recommend transforming key variables into **factors** with [forcats::as_factor](https://forcats.tidyverse.org/reference/as_factor.html).
The [forcats](https://forcats.tidyverse.org) package is included when you load `library(tidyverse)`.
```{r}
dat <- dat %>% mutate(CVINCOMELOSS_2 = as_factor(CVINCOMELOSS_2))
dat %>% count(CVINCOMELOSS_2)
```
This will make categorical variables easier to use in data visualization and as "dummy" variables in regression analysis.
---
Likert-style questions can be treated as factors, too.
```{r}
dat %>% ipums_var_label(COVIDCONCERN_2)
dat %>% count(COVIDCONCERN_2)
```
---
This time we'll treat codes `5` and above as `NA`.
```{r}
dat <- dat %>%
mutate(
COVIDCONCERN_2 = COVIDCONCERN_2 %>%
lbl_na_if(~.val >= 5) %>%
as_factor()
)
dat %>% count(COVIDCONCERN_2)
```
---
You can apply the same transformation to several variables with help from [dplyr::across](https://dplyr.tidyverse.org/reference/across.html).
[dplyr]() is another package included when you load `library(tidyverse)`.
```{r}
dat <- dat %>%
mutate(
across(
c(COUNTRY, URBAN, WEALTHT_2, EDUCATTGEN_2),
~.x %>% lbl_na_if(~.val >= 90) %>% as_factor()
)
)
```
---
Often, it's important to set a **reference group** against which all dummy variables will be compared.
You can manually specify a **refernece group** when you set factor "levels" with a function like [forcats::fct_relevel](https://forcats.tidyverse.org/reference/fct_relevel.html).
```{r}
dat <- dat %>%
mutate(
AGE_2 = case_when(
AGE_2 < 25 ~ "15-24",
AGE_2 < 35 ~ "25-34",
AGE_2 < 50 ~ "35-49"
),
AGE_2 = AGE_2 %>% fct_relevel("15-24", "25-34", "35-49")
)
```
# 3 - Dependent variables
---
We'll use our recoded variables to model the likelihood of contraceptive method **adoption** and **discontinuation** between phases.
See [CP](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS)
```{r}
dat <- dat %>% filter(CP_1 < 90 & CP_2 < 90)
dat %>% count(CP_1, CP_2)
```
---
A woman has **adopted** a method if she was *not* using one at Phase 1, but then reported using one at Phase 2.
She has **discontinued** a method if she *did* use one at Phase 1, but no longer uses one at Phase 2.
```{r}
dat <- dat %>%
mutate(
FPSTATUS = case_when(
CP_1 == 1 & CP_2 == 1 ~ "User",
CP_1 == 0 & CP_2 == 0 ~ "Non-user",
CP_1 == 1 & CP_2 == 0 ~ "Discontinued",
CP_1 == 0 & CP_2 == 1 ~ "Adopted"
),
FPSTATUS = fct_infreq(FPSTATUS)
)
```
---
Un-weighted sample proportions for `FPSTATUS` can be found with [count](https://dplyr.tidyverse.org/reference/count.html) and [prop.table](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/prop.table):
```{r}
dat_nowt <- dat %>%
group_by(COUNTRY) %>%
count(FPSTATUS) %>%
mutate(prop = prop.table(n))
dat_nowt
```
We'll plot this table with [ggplot2](https://ggplot2.tidyverse.org/index.html) (also included with the [tidyverse](https://tidyverse.tidyverse.org/)).
---
```{r, fig.height=4, fig.width=10}
dat_nowt %>%
ggplot(aes(x = prop, y = FPSTATUS, fill = FPSTATUS)) +
geom_bar(stat = "identity") +
facet_wrap(~COUNTRY) + theme_minimal() +
theme(axis.title = element_blank(), legend.position = "none") +
scale_x_continuous(labels = scales::label_percent())
```
---
For *weighted* population estimates, use [as_survey_design](http://gdfe.co/srvyr/reference/as_survey_design.html) and [survey_mean](http://gdfe.co/srvyr/reference/survey_mean.html) from the [srvyr](http://gdfe.co/srvyr/index.html) package.
Use `prop = TRUE` to adjust standard errors near 0% or 100% for proportions.
```{r}
dat_wtd <- dat %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
group_by(COUNTRY, FPSTATUS) %>%
summarise(survey_mean(prop = TRUE, prop_method = "logit", vartype = "ci"))
dat_wtd
```
---
```{r, fig.height=4, fig.width=10}
dat_wtd %>%
ggplot(aes(x = coef, y = FPSTATUS, fill = FPSTATUS)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = `_low`, xmax = `_upp`), width = 0.2, alpha = 0.5) +
facet_wrap(~COUNTRY) + theme_minimal() +
theme(axis.title = element_blank(), legend.position = "none") +
scale_x_continuous(labels = scales::label_percent())
```
# 4 - Analysis
---
The same [srvyr](http://gdfe.co/srvyr/index.html) toolkit can be used to model our dependent variables with [survey::svyglm](http://r-survey.r-forge.r-project.org/survey/).
Consider women who were *not* using a method at Phase 1:
```{r}
adopt_glm <- dat %>%
filter(CP_1 == 0) %>%
mutate(adopt = FPSTATUS == "Adopted") %>%
group_by(COUNTRY) %>%
summarise(
adopt = cur_data() %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
svyglm(
adopt ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2,
family = "quasibinomial", design = .
) %>%
broom::tidy(exp = TRUE) %>%
mutate(sig = gtools::stars.pval(p.value)) %>%
list()
)
adopt_glm
```
---
For Phase 1 non-users in Burkina Faso, **very high** levels of concern about becoming infected with COVID-19 are significantly associated with higher chances of adopting a contraceptive method (relative to women who had no such concern).
Lesser levels of concern are not statistically significant, nor is household income loss from COVID-19.
```{r}
adopt_glm %>%
filter(COUNTRY == "Burkina Faso") %>%
unnest(adopt)
```
---
In Kenya, neither of these measures are significantly predictive of adoption among non-users.
```{r}
adopt_glm %>%
filter(COUNTRY == "Kenya") %>%
unnest(adopt)
```
---
What about method **dicontinuation** for women who *were* using a method at Phase 1?
```{r}
stop_glm <- dat %>%
filter(CP_1 == 1) %>%
mutate(stop = FPSTATUS == "Discontinued") %>%
group_by(COUNTRY) %>%
summarise(
stop = cur_data() %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
svyglm(
stop ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2,
family = "quasibinomial", design = .
) %>%
broom::tidy(exp = TRUE) %>%
mutate(sig = gtools::stars.pval(p.value)) %>%
list()
)
stop_glm
```
---
This time, neither of the COVID-19 measures are significantly associated with **discontinuation** for Phase 1 contraceptive users in Burkina Faso.
```{r}
stop_glm %>%
filter(COUNTRY == "Burkina Faso") %>%
unnest(stop)
```
---
However, higher levels concern with becoming infected with COVID-19 *are* significantly associated with higher odds of discontinuation for Phase 1 contraceptive users in Kenya.
```{r}
stop_glm %>%
filter(COUNTRY == "Kenya") %>%
unnest(stop)
```
---
For more R tips for IPUMS data, check out:
* The [IPUMS PMA blog](https://tech.popdata.org/pma-data-hub/)
* The [ipumsr](https://tech.popdata.org/ipumsr/) documentation website
* The [ipums tutorials](https://www.ipums.org/support/tutorials) page
Thank you!