--- title: "R Users Breakout Session" author: Matt Gunther - IPUMS PMA Senior Data Analyst date: 2022-04-06 output: ioslides_presentation: widescreen: true smaller: true --- ```{r, echo = FALSE} knitr::opts_chunk$set( R.options = list(width = 100) ) ``` ## Setup R users: remember to select a **.dat (fixed-width text)** data format You'll receive a compressed **dat.gz** file - no need to decompress! Save both of those files in the "data" folder of your working directory.
```{r, echo = FALSE} knitr::include_graphics("images/download.png", dpi = 150) ```
--- You'll need the [ipumsr](https://tech.popdata.org/ipumsr/index.html) package to load them. If not installed, you can download from CRAN. ```{r, eval = FALSE} install.packages("ipumsr") ``` Each session, load the `ipumsr` library before you import data. ```{r, results='hide', message=FALSE, warning=FALSE} library(ipumsr) # Load data into R with `ipumsr` dat <- read_ipums_micro( ddi = "data/pma_00093.xml", data = "data/pma_00093.dat.gz" ) ```
```{r, echo = FALSE} knitr::include_graphics("images/logo.png", dpi = 150) ```
--- Other useful packages for IPUMS data: ```{r, results='hide', message=FALSE, warning=FALSE} # General toolkit library(tidyverse) # For label manipulation: library(labelled) # For survey analysis: library(survey) library(srvyr) ```
```{r, echo = FALSE} knitr::include_graphics("images/logo-ribbon.png", dpi = 150) ```
# 1 - Analytic Sample --- PMA uses an **open panel design** - women may enter the panel after Phase 1, and they may be lost to follow-up after any phase. See [RESULTFQ](https://pma.ipums.org/pma-action/variables/RESULTFQ) Women who enter the panel at Phase 2 are `NA` for all variables at Phase 1. ```{r} dat %>% count(RESULTFQ_1) ``` --- Women whose households were not found again after Phase 1 are `NA` for all variables at Phase 2. ```{r} dat %>% count(RESULTFQ_2) ``` --- We will only include women who were available and completed the Female Questionnaire for *both* Phase 1 and Phase 2. ```{r} dat <- dat %>% filter(RESULTFQ_1 == 1 & RESULTFQ_2 == 1) dat %>% count(RESULTFQ_1, RESULTFQ_2) ``` --- Additionally, PMA samples are only valid for the *de facto* population: women who slept in the household the night before the Household interview. See [RESIDENT](https://pma.ipums.org/pma-action/variables/RESIDENT) ```{r} dat %>% count(RESIDENT_1) ``` We'll also drop cases where the woman was not part of the *de facto* population in either Phase 1 or Phase 2. ```{r} dat <- dat %>% filter(RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22)) ``` --- How many cases remain? ```{r} dat %>% count(COUNTRY) ``` # 2 - Recoding Independent variables --- PMA surveys contain many **categorical** variables. These are usually represented as **factors** in R. In an IPUMS data extract, you won't see factors! Instead, we generate **labelled** numeric variables (note the label in brackets). ```{r} dat %>% ipums_var_label(CVINCOMELOSS_2) dat %>% count(CVINCOMELOSS_2) ``` --- The [ipumsr](https://tech.popdata.org/ipumsr/index.html) package contains tools for working with labelled IPUMS data. Usually, we handle codes like `99 [NIU (not in universe)]` before transforming other missing data to `NA`. ```{r} dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2) ``` --- **Tip:** Information the code `NIU (not in universe)` can always be found on a variable's [universe tab](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS#universe_section).
```{r, echo = FALSE} knitr::include_graphics("images/universe.png", dpi = 300) ```
--- For [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS), `99 [NIU (not in universe)]` may indicate that the household experienced *no income loss in the last year*, or it may indicate that [HHINCOMELOSSAMT_2](https://pma.ipums.org/pma-action/variables/HHINCOMELOSSAMT) is `98 [No response or missing]`. We should treat the `NIU` women from households without *any* income loss as "No" in [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS). ```{r} dat <- dat %>% mutate( CVINCOMELOSS_2 = CVINCOMELOSS_2 %>% labelled::recode_if(HHINCOMELOSSAMT_2 == 1, 0) ) dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2) ``` --- Next, we'll use `NA` to represent the remaining values above `90`: * `97 [Don't know] ` and * remaining cases marked `99 [NIU (not in universe)]` ```{r} dat <- dat %>% mutate( CVINCOMELOSS_2 = CVINCOMELOSS_2 %>% lbl_na_if(~.val > 90) ) dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2) ``` --- Once you're done with labels, we recommend transforming key variables into **factors** with [forcats::as_factor](https://forcats.tidyverse.org/reference/as_factor.html). The [forcats](https://forcats.tidyverse.org) package is included when you load `library(tidyverse)`. ```{r} dat <- dat %>% mutate(CVINCOMELOSS_2 = as_factor(CVINCOMELOSS_2)) dat %>% count(CVINCOMELOSS_2) ``` This will make categorical variables easier to use in data visualization and as "dummy" variables in regression analysis. --- Likert-style questions can be treated as factors, too. ```{r} dat %>% ipums_var_label(COVIDCONCERN_2) dat %>% count(COVIDCONCERN_2) ``` --- This time we'll treat codes `5` and above as `NA`. ```{r} dat <- dat %>% mutate( COVIDCONCERN_2 = COVIDCONCERN_2 %>% lbl_na_if(~.val >= 5) %>% as_factor() ) dat %>% count(COVIDCONCERN_2) ``` --- You can apply the same transformation to several variables with help from [dplyr::across](https://dplyr.tidyverse.org/reference/across.html). [dplyr]() is another package included when you load `library(tidyverse)`. ```{r} dat <- dat %>% mutate( across( c(COUNTRY, URBAN, WEALTHT_2, EDUCATTGEN_2), ~.x %>% lbl_na_if(~.val >= 90) %>% as_factor() ) ) ``` --- Often, it's important to set a **reference group** against which all dummy variables will be compared. You can manually specify a **refernece group** when you set factor "levels" with a function like [forcats::fct_relevel](https://forcats.tidyverse.org/reference/fct_relevel.html). ```{r} dat <- dat %>% mutate( AGE_2 = case_when( AGE_2 < 25 ~ "15-24", AGE_2 < 35 ~ "25-34", AGE_2 < 50 ~ "35-49" ), AGE_2 = AGE_2 %>% fct_relevel("15-24", "25-34", "35-49") ) ``` # 3 - Dependent variables --- We'll use our recoded variables to model the likelihood of contraceptive method **adoption** and **discontinuation** between phases. See [CP](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS) ```{r} dat <- dat %>% filter(CP_1 < 90 & CP_2 < 90) dat %>% count(CP_1, CP_2) ``` --- A woman has **adopted** a method if she was *not* using one at Phase 1, but then reported using one at Phase 2. She has **discontinued** a method if she *did* use one at Phase 1, but no longer uses one at Phase 2. ```{r} dat <- dat %>% mutate( FPSTATUS = case_when( CP_1 == 1 & CP_2 == 1 ~ "User", CP_1 == 0 & CP_2 == 0 ~ "Non-user", CP_1 == 1 & CP_2 == 0 ~ "Discontinued", CP_1 == 0 & CP_2 == 1 ~ "Adopted" ), FPSTATUS = fct_infreq(FPSTATUS) ) ``` --- Un-weighted sample proportions for `FPSTATUS` can be found with [count](https://dplyr.tidyverse.org/reference/count.html) and [prop.table](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/prop.table): ```{r} dat_nowt <- dat %>% group_by(COUNTRY) %>% count(FPSTATUS) %>% mutate(prop = prop.table(n)) dat_nowt ``` We'll plot this table with [ggplot2](https://ggplot2.tidyverse.org/index.html) (also included with the [tidyverse](https://tidyverse.tidyverse.org/)). --- ```{r, fig.height=4, fig.width=10} dat_nowt %>% ggplot(aes(x = prop, y = FPSTATUS, fill = FPSTATUS)) + geom_bar(stat = "identity") + facet_wrap(~COUNTRY) + theme_minimal() + theme(axis.title = element_blank(), legend.position = "none") + scale_x_continuous(labels = scales::label_percent()) ``` --- For *weighted* population estimates, use [as_survey_design](http://gdfe.co/srvyr/reference/as_survey_design.html) and [survey_mean](http://gdfe.co/srvyr/reference/survey_mean.html) from the [srvyr](http://gdfe.co/srvyr/index.html) package. Use `prop = TRUE` to adjust standard errors near 0% or 100% for proportions. ```{r} dat_wtd <- dat %>% as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>% group_by(COUNTRY, FPSTATUS) %>% summarise(survey_mean(prop = TRUE, prop_method = "logit", vartype = "ci")) dat_wtd ``` --- ```{r, fig.height=4, fig.width=10} dat_wtd %>% ggplot(aes(x = coef, y = FPSTATUS, fill = FPSTATUS)) + geom_bar(stat = "identity") + geom_errorbar(aes(xmin = `_low`, xmax = `_upp`), width = 0.2, alpha = 0.5) + facet_wrap(~COUNTRY) + theme_minimal() + theme(axis.title = element_blank(), legend.position = "none") + scale_x_continuous(labels = scales::label_percent()) ``` # 4 - Analysis --- The same [srvyr](http://gdfe.co/srvyr/index.html) toolkit can be used to model our dependent variables with [survey::svyglm](http://r-survey.r-forge.r-project.org/survey/). Consider women who were *not* using a method at Phase 1: ```{r} adopt_glm <- dat %>% filter(CP_1 == 0) %>% mutate(adopt = FPSTATUS == "Adopted") %>% group_by(COUNTRY) %>% summarise( adopt = cur_data() %>% as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>% svyglm( adopt ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2, family = "quasibinomial", design = . ) %>% broom::tidy(exp = TRUE) %>% mutate(sig = gtools::stars.pval(p.value)) %>% list() ) adopt_glm ``` --- For Phase 1 non-users in Burkina Faso, **very high** levels of concern about becoming infected with COVID-19 are significantly associated with higher chances of adopting a contraceptive method (relative to women who had no such concern). Lesser levels of concern are not statistically significant, nor is household income loss from COVID-19. ```{r} adopt_glm %>% filter(COUNTRY == "Burkina Faso") %>% unnest(adopt) ``` --- In Kenya, neither of these measures are significantly predictive of adoption among non-users. ```{r} adopt_glm %>% filter(COUNTRY == "Kenya") %>% unnest(adopt) ``` --- What about method **dicontinuation** for women who *were* using a method at Phase 1? ```{r} stop_glm <- dat %>% filter(CP_1 == 1) %>% mutate(stop = FPSTATUS == "Discontinued") %>% group_by(COUNTRY) %>% summarise( stop = cur_data() %>% as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>% svyglm( stop ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2, family = "quasibinomial", design = . ) %>% broom::tidy(exp = TRUE) %>% mutate(sig = gtools::stars.pval(p.value)) %>% list() ) stop_glm ``` --- This time, neither of the COVID-19 measures are significantly associated with **discontinuation** for Phase 1 contraceptive users in Burkina Faso. ```{r} stop_glm %>% filter(COUNTRY == "Burkina Faso") %>% unnest(stop) ``` --- However, higher levels concern with becoming infected with COVID-19 *are* significantly associated with higher odds of discontinuation for Phase 1 contraceptive users in Kenya. ```{r} stop_glm %>% filter(COUNTRY == "Kenya") %>% unnest(stop) ``` --- For more R tips for IPUMS data, check out: * The [IPUMS PMA blog](https://tech.popdata.org/pma-data-hub/) * The [ipumsr](https://tech.popdata.org/ipumsr/) documentation website * The [ipums tutorials](https://www.ipums.org/support/tutorials) page Thank you!