Analysing survey data with complex sampling in R

This sections covers: regression models and other advanced applications using European Social Survey data.

It also provides space to practice reading in a new dataset and survey design object.

Welcome to the afternoon session!

Please download the ESS data before preceding

Go back to the “Set up” page for further instructions.

Data import

We will be importing the second dataset, the European Social Survey.

The data was downloaded as a .csv file, so we use the read_csv function from base R.

Re-specify the “root” function if this value is no longer in your working directory. Remember to adjust this code to match the pathname to the folder where your workshop data is saved.

Code

root <- "C://Users//s1769862//OneDrive - University of Edinburgh//SLTF-workshop-August2024//" # change with where the folder lies in your filepath

If you downloaded the .csv file as a single folder from ESS without altering any file names or structure the data can be imported with this relative path and the root directory specified above.

Code

ess9 <- read.csv(file.path(root, "ESS9e03_2", "ESS9e03_2.csv"))

Notice that variables from the .csv file imported as more typical classes, and do not have haven labelling like the .dta import.

Code

class(ess9$cntry) # character for letter input

[1] "character"

Code

class(ess9$nwspol) # integer for numeric input

[1] "integer"

Code

class(ess9$ppltrst) # integer for numeric input

[1] "integer"

Data cleaning

The ESS data is very large and complex, so we will only clean a few variables to use in this example. The variety of the data also means we cannot apply a single logic for recoding all missing values, as was done with the Children’s Worlds Survey.

Caution

Remember to complete all cleaning and wrangling before setting the survey design object. However, do not restrict cases!

We will clean a few variables of interest.

ppltrst (“Most people can be trusted or you can’t be too careful”)
trstprl (“Trust in country’s parliament”)
nwspol (Minutes spent consuming news)
vote (“Voted in last election”)
gndr (“Respondent gender”)
agea (“Respondent age)

Code

# Missing transformation for ordinal variables
ess9$trstprl[ess9$trstprl>10] <- NA

# cutoffs for continuous variables
ess9$nwspol[ess9$nwspol>7000] <- NA
ess9$agea[ess9$agea>6000] <- NA

Code

ess9 <- ess9 %>%
  dplyr::mutate(vote = as.factor(
                  case_when(vote == 1 ~ "Yes",
                            vote == 2 ~ "No",
                            .default = NA)),
                gender = as.factor(
                  case_when(gndr == 1 ~ "Male",
                            gndr == 2 ~ "Female",
                            .default = NA)),
                age = as.factor(
                  case_when(agea >17 & agea <26 ~ "18 to 25",
                            agea >25 & agea <36 ~ "25 to 35",
                            agea >35 & agea <50 ~ "36 to 49",
                            agea >49 & agea <65 ~ "50 to 64",
                            agea >64 & agea <115 ~ "65+",
                            agea <18 | agea >115 ~ NA)))

Sampling design object

The ESS makes our job easy and provides detailed guidance on setting the survey design object.

According to the guide our survey design variables include:

psu

The primary sampling unit, the smallest grouping where individual respondents are sampled from.

stratum

The higher level stratification of the sampling design.

anweight

The analysis weight, which is suitable for all analyses.

pspweight

The post-stratification weight, adjusting for non-response errors.

design

The design weight, adjusting for the probability of selection into the sample.

ess9

The name of our data from in R working memory.

The documentation provides further detail about the weights because there are multiple that are available.

Analysis weight

“Anweight corrects for differential selection probabilities within each country as specified by sample design, for nonresponse, for noncoverage, and for sampling error related to the four post-stratification variables, and takes into account differences in population size across countries. It is constructed by first deriving the design weight, then applying a post-stratification adjustment, and then a population size adjustment.”

Post-stratification weight

“While the design weights account for differences in inclusion probabilities, sampling errors (related to attempting to measure only a fraction of the population) and possible non-response errors (which may lead to a systematic over- or under-representation of people with certain characteristics) are still present. Post-stratification weights are a more sophisticated weighting strategy that uses auxiliary information to reduce the sampling error and potential non-response bias. They have been constructed using information on age group, gender, education, and region. The post-stratification weights are obtained by adjusting the design weights in such a way that they will replicate the distribution of the cross-classification of age group, gender, and education in the population and the marginal distribution for region in the population.”

Design weight

“Several countries use complex sampling designs where some groups or regions of the population have higher probabilities of selection. The main purpose of the design weights is to correct for the fact that in some countries respondents have different probabilities to be part of the sample due to the sampling design used. Applying the weights allows for the construction of design unbiased estimators. The design weights are computed as the inverse of the inclusion probabilities, i.e. the probability of each person to be included into the sample. The inverse inclusion probabilities are then scaled such that their sum equals the net sample size and the mean equals one.”

The analysis weight anweight is a combination of the post-stratification weight pspweight and the population weight pweight. ESS weighting guidelines recommend using anweight for general analysis, so we will employ the recommended specification in the example today.

Create the survey design object.

Code

design2 <- survey::svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight, data = ess9)

Regression

Now we have the tools to dive into regression!

The survey package builds on the “General Linear Model” glm model framework, which underpins many popular regression functions in R. If you know the syntax for the glm you can apply pretty much any decisions to the svyglm function.

The glm function has several components:

Specify the relationship between our independent and dependent variables as a formula.

For example, outcome \(y\) with predictive variables \(x_1\), \(x_2\), or (\(y = x_{1} + x_{2} + x_{3}\))

would be expressed in the function as y ~ x1 + x2 + x3.

The family of distributions to apply to your formula.

For example, gaussian, quasi-binomial, binomial, or poisson.

The linking function for the algorithm that performs the regression calculation.

For example, identity, logit, or inverse. It does not need to be specified if defaults for the family are correct.

The name of your data frame in R working memory.

From the user’s perspective, the primary difference between calling the svyglm() and regular glm() function is specifying the survey design object design= in place of the data frame data=. In our case, this means specifying design2 instead of ess9.

Linear regression

Let’s try a simple ordinary least squares (OLS) linear regression.

Is the amount of time one spends consuming news associated with how old they are?

If news consumption is \(y\) and age is \(x\) then our formula is simply nwspol ~ gender .

Code

options(survey.lonely.psu="adjust") # adjust for lonely psu

summary(survey::svyglm(nwspol ~ age, family=gaussian, design=design2))


Call:
svyglm(formula = nwspol ~ age, design = design2, family = gaussian)

Survey design:
survey::svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight, 
    data = ess9)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   79.831      4.670  17.093  < 2e-16 ***
age25 to 35    5.072      5.561   0.912  0.36175    
age36 to 49    3.880      5.066   0.766  0.44367    
age50 to 64   17.284      5.511   3.136  0.00171 ** 
age65+        39.109      4.874   8.023 1.08e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 27434.39)

Number of Fisher Scoring iterations: 2

It appears that age has a significant association with new consumption. The time spent consuming news media is elevated for the highest age groups compared to the youngest.

We can also include tbl_regression() to make our model output more readable.

Code

survey::svyglm(nwspol ~ age, family=gaussian, design=design2) %>%
  gtsummary::tbl_regression() %>%
  modify_caption("Predicting time spent watching news") %>%
  bold_p(t=0.05)

Predicting time spent watching news
Characteristic	Beta	95% CI¹	p-value
age
18 to 25	—	—
25 to 35	5.1	-5.8, 16	0.4
36 to 49	3.9	-6.0, 14	0.4
50 to 64	17	6.5, 28	0.002
65+	39	30, 49	<0.001
¹ CI = Confidence Interval

Adjusting for lonely primary sampling unit

If we run this code without the adjust for longely PSU we receive the following error message: Error in onestrat(attr<-(x[index, , drop = FALSE], "recentering", recentering), : Stratum (1680) has only one PSU at stage 1

This is quite a common error message. It means that there are not sufficient primary sampling units within a stratum, so standard calculations are failing.

A commonly suggested solution is to allow the survey package to adjust for “lonely” PSUs. You can research all adjustment options and decide which is best for your use case.

The default, which causes an error message and refuses to run the calculation.

Removes the single PSU so it has no contribution to variance calculations.

Also specifies that the single PS will have to contribution to variance

Centres the single PSU at the sample grand mean rather than the stratum grand mean.

We will set options(survey.lonely.psu="adjust") which is suggested as good choice when the problem is believe to be caused by random missingness, rather than an issue with the sampling design.

Binary logistic regression

The survey package can handle multiple types of linear regressions, including binary logistic regression.

Let us investigate the probability of individuals having voted in their last national election, given their trust in political institutions.

Note

Including exponentiate = T provides the coefficient estimates in odds ratios.

Code

options(survey.lonely.psu="adjust") # adjust for lonely psu

survey::svyglm(vote ~ trstprl, family= quasibinomial, design=design2) %>% gtsummary::tbl_regression(exponentiate=T)

Characteristic	OR¹	95% CI¹	p-value
trstprl	1.15	1.13, 1.17	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Remember that 0 refers to a state of maximum distrust and 10 signifies the highest state of trust. Therefore, these model results indicate that there is a positive relationship between increased trust in parliament and whether a respondent voted in their last national election.

Just as with descriptive tables, we can adjust the regression summary table to include more details.

Code

survey::svyglm(vote ~ trstprl, family= quasibinomial, design=design2) %>%
  gtsummary::tbl_regression(exponentiate = T,
                            label = c(trstprl~"Trust in parliament")) %>%
  bold_p(t=0.05) %>%
  modify_caption("Predicting the odds of having voted in the last national election") %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("Data source: European Social Survey, 2018"))

Predicting the odds of having voted in the last national election
Characteristic	OR¹	95% CI¹	p-value
Trust in parliament	1.15	1.13, 1.17	<0.001
Data source: European Social Survey, 2018
¹ OR = Odds Ratio, CI = Confidence Interval

We can alter the formula for our model we to add control variables to the binary logit.

Code

survey::svyglm(vote ~ trstprl + gender + age, family=quasibinomial, design = design2) %>%
  gtsummary::tbl_regression(exponentiate = T,
                            label = c(trstprl~"Trust in parliament")) %>%
  bold_p(t=0.05) %>%
  modify_caption("Predicting the odds of having voted in the last national election with control variables") %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("Data source: European Social Survey, 2018"))

Predicting the odds of having voted in the last national election with control variables
Characteristic	OR¹	95% CI¹	p-value
Trust in parliament	1.17	1.15, 1.19	<0.001
gender
Female	—	—
Male	1.04	0.96, 1.12	0.4
age
18 to 25	—	—
25 to 35	1.43	1.24, 1.66	<0.001
36 to 49	2.06	1.79, 2.37	<0.001
50 to 64	3.01	2.61, 3.47	<0.001
65+	3.41	2.95, 3.94	<0.001
Data source: European Social Survey, 2018
¹ OR = Odds Ratio, CI = Confidence Interval

Our result remains significant, even when controlling for age and gender. We can also see that as age increase so does the odds of having voted in the last election.

Investigating the influence of the survey adjustments

What would happen if we did not adjust for survey weighting?

We can run the same model and below with the base glm function and compare the results.

Code

glm(vote ~ trstprl + gender + age, family=quasibinomial, dat= ess9) %>%
  gtsummary::tbl_regression(exponentiate = T,
                            label = c(trstprl~"Trust in parliament")) %>%
  bold_p(t=0.05) %>%
  modify_caption("Predicting the odds of having voted in the last national election with control variables") %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("Sampling adjustment *not* applied. Data source: European Social Survey, 2018"))

Predicting the odds of having voted in the last national election with control variables
Characteristic	OR¹	95% CI¹	p-value
Trust in parliament	1.17	1.16, 1.19	<0.001
gender
Female	—	—
Male	1.06	1.01, 1.12	0.010
age
18 to 25	—	—
25 to 35	1.71	1.55, 1.88	<0.001
36 to 49	2.38	2.18, 2.60	<0.001
50 to 64	3.32	3.03, 3.62	<0.001
65+	3.91	3.58, 4.28	<0.001
Sampling adjustment not applied. Data source: European Social Survey, 2018
¹ OR = Odds Ratio, CI = Confidence Interval

Strikingly, the effect of gender on voting behaviour becomes significant when the sampling design is not accounted for. This indicates that we would erroneously report a relationship that is not robust if tools for complex samples were not utilised.

Task 8

Construct your own regression analysis and display the results in a gtsummary table!

Tip

Remember to convert any missing values to NA and ensure that your columns are the correct class for your analysis. If you make any changes to the original data frame in your data cleaning, remember to specify a new survey design object with the updated data frame.

Code

## Write your own code!

## hint: copy and past code from previous chunks

Interaction effects

We can also specify interaction effects by modifying the regression formula. For example, I may wonder if trust in politics has a different influence on past voting behaviour by gender.

Code

survey::svyglm(vote ~ trstprl*age + gender, family=quasibinomial, design = design2) %>%
  gtsummary::tbl_regression(exponentiate = T,
                            label = c(trstprl~"Trust in parliament")) %>%
  bold_p(t=0.05) %>%
  modify_caption("Predicting the odds of having voted in the last national election with interaction effects") %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("Data source: European Social Survey, 2018"))

Predicting the odds of having voted in the last national election with interaction effects
Characteristic	OR¹	95% CI¹	p-value
Trust in parliament	1.17	1.12, 1.23	<0.001
age
18 to 25	—	—
25 to 35	1.58	1.18, 2.13	0.002
36 to 49	2.00	1.50, 2.65	<0.001
50 to 64	2.86	2.13, 3.86	<0.001
65+	3.39	2.53, 4.56	<0.001
gender
Female	—	—
Male	1.03	0.96, 1.12	0.4
Trust in parliament * age
Trust in parliament * 25 to 35	0.98	0.92, 1.03	0.4
Trust in parliament * 36 to 49	1.01	0.95, 1.07	0.8
Trust in parliament * 50 to 64	1.01	0.96, 1.07	0.7
Trust in parliament * 65+	1.00	0.94, 1.06	>0.9
Data source: European Social Survey, 2018
¹ OR = Odds Ratio, CI = Confidence Interval

Looks like my theory was not supported by the data. Although trust in politicians and age both have main effects on voting, there is not evidence of a significant interaction effect.

Save data

To transport our analysis to the final session we can save it as an RData file to the workshop folder.

Code

saveRDS(ess9, file = file.path(root, "Data", "ess9.RData"))

Next: Go to Part 2.2: Extensions