Analysing survey data with complex sampling in R - 1.2: Survey adjusted descriptive statistics

This section covers: creating a survey design object, adjusted descriptive statistics and tables!

Load data

Remember that we saved the wrangled dataset to the working directory in the previous section as an RData file in our workshop folder.

Code

root <- "C://Users//s1769862//OneDrive - University of Edinburgh//SLTF-workshop-August2024//" # change with where the folder lies in your filepath
cws <- readRDS(file= file.path(root, "Data", "cws.RData"))

Sampling design

How do we handling the sampling design of this survey? First, we read the Data Documentation to learn more about the methodology and any population correcting weighting that we can apply.

Take a minute to search for this information on your own. What clues can you draw out?

Looking at the documentation, some key elements appear to be….

1.2 Sampling strategy and outcomes (pg. 4)

“The England sample was designed to achieve a nationally representative sample of children in school….. The primary sampling unit was schools. Separate samples were drawn for Years 4 and 6 (primary school education) and Year 8 (secondary school education). Both samples followed the same methodology. First, a complete list of schools in England was stratified into five groups by the proportion of children receiving free school meals (a very rough indicator of economic prosperity).”

“These groups were approximate quintiles (based on numbers of pupils in each stratum). The approximation was because of a lack of precision in the data available on free school meal entitlement. Within each stratum schools were selected randomly with probabilities proportional to size (number of pupils), with the aim of achieving a target of at least eight schools per stratum. Within each selected school, one class group (not grouped on pupil ability) was randomly selected.”

Reading this paragraph, we can see that this dataset has…

England divided into five stratum
Schools are the primary sampling unit within each stratum
Individual weights to readjust for sample representativeness

From this information, we can create what is called a “survey design object”. This is where the survey design information is encoded into our R working environment and can be called into other statistical functions.

Note

We can also look at the Survey package documentation for more details and help using ?? survey or help(svydesign) into the console

The “ids” indicates the primary sampling unit (PSU). This can be thought of as the smallest unit which individuals are sampled from.

Formula for cluster sampling probabilities.

How the population was originally divided.

Finite population correction.

The individual probability of selection.

The name of the data frame in R working memory.

In our dataset the column names for these are:

ids: SchoolRef
strata: Strata
weights: Weight
data: cws ¹.

We do not have a finite population correction or cluster sampling probability information, so we leave these fields blank.

Tip

We also include “check.strata = TRUE” as a sense check to ensure that our primary sampling units (the schools) are all nested within our strata. This should return nothing if all is well.

Code

design1 <- survey::svydesign(ids = ~SchoolRef, strata =~Strata, weights = ~Weight, data =cws, check.strata=T)

We now have a survey design object which stores both our original dataset and the encoded information about its sampling and weighting.

Descriptive statistics

Now, we can start to use the survey design object to display statistics that are adjusted for the sampling design and weighting.

Code

survey::svymean(~LifeGoingWell, design1, na.rm=TRUE)

                mean     SE
LifeGoingWell 8.3951 0.1027

Let’s compare these to the unweighted version, using “base” r.

How does accounting for the sampling design and population weighting change our estimates?

Code

# estimate un-weighted mean and sd 
base::mean(cws$LifeGoingWell, na.rm=T) # mean

[1] 8.451588

Code

## compare with survey weighted estimates
(survey::svymean(~LifeGoingWell, design1, na.rm=T)) - (mean(cws$LifeGoingWell, na.rm=T)) # can see a modest correction in the estimate

                   mean     SE
LifeGoingWell -0.056477 0.1027

Comparing the two estimates, we can see that the estimate for “life going well” is slightly lower for the population estimates compared to the face-value calculation.

Variance and standard error

We can also see that the standard error around the mean is larger for the survey weighted result. Therefore, our error estimates would be underestimated if we assumed a simple random sample and ignored the sampling design.

Code

sd((cws$LifeGoingWell), na.rm=TRUE)/sqrt(length(!is.na(cws$LifeGoingWell))) # standard error

[1] 0.05800934

Code

survey::svyvar(~LifeGoingWell, design1, na = TRUE) # survey weighted estimate

              variance     SE
LifeGoingWell   4.6378 0.4703

Deciles

Unfortunately, the decile results do not display well in the .quarto website output. However, I recommend that you run these lines on your own machine to see the raw output

Code

survey::svyquantile(~LifeGoingWell, design1, na = TRUE, c(.25,.5,.75),ci=TRUE)

Code

survey::svyquantile(~LifeGoingWell, design1, na = TRUE, c(.20,.40,.60,.80),ci=TRUE)

Tables

We can also create adjusted tables with the survey package.

Code

survey::svytable(~LifeGoingWell, design1)

LifeGoingWell
         0          1          2          3          4          5          6 
 11.810352   4.225975  16.652450  26.043925  42.924747  43.591784  60.086619 
         7          8          9         10 
101.180865 166.925511 241.002194 572.818609

We can see how these estimates differ from the unweighted estimates provided in from the base table estimates.

Code

base::table(cws$LifeGoingWell)


  0   1   2   3   4   5   6   7   8   9  10 
 10   6  14  21  41  48  61  95 166 239 590

Tables are helpful for seeing the weighting option in our survey design object at work. However, they are not very pretty! This becomes more of a problem for larger or more complex tables, e.g. cross tabulations.

We also often want to present multiple types of descriptive statistics, e.g. the adjusted variance or other attributes of a distribution, within the same table.

For this, we need a better table package.

Tables with gtsummary

We can improve descriptive statistics summaries by utilising table packages.

The GT framework has a good set of table output options for descriptive statistics, bivariate and regression results.

The the gtsummary package is a wrapper for survey that was inspired by GT but makes some handy decisions and incorporates the survey design into the descriptive statistics reporting.

Code

gtsummary::tbl_svysummary(design1,
                          include = c("FrequencyHelpHousework", "HeardOfUNCRC"),
                          missing_text = "NA")

Characteristic	N = 1,319¹
FrequencyHelpHousework
About every day	614 (48%)
Less than weekly	132 (10%)
Once or twice a week	440 (34%)
Rarely or never	94 (7.4%)
NA	38
HeardOfUNCRC
No	278 (22%)
Not sure	600 (47%)
Yes	388 (31%)
NA	53
¹ n (%)

We can also make tables for continuous variables

Code

gtsummary::tbl_svysummary(design1,
                          include = c("FeelPositiveFuture", "FeelingHappy", "PeopleFriendly", "LifeJustRight"),
                          statistic = list(all_continuous() ~ "{mean} ({sd})"),
                          missing = "no")

Characteristic	N = 1,319¹
I feel positive about my future	8.20 (2.32)
Last two weeks: How often feeling happy	8.34 (2.40)
People are generally pretty friendly towards me	8.23 (2.37)
My life is just right	8.14 (2.47)
¹ Mean (SD)

Summary table example 2

Here is another example comparing the weighted and unweighted summary tables for changes in household spending from the 12-year-olds’ perspectives.

First, a table using base R.

Code

base::table(cws$FamilyMoneyChange, exclude=F)


   1    2    3 <NA> 
 425  316  134  444

Code

gtsummary::tbl_summary(cws, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1)))

Characteristic	N = 1,319¹
Q15 Compared to a year ago, how much money does your family have now?
1	425 (48.6%)
2	316 (36.1%)
3	134 (15.3%)
Unknown	444
¹ n (%)

If we wish to have the categorical responses listed, rather than factor numeric values, we need to change the dataset itself.

Code

cws <- cws %>%
  mutate(FamilyMoneyChange = as.factor(case_when(
    FamilyMoneyChange == 1 ~ "We have more money than a year ago",
    FamilyMoneyChange == 2 ~ "We have about the same as a year ago",
    FamilyMoneyChange == 3 ~ "We have less money than a year ago")))


gtsummary::tbl_summary(cws, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Unweighted frequences for 12-year-olds perceptions' of family finances") %>% modify_footnote(update = everything() ~ "Data Source: 2013/14 Childrens World Survey, England")

Unweighted frequences for 12-year-olds perceptions’ of family finances
Characteristic¹	N = 1,319¹
FamilyMoneyChange
We have about the same as a year ago	316 (36.1%)
We have less money than a year ago	134 (15.3%)
We have more money than a year ago	425 (48.6%)
Unknown	444
¹ Data Source: 2013/14 Childrens World Survey, England

If we have changed anything in the original dataset, such as recoding a variable (even to just make it more clearly labelled!) this will not automatically updated in the survey design object. If we wish to update this, we need to recreate it with the updated dataset.

Tip

It is often best practice to complete all data manipulation (e.g. recodes) before setting the survey design object.

Code

design1 <- survey::svydesign(ids = ~SchoolRef, strata =~Strata, weights = ~Weight, data =cws, check.strata=T)

Now, if we run the same code as the previous chunk, it will updated to have the character names for our categorical variable encoded in the table.

Code

gtsummary::tbl_svysummary(design1, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Weighted frequences for 12-year-olds perceptions' of family finances") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")

Weighted frequences for 12-year-olds perceptions’ of family finances
Characteristic	N = 1,319¹
FamilyMoneyChange
We have about the same as a year ago	339 (38.0%)
We have less money than a year ago	143 (16.1%)
We have more money than a year ago	410 (46.0%)
Unknown	427
¹ Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied

NA column

Notice that the “number” of missing values has changed between the unweighted, base R table and the gtsummary table. This is because the distribution of item level non response are associated with different weightings. Larger NA estimates in the weighted tables indicate that observations which did not offer a valid response to the question have higher weighting, on average, resulting in enlarged influence of this category in the nationally representative statistics

Task 3

Produce a weighted frequency table of a variable of interest.

Now, try it yourself! Create survey weighted summary statistics for a different variables in the dataset.

Hint

Look up the data documentation with help(tbl_svysymmary)

What settings can you play around with?
How can you use this tool to explore your dataset?

Code

## enter your own code here



## hint: copy segments from previous code chunks

A possible solution can be viewed below:

Code

gtsummary::tbl_svysummary(design1, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Weighted frequences for 12-year-olds perceptions' of family finances") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")

Weighted frequences for 12-year-olds perceptions’ of family finances
Characteristic	N = 1,319¹
FamilyMoneyChange
We have about the same as a year ago	339 (38.0%)
We have less money than a year ago	143 (16.1%)
We have more money than a year ago	410 (46.0%)
Unknown	427
¹ Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied

Cross tabulations

We can also calculate cross tabulations which take into account our survey design corrections and weighting by utilising the “by” option in tbl_svysummary

Code

gtsummary::tbl_svysummary(design1,
                          include = c("FeelPositiveFuture", "FeelingHappy", "PeopleFriendly", "LifeJustRight", "FamilyMoneyChange"),
                          by = FamilyMoneyChange,
                          statistic = list(all_continuous() ~ "{mean} ({sd})"))

Characteristic	We have about the same as a year ago, N = 339¹	We have less money than a year ago, N = 143¹	We have more money than a year ago, N = 410¹
I feel positive about my future	8.25 (2.11)	7.30 (2.68)	8.67 (2.12)
Unknown	14	3	9
Last two weeks: How often feeling happy	8.38 (2.12)	7.21 (2.85)	8.79 (2.11)
Unknown	4	4	16
People are generally pretty friendly towards me	8.32 (2.13)	7.30 (2.88)	8.69 (2.03)
Unknown	9	0	3
My life is just right	8.34 (2.04)	7.05 (2.84)	8.47 (2.37)
Unknown	3	2	14
¹ Mean (SD)

If we use two categorical variables instead of one categorical and one “continuous” (an ordinal Likert scale, in this case), the cross table will also compute a \(x^2\) statistic when we do the add_p() option.

Code

gtsummary::tbl_svysummary(design1, include = c("FrequencyWatchTV", "FrequencySportsExercise"), by = FrequencyWatchTV) %>%
  modify_caption("Cross tabulation of time use by 12 year olds in England") %>%
  add_p()%>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied") %>%
  modify_header(label ~ "FrequencyWatchTV")

Cross tabulation of time use by 12 year olds in England
FrequencyWatchTV	About every day, N = 1,056¹	Less than weekly, N = 46¹	Once or twice a week, N = 162¹	Rarely or never, N = 16¹	p-value²
FrequencySportsExercise					<0.001
About every day	574 (55%)	16 (34%)	64 (41%)	5 (29%)
Less than weekly	58 (5.6%)	13 (29%)	18 (12%)	0 (0%)
Once or twice a week	345 (33%)	11 (25%)	67 (42%)	9 (58%)
Rarely or never	60 (5.8%)	6 (13%)	9 (5.6%)	2 (13%)
Unknown	20	0	4	0
¹ Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
² chi-squared test with Rao & Scott’s second-order correction

We can produce a variety of bivariate test results within our gtsummary tables, including t-tests.

A full list can be found here

Code

gtsummary::tbl_svysummary(design1,
                          include = c("Gender", "SatisfiedAppearance"),
                          by = Gender) %>%
  modify_caption("Cross tabulation of satisfaction with personal appearance from 12 year olds in England, by child's gender") %>%
  add_p(test = all_continuous() ~ "svy.t.test") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")

Cross tabulation of satisfaction with personal appearance from 12 year olds in England, by child’s gender
Characteristic	Boy, N = 673¹	Girl, N = 634¹	p-value²
Satisfaction with: The way that you look	9.0 (7.0, 10.0)	7.0 (5.0, 9.0)	<0.001
Unknown	11	13
¹ Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
² t-test adapted to complex survey samples

Task 4

Produce a weighted cross tabulation of two variables of interest and include an appropriate inferential test for the type of variable.

Code

## your code goes here

Summary tips

Read the dataset documentation clearly to identify all elements of the sampling design provided. Ensure there are no missing values for any columns with design information.
If we make any changes to our data set we need to update the survey design object.
The gtsummary package is great, but the syntax is confusing at first! Get used to spending time reading its helpful data documentation.

Sub-setting

Previously we discussed that it is important to not alter the number of cases (in this case, individuals in the dataset) from the full number utilised by the data managers to calculate the survey weights.

What happens when we change our dataset or work with a sub-sample?

We may wish to restrict our data frame due to..

Set of columns	Sub-sample
Different number of columns	Different number of units
Need to specify new dataset	Need to specify the reduction in units
A matter of code change	A matter of mathematical change

We already demonstrated the latter in a previous example. Now, we explore how to conduct sub-sample analyses.

Sub-sample analysis: by gender

We create a new design object which is a subset of our original object by a condition of our choice

Code

design.subset <- subset(design1, Gender == "Boy")

We can compare the number of primary sampling units (in this case schools!) between our original design and the subset

Code

summary(design.subset)

Stratified 1 - level Cluster Sampling design (with replacement)
With (33) clusters.
subset(design1, Gender == "Boy")
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2008  0.6842  1.2032  2.0543  4.2270  5.6452 
Stratum Sizes: 
             1  2   3   4   5
obs        246 98 110 106 110
design.PSU   8  7   7   6   7
actual.PSU   7  6   7   6   7
Data variables:
  [1] "QRef"                          "SchoolRef"                    
  [3] "Strata"                        "Weight"                       
  [5] "Age"                           "Gender"                       
  [7] "BornThisCountry"               "NHomes"                       
  [9] "HomeType"                      "FirstHomeMother"              
 [11] "FirstHomeFather"               "FirstHomeMPartner"            
 [13] "FirstHomeFPartner"             "FirstHomeGrandmother"         
 [15] "FirstHomeGrandfather"          "FirstHomeSiblings"            
 [17] "FirstHomeOtherChildren"        "FirstHomeOtherAdults"         
 [19] "SecondHomeMother"              "SecondHomeFather"             
 [21] "SecondHomeMPartner"            "SecondHomeFPartner"           
 [23] "SecondHomeGrandmother"         "SecondHomeGrandfather"        
 [25] "SecondHomeSiblings"            "SecondHomeOtherChildren"      
 [27] "SecondHomeOtherAdults"         "HomeSafe"                     
 [29] "HomePlaceToStudy"              "ParentsListen"                
 [31] "FamilyGoodTimeTogether"        "ParentsTreatFairly"           
 [33] "ParentsHelpProblems"           "ParentsAskWhereGoing"         
 [35] "ParentsKeepTrackSchool"        "ParentsRemindWashShower"      
 [37] "ParentsShowInterestSchool"     "ParentsSupportUpset"          
 [39] "ParentsKnowWhereAfterSchool"   "ParentsEnsureSeeDoctor"       
 [41] "SatisfiedHouse"                "SatisfiedPeopleLiveWith"      
 [43] "SatisfiedOtherFamily"          "SatisfiedFamilyLife"          
 [45] "FrequencyFamilyTalk"           "FrequencyFamilyFun"           
 [47] "FrequencyFamilyLearn"          "FrequencyPocketMoney"         
 [49] "HaveGoodClothes"               "HaveAccessComputer"           
 [51] "HaveAccessInternet"            "HaveMobilePhone"              
 [53] "HaveOwnRoom"                   "HaveBooks"                    
 [55] "HaveCar"                       "HaveMusic"                    
 [57] "HaveTV"                        "SatisfiedThingsHave"          
 [59] "NumberAdultsPaidJob"           "FamilyMoneyChange"            
 [61] "ParentsSpendMoney"             "AdultsTalkMoneyProblems"      
 [63] "FriendsNice"                   "FriendsEnough"                
 [65] "SatisfiedFriends"              "SatisfiedPeopleArea"          
 [67] "SatisfiedRelationshipsGeneral" "FrequencyFriendsTalk"         
 [69] "FrequencyFriendsFun"           "FrequencyFriendsStudy"        
 [71] "AreaPlacesToPlay"              "AreaSafeWalk"                 
 [73] "SatisfiedPolice"               "SatisfiedDoctors"             
 [75] "SatisfiedOutdoorAreas"         "SatisfiedAreaGeneral"         
 [77] "TeachersListen"                "LikeSchool"                   
 [79] "TeachersFair"                  "SchoolSafe"                   
 [81] "FrequencyPeersHit"             "FrequencyPeersExclude"        
 [83] "SatisfiedClassmates"           "SatisfiedSchoolMarks"         
 [85] "SatisfiedSchoolExperience"     "SatisfiedLifeAsStudent"       
 [87] "SatisfiedThingsLearned"        "SatisfiedTeachers"            
 [89] "FrequencyClasses"              "FrequencyOrganisedLeisure"    
 [91] "FrequencyReadFun"              "FrequencyHelpHousework"       
 [93] "FrequencyHomework"             "FrequencyWatchTV"             
 [95] "FrequencySportsExercise"       "FrequencyUseComputer"         
 [97] "FrequencyByMyself"             "FrequencyCareForFamily"       
 [99] "SatisfiedTimeUse"              "SatisfiedFreedom"             
[101] "SatisfiedOpportunities"        "SatisfiedHealth"              
[103] "SatisfiedAppearance"           "SatisfiedBody"                
[105] "SatisfiedFreeTime"             "SatisfiedListenedTo"          
[107] "SatisfiedSelfConfidence"       "SatisfiedLifeAsWhole"         
[109] "PastYearMovedHouse"            "PastYearChangedArea"          
[111] "PastYearChangedSchool"         "PastYearOtherCountry"         
[113] "PastYearLiveSameAdults"        "SatisfiedSafety"              
[115] "SatisfiedThingsGoodAt"         "SatisfiedThingsAwayFromHome"  
[117] "SatisfiedLaterInLife"          "SatisfiedChoice"              
[119] "LifeGoingWell"                 "LifeJustRight"                
[121] "HaveGoodLife"                  "HaveWhatWant"                 
[123] "ThingsLifeExcellent"           "KnowRights"                   
[125] "HeardOfUNCRC"                  "AdultsRespectChildRights"     
[127] "LikeWayIAm"                    "ManageResponsibilities"       
[129] "PeopleFriendly"                "EnoughChoiceTime"             
[131] "LearningALot"                  "KnowWhereLifeGoing"           
[133] "FeelLonely"                    "FeelPositiveFuture"           
[135] "FeelingSatisfied"              "FeelingHappy"                 
[137] "FeelingRelaxed"                "FeelingActive"                
[139] "FeelingCalm"                   "FeelingFullOfEnergy"

Notice that in the design.PSU Stratum 2 has 7 PSUs, but in the actual.PSU has 6 PSUs.

This is because one primary sampling unit was lost in the sub-setting, with insufficient observations remaining in the result.

If there were any “lonely” primary sampling units (one’s that only have one case left after sub-setting) we can adjust the “pre-sets” to allow for adjustment (get more into this later!).

Code

options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)

Code

gtsummary::tbl_svysummary(design.subset, include = c("FrequencyWatchTV", "FrequencySportsExercise"), by = FrequencyWatchTV) %>%
  modify_caption("Cross tabulation of time use by 12 year boys in England") %>% add_p()%>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied") %>%
  modify_header(label ~ "FrequencyWatchTV")

Cross tabulation of time use by 12 year boys in England
FrequencyWatchTV	About every day, N = 505¹	Less than weekly, N = 34¹	Once or twice a week, N = 104¹	Rarely or never, N = 9¹	p-value²
FrequencySportsExercise					<0.001
About every day	330 (66%)	13 (39%)	43 (43%)	2 (22%)
Less than weekly	14 (2.8%)	9 (25%)	8 (7.6%)	0 (0%)
Once or twice a week	137 (27%)	9 (26%)	44 (44%)	6 (72%)
Rarely or never	19 (3.8%)	3 (10%)	6 (6.1%)	1 (6.1%)
Unknown	5	0	2	0
¹ Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
² chi-squared test with Rao & Scott’s second-order correction

Task 5

Now, try to subset your dataset by your own conditions.

Create a new sub-setted survey design object
Create a gtsummary table using your sub-setted data
Ensure to properly label your table to reflect the restricted analysis
Spend time perfecting your table, if you wish. What labels can you add? Descriptive statistics?

Code

## write your own code!

Next: Go to Part 1.3: Survey adjusted plots

Footnotes

This is the name of our data.frame in our R environment. It is important that it contains the entire dataset from which the weights were calculated, not a subset where cases may be missing.↩︎