1.2: Survey adjusted descriptive statistics

This section covers: creating a survey design object, adjusted descriptive statistics and tables!


Load data

Remember that we saved the wrangled dataset to the working directory in the previous section as an RData file in our workshop folder.

Code
root <- "C://Users//s1769862//OneDrive - University of Edinburgh//SLTF-workshop-August2024//" # change with where the folder lies in your filepath
cws <- readRDS(file= file.path(root, "Data", "cws.RData"))

Sampling design

How do we handling the sampling design of this survey? First, we read the Data Documentation to learn more about the methodology and any population correcting weighting that we can apply.

Take a minute to search for this information on your own. What clues can you draw out?


Looking at the documentation, some key elements appear to be….

1.2 Sampling strategy and outcomes (pg. 4)

“The England sample was designed to achieve a nationally representative sample of children in school….. The primary sampling unit was schools. Separate samples were drawn for Years 4 and 6 (primary school education) and Year 8 (secondary school education). Both samples followed the same methodology. First, a complete list of schools in England was stratified into five groups by the proportion of children receiving free school meals (a very rough indicator of economic prosperity).”

“These groups were approximate quintiles (based on numbers of pupils in each stratum). The approximation was because of a lack of precision in the data available on free school meal entitlement. Within each stratum schools were selected randomly with probabilities proportional to size (number of pupils), with the aim of achieving a target of at least eight schools per stratum. Within each selected school, one class group (not grouped on pupil ability) was randomly selected.”


Reading this paragraph, we can see that this dataset has…

  1. England divided into five stratum
  2. Schools are the primary sampling unit within each stratum
  3. Individual weights to readjust for sample representativeness

From this information, we can create what is called a “survey design object”. This is where the survey design information is encoded into our R working environment and can be called into other statistical functions.

Note

We can also look at the Survey package documentation for more details and help using ?? survey or help(svydesign) into the console

The “ids” indicates the primary sampling unit (PSU). This can be thought of as the smallest unit which individuals are sampled from.

Formula for cluster sampling probabilities.

How the population was originally divided.

Finite population correction.

The individual probability of selection.

The name of the data frame in R working memory.

In our dataset the column names for these are:

  • ids: SchoolRef
  • strata: Strata
  • weights: Weight
  • data: cws 1.

We do not have a finite population correction or cluster sampling probability information, so we leave these fields blank.

Tip

We also include “check.strata = TRUE” as a sense check to ensure that our primary sampling units (the schools) are all nested within our strata. This should return nothing if all is well.

Code
design1 <- survey::svydesign(ids = ~SchoolRef, strata =~Strata, weights = ~Weight, data =cws, check.strata=T)

We now have a survey design object which stores both our original dataset and the encoded information about its sampling and weighting.

Descriptive statistics

Now, we can start to use the survey design object to display statistics that are adjusted for the sampling design and weighting.

Code
survey::svymean(~LifeGoingWell, design1, na.rm=TRUE)
                mean     SE
LifeGoingWell 8.3951 0.1027

Let’s compare these to the unweighted version, using “base” r.

How does accounting for the sampling design and population weighting change our estimates?

Code
# estimate un-weighted mean and sd 
base::mean(cws$LifeGoingWell, na.rm=T) # mean
[1] 8.451588
Code
## compare with survey weighted estimates
(survey::svymean(~LifeGoingWell, design1, na.rm=T)) - (mean(cws$LifeGoingWell, na.rm=T)) # can see a modest correction in the estimate
                   mean     SE
LifeGoingWell -0.056477 0.1027

Comparing the two estimates, we can see that the estimate for “life going well” is slightly lower for the population estimates compared to the face-value calculation.

Variance and standard error

We can also see that the standard error around the mean is larger for the survey weighted result. Therefore, our error estimates would be underestimated if we assumed a simple random sample and ignored the sampling design.

Code
sd((cws$LifeGoingWell), na.rm=TRUE)/sqrt(length(!is.na(cws$LifeGoingWell))) # standard error
[1] 0.05800934
Code
survey::svyvar(~LifeGoingWell, design1, na = TRUE) # survey weighted estimate
              variance     SE
LifeGoingWell   4.6378 0.4703

Deciles

Unfortunately, the decile results do not display well in the .quarto website output. However, I recommend that you run these lines on your own machine to see the raw output

Code
survey::svyquantile(~LifeGoingWell, design1, na = TRUE, c(.25,.5,.75),ci=TRUE)
Code
survey::svyquantile(~LifeGoingWell, design1, na = TRUE, c(.20,.40,.60,.80),ci=TRUE)

Tables

We can also create adjusted tables with the survey package.

Code
survey::svytable(~LifeGoingWell, design1)
LifeGoingWell
         0          1          2          3          4          5          6 
 11.810352   4.225975  16.652450  26.043925  42.924747  43.591784  60.086619 
         7          8          9         10 
101.180865 166.925511 241.002194 572.818609 

We can see how these estimates differ from the unweighted estimates provided in from the base table estimates.

Code
base::table(cws$LifeGoingWell)

  0   1   2   3   4   5   6   7   8   9  10 
 10   6  14  21  41  48  61  95 166 239 590 

Tables are helpful for seeing the weighting option in our survey design object at work. However, they are not very pretty! This becomes more of a problem for larger or more complex tables, e.g. cross tabulations.

We also often want to present multiple types of descriptive statistics, e.g. the adjusted variance or other attributes of a distribution, within the same table.

For this, we need a better table package.

Tables with gtsummary

We can improve descriptive statistics summaries by utilising table packages.

The GT framework has a good set of table output options for descriptive statistics, bivariate and regression results.

The the gtsummary package is a wrapper for survey that was inspired by GT but makes some handy decisions and incorporates the survey design into the descriptive statistics reporting.

Code
gtsummary::tbl_svysummary(design1,
                          include = c("FrequencyHelpHousework", "HeardOfUNCRC"),
                          missing_text = "NA")

We can also make tables for continuous variables

Code
gtsummary::tbl_svysummary(design1,
                          include = c("FeelPositiveFuture", "FeelingHappy", "PeopleFriendly", "LifeJustRight"),
                          statistic = list(all_continuous() ~ "{mean} ({sd})"),
                          missing = "no")
Characteristic N = 1,3191
I feel positive about my future 8.20 (2.32)
Last two weeks: How often feeling happy 8.34 (2.40)
People are generally pretty friendly towards me 8.23 (2.37)
My life is just right 8.14 (2.47)
1 Mean (SD)

Summary table example 2

Here is another example comparing the weighted and unweighted summary tables for changes in household spending from the 12-year-olds’ perspectives.

First, a table using base R.

Code
base::table(cws$FamilyMoneyChange, exclude=F) 

   1    2    3 <NA> 
 425  316  134  444 
Code
gtsummary::tbl_summary(cws, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1)))
Characteristic N = 1,3191
Q15 Compared to a year ago, how much money does your family have now?
    1 425 (48.6%)
    2 316 (36.1%)
    3 134 (15.3%)
    Unknown 444
1 n (%)

If we wish to have the categorical responses listed, rather than factor numeric values, we need to change the dataset itself.

Code
cws <- cws %>%
  mutate(FamilyMoneyChange = as.factor(case_when(
    FamilyMoneyChange == 1 ~ "We have more money than a year ago",
    FamilyMoneyChange == 2 ~ "We have about the same as a year ago",
    FamilyMoneyChange == 3 ~ "We have less money than a year ago")))


gtsummary::tbl_summary(cws, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Unweighted frequences for 12-year-olds perceptions' of family finances") %>% modify_footnote(update = everything() ~ "Data Source: 2013/14 Childrens World Survey, England")
Unweighted frequences for 12-year-olds perceptions’ of family finances
Characteristic1 N = 1,3191
FamilyMoneyChange
    We have about the same as a year ago 316 (36.1%)
    We have less money than a year ago 134 (15.3%)
    We have more money than a year ago 425 (48.6%)
    Unknown 444
1 Data Source: 2013/14 Childrens World Survey, England

If we have changed anything in the original dataset, such as recoding a variable (even to just make it more clearly labelled!) this will not automatically updated in the survey design object. If we wish to update this, we need to recreate it with the updated dataset.

Tip

It is often best practice to complete all data manipulation (e.g. recodes) before setting the survey design object.

Code
design1 <- survey::svydesign(ids = ~SchoolRef, strata =~Strata, weights = ~Weight, data =cws, check.strata=T)

Now, if we run the same code as the previous chunk, it will updated to have the character names for our categorical variable encoded in the table.

Code
gtsummary::tbl_svysummary(design1, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Weighted frequences for 12-year-olds perceptions' of family finances") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")
Weighted frequences for 12-year-olds perceptions’ of family finances
Characteristic N = 1,3191
FamilyMoneyChange
    We have about the same as a year ago 339 (38.0%)
    We have less money than a year ago 143 (16.1%)
    We have more money than a year ago 410 (46.0%)
    Unknown 427
1 Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
NA column

Notice that the “number” of missing values has changed between the unweighted, base R table and the gtsummary table. This is because the distribution of item level non response are associated with different weightings. Larger NA estimates in the weighted tables indicate that observations which did not offer a valid response to the question have higher weighting, on average, resulting in enlarged influence of this category in the nationally representative statistics

Task 3

Produce a weighted frequency table of a variable of interest.

Now, try it yourself! Create survey weighted summary statistics for a different variables in the dataset.

Hint

Look up the data documentation with help(tbl_svysymmary)

  • What settings can you play around with?
  • How can you use this tool to explore your dataset?
Code
## enter your own code here



## hint: copy segments from previous code chunks

A possible solution can be viewed below:

Code
gtsummary::tbl_svysummary(design1, include = c("FamilyMoneyChange"), digits = list(all_categorical() ~ c(0, 1))) %>%
  modify_caption("Weighted frequences for 12-year-olds perceptions' of family finances") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")
Weighted frequences for 12-year-olds perceptions’ of family finances
Characteristic N = 1,3191
FamilyMoneyChange
    We have about the same as a year ago 339 (38.0%)
    We have less money than a year ago 143 (16.1%)
    We have more money than a year ago 410 (46.0%)
    Unknown 427
1 Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied

Cross tabulations

We can also calculate cross tabulations which take into account our survey design corrections and weighting by utilising the “by” option in tbl_svysummary

Code
gtsummary::tbl_svysummary(design1,
                          include = c("FeelPositiveFuture", "FeelingHappy", "PeopleFriendly", "LifeJustRight", "FamilyMoneyChange"),
                          by = FamilyMoneyChange,
                          statistic = list(all_continuous() ~ "{mean} ({sd})"))
Characteristic We have about the same as a year ago, N = 3391 We have less money than a year ago, N = 1431 We have more money than a year ago, N = 4101
I feel positive about my future 8.25 (2.11) 7.30 (2.68) 8.67 (2.12)
    Unknown 14 3 9
Last two weeks: How often feeling happy 8.38 (2.12) 7.21 (2.85) 8.79 (2.11)
    Unknown 4 4 16
People are generally pretty friendly towards me 8.32 (2.13) 7.30 (2.88) 8.69 (2.03)
    Unknown 9 0 3
My life is just right 8.34 (2.04) 7.05 (2.84) 8.47 (2.37)
    Unknown 3 2 14
1 Mean (SD)

If we use two categorical variables instead of one categorical and one “continuous” (an ordinal Likert scale, in this case), the cross table will also compute a \(x^2\) statistic when we do the add_p() option.

Code
gtsummary::tbl_svysummary(design1, include = c("FrequencyWatchTV", "FrequencySportsExercise"), by = FrequencyWatchTV) %>%
  modify_caption("Cross tabulation of time use by 12 year olds in England") %>%
  add_p()%>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied") %>%
  modify_header(label ~ "FrequencyWatchTV")
Cross tabulation of time use by 12 year olds in England
FrequencyWatchTV About every day, N = 1,0561 Less than weekly, N = 461 Once or twice a week, N = 1621 Rarely or never, N = 161 p-value2
FrequencySportsExercise



<0.001
    About every day 574 (55%) 16 (34%) 64 (41%) 5 (29%)
    Less than weekly 58 (5.6%) 13 (29%) 18 (12%) 0 (0%)
    Once or twice a week 345 (33%) 11 (25%) 67 (42%) 9 (58%)
    Rarely or never 60 (5.8%) 6 (13%) 9 (5.6%) 2 (13%)
    Unknown 20 0 4 0
1 Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
2 chi-squared test with Rao & Scott’s second-order correction

We can produce a variety of bivariate test results within our gtsummary tables, including t-tests.

A full list can be found here

Code
gtsummary::tbl_svysummary(design1,
                          include = c("Gender", "SatisfiedAppearance"),
                          by = Gender) %>%
  modify_caption("Cross tabulation of satisfaction with personal appearance from 12 year olds in England, by child's gender") %>%
  add_p(test = all_continuous() ~ "svy.t.test") %>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied")
Cross tabulation of satisfaction with personal appearance from 12 year olds in England, by child’s gender
Characteristic Boy, N = 6731 Girl, N = 6341 p-value2
Satisfaction with: The way that you look 9.0 (7.0, 10.0) 7.0 (5.0, 9.0) <0.001
    Unknown 11 13
1 Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
2 t-test adapted to complex survey samples

Task 4

Produce a weighted cross tabulation of two variables of interest and include an appropriate inferential test for the type of variable.

Code
## your code goes here
Summary tips
  1. Read the dataset documentation clearly to identify all elements of the sampling design provided. Ensure there are no missing values for any columns with design information.

  2. If we make any changes to our data set we need to update the survey design object.

  3. The gtsummary package is great, but the syntax is confusing at first! Get used to spending time reading its helpful data documentation.

Sub-setting

Previously we discussed that it is important to not alter the number of cases (in this case, individuals in the dataset) from the full number utilised by the data managers to calculate the survey weights.

What happens when we change our dataset or work with a sub-sample?

We may wish to restrict our data frame due to..

Set of columns Sub-sample
Different number of columns Different number of units
Need to specify new dataset Need to specify the reduction in units
A matter of code change A matter of mathematical change

We already demonstrated the latter in a previous example. Now, we explore how to conduct sub-sample analyses.

Sub-sample analysis: by gender

We create a new design object which is a subset of our original object by a condition of our choice

Code
design.subset <- subset(design1, Gender == "Boy")

We can compare the number of primary sampling units (in this case schools!) between our original design and the subset

Code
summary(design.subset)
Stratified 1 - level Cluster Sampling design (with replacement)
With (33) clusters.
subset(design1, Gender == "Boy")
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2008  0.6842  1.2032  2.0543  4.2270  5.6452 
Stratum Sizes: 
             1  2   3   4   5
obs        246 98 110 106 110
design.PSU   8  7   7   6   7
actual.PSU   7  6   7   6   7
Data variables:
  [1] "QRef"                          "SchoolRef"                    
  [3] "Strata"                        "Weight"                       
  [5] "Age"                           "Gender"                       
  [7] "BornThisCountry"               "NHomes"                       
  [9] "HomeType"                      "FirstHomeMother"              
 [11] "FirstHomeFather"               "FirstHomeMPartner"            
 [13] "FirstHomeFPartner"             "FirstHomeGrandmother"         
 [15] "FirstHomeGrandfather"          "FirstHomeSiblings"            
 [17] "FirstHomeOtherChildren"        "FirstHomeOtherAdults"         
 [19] "SecondHomeMother"              "SecondHomeFather"             
 [21] "SecondHomeMPartner"            "SecondHomeFPartner"           
 [23] "SecondHomeGrandmother"         "SecondHomeGrandfather"        
 [25] "SecondHomeSiblings"            "SecondHomeOtherChildren"      
 [27] "SecondHomeOtherAdults"         "HomeSafe"                     
 [29] "HomePlaceToStudy"              "ParentsListen"                
 [31] "FamilyGoodTimeTogether"        "ParentsTreatFairly"           
 [33] "ParentsHelpProblems"           "ParentsAskWhereGoing"         
 [35] "ParentsKeepTrackSchool"        "ParentsRemindWashShower"      
 [37] "ParentsShowInterestSchool"     "ParentsSupportUpset"          
 [39] "ParentsKnowWhereAfterSchool"   "ParentsEnsureSeeDoctor"       
 [41] "SatisfiedHouse"                "SatisfiedPeopleLiveWith"      
 [43] "SatisfiedOtherFamily"          "SatisfiedFamilyLife"          
 [45] "FrequencyFamilyTalk"           "FrequencyFamilyFun"           
 [47] "FrequencyFamilyLearn"          "FrequencyPocketMoney"         
 [49] "HaveGoodClothes"               "HaveAccessComputer"           
 [51] "HaveAccessInternet"            "HaveMobilePhone"              
 [53] "HaveOwnRoom"                   "HaveBooks"                    
 [55] "HaveCar"                       "HaveMusic"                    
 [57] "HaveTV"                        "SatisfiedThingsHave"          
 [59] "NumberAdultsPaidJob"           "FamilyMoneyChange"            
 [61] "ParentsSpendMoney"             "AdultsTalkMoneyProblems"      
 [63] "FriendsNice"                   "FriendsEnough"                
 [65] "SatisfiedFriends"              "SatisfiedPeopleArea"          
 [67] "SatisfiedRelationshipsGeneral" "FrequencyFriendsTalk"         
 [69] "FrequencyFriendsFun"           "FrequencyFriendsStudy"        
 [71] "AreaPlacesToPlay"              "AreaSafeWalk"                 
 [73] "SatisfiedPolice"               "SatisfiedDoctors"             
 [75] "SatisfiedOutdoorAreas"         "SatisfiedAreaGeneral"         
 [77] "TeachersListen"                "LikeSchool"                   
 [79] "TeachersFair"                  "SchoolSafe"                   
 [81] "FrequencyPeersHit"             "FrequencyPeersExclude"        
 [83] "SatisfiedClassmates"           "SatisfiedSchoolMarks"         
 [85] "SatisfiedSchoolExperience"     "SatisfiedLifeAsStudent"       
 [87] "SatisfiedThingsLearned"        "SatisfiedTeachers"            
 [89] "FrequencyClasses"              "FrequencyOrganisedLeisure"    
 [91] "FrequencyReadFun"              "FrequencyHelpHousework"       
 [93] "FrequencyHomework"             "FrequencyWatchTV"             
 [95] "FrequencySportsExercise"       "FrequencyUseComputer"         
 [97] "FrequencyByMyself"             "FrequencyCareForFamily"       
 [99] "SatisfiedTimeUse"              "SatisfiedFreedom"             
[101] "SatisfiedOpportunities"        "SatisfiedHealth"              
[103] "SatisfiedAppearance"           "SatisfiedBody"                
[105] "SatisfiedFreeTime"             "SatisfiedListenedTo"          
[107] "SatisfiedSelfConfidence"       "SatisfiedLifeAsWhole"         
[109] "PastYearMovedHouse"            "PastYearChangedArea"          
[111] "PastYearChangedSchool"         "PastYearOtherCountry"         
[113] "PastYearLiveSameAdults"        "SatisfiedSafety"              
[115] "SatisfiedThingsGoodAt"         "SatisfiedThingsAwayFromHome"  
[117] "SatisfiedLaterInLife"          "SatisfiedChoice"              
[119] "LifeGoingWell"                 "LifeJustRight"                
[121] "HaveGoodLife"                  "HaveWhatWant"                 
[123] "ThingsLifeExcellent"           "KnowRights"                   
[125] "HeardOfUNCRC"                  "AdultsRespectChildRights"     
[127] "LikeWayIAm"                    "ManageResponsibilities"       
[129] "PeopleFriendly"                "EnoughChoiceTime"             
[131] "LearningALot"                  "KnowWhereLifeGoing"           
[133] "FeelLonely"                    "FeelPositiveFuture"           
[135] "FeelingSatisfied"              "FeelingHappy"                 
[137] "FeelingRelaxed"                "FeelingActive"                
[139] "FeelingCalm"                   "FeelingFullOfEnergy"          

Notice that in the design.PSU Stratum 2 has 7 PSUs, but in the actual.PSU has 6 PSUs.

This is because one primary sampling unit was lost in the sub-setting, with insufficient observations remaining in the result.

If there were any “lonely” primary sampling units (one’s that only have one case left after sub-setting) we can adjust the “pre-sets” to allow for adjustment (get more into this later!).

Code
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)
Code
gtsummary::tbl_svysummary(design.subset, include = c("FrequencyWatchTV", "FrequencySportsExercise"), by = FrequencyWatchTV) %>%
  modify_caption("Cross tabulation of time use by 12 year boys in England") %>% add_p()%>%
  modify_footnote(all_stat_cols() ~ "Data Source: 2013/14 Children's World Survey, England; Survey weighting applied") %>%
  modify_header(label ~ "FrequencyWatchTV")
Cross tabulation of time use by 12 year boys in England
FrequencyWatchTV About every day, N = 5051 Less than weekly, N = 341 Once or twice a week, N = 1041 Rarely or never, N = 91 p-value2
FrequencySportsExercise



<0.001
    About every day 330 (66%) 13 (39%) 43 (43%) 2 (22%)
    Less than weekly 14 (2.8%) 9 (25%) 8 (7.6%) 0 (0%)
    Once or twice a week 137 (27%) 9 (26%) 44 (44%) 6 (72%)
    Rarely or never 19 (3.8%) 3 (10%) 6 (6.1%) 1 (6.1%)
    Unknown 5 0 2 0
1 Data Source: 2013/14 Children’s World Survey, England; Survey weighting applied
2 chi-squared test with Rao & Scott’s second-order correction

Task 5

Now, try to subset your dataset by your own conditions.

  • Create a new sub-setted survey design object
  • Create a gtsummary table using your sub-setted data
  • Ensure to properly label your table to reflect the restricted analysis
  • Spend time perfecting your table, if you wish. What labels can you add? Descriptive statistics?
Code
## write your own code!

Next: Go to Part 1.3: Survey adjusted plots

Footnotes

  1. This is the name of our data.frame in our R environment. It is important that it contains the entire dataset from which the weights were calculated, not a subset where cases may be missing.↩︎