Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. We will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the data

data(yrbss)
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15...
## $ gender                   <chr> "female", "female", "female", "female", "f...
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9...
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "n...
## $ race                     <chr> "Black or African American", "Black or Afr...
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88...
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71....
## $ helmet_12m               <chr> "never", "never", "never", "never", "did n...
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did ...
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, ...
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5...
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, ...
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "...
skim(yrbss)
Table 1: Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 <U+2581><U+2582><U+2585><U+2585><U+2587>
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 <U+2581><U+2585><U+2587><U+2583><U+2581>
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 <U+2586><U+2587><U+2582><U+2581><U+2581>
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 <U+2586><U+2582><U+2585><U+2583><U+2587>
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 <U+2587><U+2582><U+2585><U+2582><U+2585>

Before we carry on with the analysis, we check the dataset with skimr::skim() to get a feel for missing values, summary statistics of numerical variables, and a very rough histogram. As a result, we see that the variable “race” has the most missing values (2805).

Exploratory Data Analysis

We start analyzing the weight of participants in kilograms. Using visualization and summary statistics, we describe the distribution of weights and get visibility into the missing observations for the variable “weight”

weight <- yrbss %>% #we take the dataset and assign it to "weight"
  select(weight) %>% #use the select() function to choose the weight variable
  filter(!is.na(weight)) %>% #we filter out the missing data
  summarise(mean=mean(weight), #calculate the summary statistics of weight variable
            SD=sd(weight),
            median=median(weight),
            min=min(weight),
            max=max(weight))

# show summary statistics
weight%>%
  kbl()%>%
  kable_styling()
mean SD median min max
67.9 16.9 64.4 29.9 181
ggplot(yrbss,aes(x=weight)) + #graph the density plot of weights
  geom_density(fill="light blue",color="blue")+ 
  labs(title="Distribution of weights has positive skew")+
  xlab("Weight in kg")+ 
  theme_bw()+
  NULL

Next, consider the possible relationship between a high schooler’s weight and their physical activity. We create a new variable physical_3plus, which will be yes if they are physically active for at least 3 days a week, and no otherwise.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no")) #introducing a new variable

yrbss %>% filter(!is.na(physical_3plus)) %>%#filtering out missing values for 'physical_3plus' variable
  group_by(physical_3plus) %>% #grouping to calculate proportions
  summarise(count = n()) %>%#counting the results
  mutate(prop= count/sum(count)) #introducing and calculating proportions 
## # A tibble: 2 x 3
##   physical_3plus count  prop
##   <chr>          <int> <dbl>
## 1 no              4404 0.331
## 2 yes             8906 0.669

Then, we provide a 95% confidence interval for the population proportion of high schools that are NOT active 3 or more days per week, and also visualize the CI.

not_active <- yrbss %>% 
  specify(response = physical_3plus, success="no") %>% 
  generate(reps = 1000, type = "bootstrap") %>% #using bootsrapping to calculate confidence intervals
  calculate(stat = "prop") 

CI_95 <- not_active %>%
  get_ci(level = 0.95, type = "percentile") #calculating 95% CI

CI_95%>%
  kbl()%>%
  kable_styling()
lower_ci upper_ci
0.323 0.339
visualize(not_active) + #plotting distribution for not_active
  shade_ci(endpoints = CI_95,fill = "blue", alpha=0.2)+
  geom_vline(xintercept = CI_95$lower_ci, colour = "red")+
  geom_vline(xintercept = CI_95$upper_ci, colour = "red")+
  theme_stata() #choosing the theme

We Make a boxplot of physical_3plus vs. weight to see if there is a relationship between these two variables. We were expecting to see some relationship as phyisical activity impacts weight - more physical activity is usually expected to result in weight loss.

yrbss1<-yrbss%>%
  filter(physical_3plus!="NA") #filtering out observations that have missing values of 'physical_3plus'

ggplot(yrbss1,aes(x=physical_3plus,y=weight))+
  geom_boxplot(fill="light blue",color="blue")+
  labs(title ="More physical activity ties to lower weight",
       subtitle = "Relationship between having physical activity more than 3 days a week and weight",
       x="Physically active for at least 3 days a week?",
       y="Weight in kg")+
  theme_bw()+
  NULL

The results are partly counterintuitive, as the boxplot graph shows that the median weight of those who workout 3 or more times a week is slightly higher than the median of those who don’t. However, we have to keep in mind that in the group “No” we have data not only for those who do not work out at all, but also people who work out less than 3 times a week. A rationale for the higher median weight could be that people who work out more than 3 times a week have a goal of building more muscle vs. losing weight. Hence, higher median weight. To further explain the intuition behind it, we would need to at least split the “No” group into observations that are physically active 1-2 days a week, and those who are not active at all.

The intuitive part is, however, that working out 3 or more days eliminates the outliers who have 150+ kg weight in the other group of observations.

Confidence Interval

Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test. Note that when we calculate the mean/SD, etc weight in these groups using the mean function, we must ignore any missing values by setting the na.rm = TRUE.

yrbss %>%
  group_by(physical_3plus) %>%
  filter(!is.na(physical_3plus)) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            sd_weight = sd(weight, na.rm=TRUE),
            count = n(),
            se_weight = sd_weight/sqrt(count),
            t_critical = qt(0.975, count-1), #calculating statistics for 95% CI
            margin_of_error = t_critical * se_weight,
            lower = mean_weight - t_critical * se_weight, #calculating CI
            upper = mean_weight + t_critical * se_weight) # calculating CI
## # A tibble: 2 x 9
##   physical_3plus mean_weight sd_weight count se_weight t_critical
##   <chr>                <dbl>     <dbl> <int>     <dbl>      <dbl>
## 1 no                    66.7      17.6  4404     0.266       1.96
## 2 yes                   68.4      16.5  8906     0.175       1.96
## # ... with 3 more variables: margin_of_error <dbl>, lower <dbl>, upper <dbl>

There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is at least 95% statistically significant. Let us also conduct a hypothesis test.

Hypothesis test with formula

Write the null and alternative hypotheses for testing whether mean weights are different for those who exercise at least times a week and those who don’t.

t.test(weight ~ physical_3plus, data = yrbss)
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.42 -1.12
## sample estimates:
##  mean in group no mean in group yes 
##              66.7              68.4

Hypothesis test with infer

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

But first, we need to initialize the test, which we will save as obs_diff.

H0: difference in means of students who are physically active 3 or more days a week and students who are not is equal to 0 H1: difference in means of students who are physically active 3 or more days a week and students who are not is not equal to 0

obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

After you have initialized the test, we need to simulate the test on the null distribution, which we will save as null.

null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram(fill="light blue",color="blue")+
  labs(title="The null distribution is normal")+
  theme_bw()+
  NULL

Now that the test is initialized and the null distribution formed, we can visualise to see how many of these null permutations have a difference of at least obs_stat of 1.77?

We can also calculate the p-value for the hypothesis test using the function infer::get_p_value().

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

To sum up, we initially got the feel of the data by plotting it with boxplot and saw the difference in the two groups. To validate the difference of mean weight in the two groups, we performed hypothesis testing and as a result rejected the null hypothesis. There is enough evidence to suggest that there is statistically significant difference in the mean weight of students who are physically active 3 or more days a week and the mean weight of students who are not.