cdcdata-exercise

Exploring CDC Data: Vaccination Coverage Among Pregnant Women

Description of the data

I found this data set from the CDC’s data log under the pregnancy and vaccination section. The data was provided by the National Center for Immunization and Respiratory Diseases (NCIRD), and the information was collected through the Pregnancy Risk Assessment Monitoring System (PRAMS). Information includes vaccination rates at the state level for influenza and tetanus toxoid, reduced diptheria toxoid, and aceullar pertussis (Tdap) for women who recently gave birth. Variables include vaccine type, geography, survey year/influenza season, dimension (age or race), and estimation of vaccination coverage with 95% confidence interval. It was most recently updated on December 15, 2023 and contains 4,379 observations of 9 variables. Link: https://data.cdc.gov/Pregnancy-Vaccination/Vaccination-Coverage-among-Pregnant-Women/h7pm-wmjc/about_data

Read and load the data

I loaded two packages for this step: readr and here. I called the data ‘pregvacc’. After reading and loading the data, I confirmed that there were 4,379 observations of 9 variables.

## load packages
library(readr)
library(here)

here() starts at /Users/taylorglass/Documents/MADA /taylorglass-MADA-portfolio

##read in data and find dimensions
pregvacc <- read_csv(here("cdcdata-exercise", "Vaccination_Coverage_among_Pregnant_Women_20240205.csv"))

Rows: 4379 Columns: 9

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Vaccine, Geography Type, Geography, Dimension Type, Dimension, Esti...
dbl (2): Survey Year/Influenza Season, Sample Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(pregvacc)

[1] 4379    9

Processing the data

Using the naniar package to discover missingness of the data shows that this data set is pretty clean. Sample size is the only variable with missing data, and it has 215 missing observations. Sample size is an important variable for determining the reliability of the vaccination coverage estimation percentages. The ‘data quality’ section of the CDC page for this data set notes that observations marked with an asterisk are unreliable due to a sample size of less than 30. After exploring the variable, the minimum value is exactly 30, so none of the estimates are less than 30. I feel confident in removing this variable from the data set because all of the vaccination coverage estimates with complete data are reliable based on the sample size variable.

## load packages
library(naniar)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)

## explore missingness of the data 
gg_miss_var(pregvacc)

sum(is.na(pregvacc$`Sample Size`))

[1] 215

summary(pregvacc$`Sample Size`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   30.0   167.0   320.5   797.2   655.2 43737.0     215

## create new dataset without the sample size variable
pregvacc2 <- pregvacc %>% select(-`Sample Size`)

Using the head() function, I saw “NR” in the estimate of vaccination coverage and confidence interval variables, which also represents missing data. With assistance from AI tools, I wrote a function to rid the data of any “NR” variables or special characters using the dplyr and stringr packages. I found that there are 442 missing observations of the estimate variable and confidence interval variables. Considering that the data set contains 4,379 observations, only 10% of the observations have a missing estimate value for vaccination coverage. I think it is okay to remove the missing observations in this scenario because each state has multiple observations across several years of PRAMS data. Information about vaccination among pregnant women is included for all 50 states based on various ages and races/ethnicities, so I think it is reasonable to remove 10% of the data set for this analysis.

## examine data again 
head(pregvacc2)

# A tibble: 6 × 8
  Vaccine   `Geography Type` Geography Survey Year/Influenza …¹ `Dimension Type`
  <chr>     <chr>            <chr>                        <dbl> <chr>           
1 Influenza States           Alaska                        2012 Age             
2 Influenza States           Alaska                        2020 Age             
3 Influenza States           Alaska                        2020 Race and Ethnic…
4 Influenza States           Alaska                        2012 Age             
5 Influenza States           Alaska                        2012 Race and Ethnic…
6 Influenza States           Alaska                        2020 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 3 more variables: Dimension <chr>, `Estimate (%)` <chr>, `95% CI (%)` <chr>

## create a function to clean the data
clean_data <- function(data) {
  data[data == 'NR*'] <- NA
  data <- data %>% mutate_if(is.character, str_replace_all, pattern = "[^a-zA-Z0-9\\s]", replacement = "")
  return(data)
}

## use the clean_data function to create the third version of the dataset
pregvacc3 <- clean_data(pregvacc2)

## explore the new NAs created by the clean_data function 
gg_miss_var(pregvacc3)

sum(is.na(pregvacc3$`Estimate (%)`))

[1] 442

sum(is.na(pregvacc3$`95% CI (%)`))

[1] 442

missing <- 442/4379*100
head(pregvacc3)

# A tibble: 6 × 8
  Vaccine   `Geography Type` Geography Survey Year/Influenza …¹ `Dimension Type`
  <chr>     <chr>            <chr>                        <dbl> <chr>           
1 Influenza States           Alaska                        2012 Age             
2 Influenza States           Alaska                        2020 Age             
3 Influenza States           Alaska                        2020 Race and Ethnic…
4 Influenza States           Alaska                        2012 Age             
5 Influenza States           Alaska                        2012 Race and Ethnic…
6 Influenza States           Alaska                        2020 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 3 more variables: Dimension <chr>, `Estimate (%)` <chr>, `95% CI (%)` <chr>

After dropping the NA values, the data set contains 3,937 observations of 8 variables with no missing values. The head() function revealed that the estimate percentage is a character variable, which is incorrect because it should be a numeric variable. I mutated the variable to be numeric and divided by 100, so it will have the typical percentage format.

## create final dataset with no NAs to be used for exploratory analyis 
pregvacc4 <- na.omit(pregvacc3)
dim(pregvacc4)

[1] 3937    8

sum(is.na(pregvacc4))

[1] 0

## check that all variables appear to be complete
head(pregvacc4)

# A tibble: 6 × 8
  Vaccine   `Geography Type` Geography Survey Year/Influenza …¹ `Dimension Type`
  <chr>     <chr>            <chr>                        <dbl> <chr>           
1 Influenza States           Alaska                        2012 Age             
2 Influenza States           Alaska                        2020 Age             
3 Influenza States           Alaska                        2012 Age             
4 Influenza States           Alaska                        2020 Age             
5 Influenza States           Alaska                        2020 Age             
6 Influenza States           Alaska                        2020 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 3 more variables: Dimension <chr>, `Estimate (%)` <chr>, `95% CI (%)` <chr>

tail(pregvacc4)

# A tibble: 6 × 8
  Vaccine `Geography Type` Geography Survey Year/Influenza Se…¹ `Dimension Type`
  <chr>   <chr>            <chr>                          <dbl> <chr>           
1 Tdap    States           Minnesota                       2018 Race and Ethnic…
2 Tdap    States           Minnesota                       2018 Age             
3 Tdap    States           Minnesota                       2019 Age             
4 Tdap    States           Minnesota                       2020 Age             
5 Tdap    States           Minnesota                       2019 Age             
6 Tdap    States           Minnesota                       2018 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 3 more variables: Dimension <chr>, `Estimate (%)` <chr>, `95% CI (%)` <chr>

## clean vaccination estimate variable to follow normal percentage format 
pregvacc4 <- pregvacc4 %>% 
                mutate(`Estimate (%)` = as.numeric(`Estimate (%)`)) %>%
                mutate(`Estimate (%)` = `Estimate (%)` / 100)
class(pregvacc4$`Estimate (%)`) # check if class mutation worked

[1] "numeric"

summary(pregvacc4$`Estimate (%)`) # check if format mutation worked

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.520   5.370   6.380   6.234   7.280   9.960

The 95% confidence interval has the same issue of being a character variable instead of numeric variable, and the upper and lower bounds are separated by the word “to”, which prevents me from directly converting it to numeric. I separated the variable into upper and lower bounds using the separate() function from the tidyr package, so I could convert the variable to the numeric format. I also had to divide the values by 100 once they were numeric, so it would be in a normal percentage format.

## load packages 
library(tidyr)

## separate the confidence interval variable to convert it to numeric in a typical percent format
pregvacc4 <- pregvacc4 %>% 
                separate(`95% CI (%)`, into = c("95% CI Lower", "95% CI Upper"), sep = "to", convert = TRUE)
head(pregvacc4) # check if separation worked

# A tibble: 6 × 9
  Vaccine   `Geography Type` Geography Survey Year/Influenza …¹ `Dimension Type`
  <chr>     <chr>            <chr>                        <dbl> <chr>           
1 Influenza States           Alaska                        2012 Age             
2 Influenza States           Alaska                        2020 Age             
3 Influenza States           Alaska                        2012 Age             
4 Influenza States           Alaska                        2020 Age             
5 Influenza States           Alaska                        2020 Age             
6 Influenza States           Alaska                        2020 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 4 more variables: Dimension <chr>, `Estimate (%)` <dbl>,
#   `95% CI Lower` <dbl>, `95% CI Upper` <int>

pregvacc4 <- pregvacc4 %>% 
                mutate(`95% CI Lower` = as.numeric(`95% CI Lower`) / 100) %>% 
                mutate(`95% CI Upper` = as.numeric(`95% CI Upper`) / 100)
head(pregvacc4) # check if mutation worked

# A tibble: 6 × 9
  Vaccine   `Geography Type` Geography Survey Year/Influenza …¹ `Dimension Type`
  <chr>     <chr>            <chr>                        <dbl> <chr>           
1 Influenza States           Alaska                        2012 Age             
2 Influenza States           Alaska                        2020 Age             
3 Influenza States           Alaska                        2012 Age             
4 Influenza States           Alaska                        2020 Age             
5 Influenza States           Alaska                        2020 Age             
6 Influenza States           Alaska                        2020 Age             
# ℹ abbreviated name: ¹`Survey Year/Influenza Season`
# ℹ 4 more variables: Dimension <chr>, `Estimate (%)` <dbl>,
#   `95% CI Lower` <dbl>, `95% CI Upper` <dbl>

Exploratory/Descriptive Analysis on the clean data

After cleaning and processing the data, the final data set has 3,937 observations of 9 variables. The categorical variables include vaccine, geography type, geography, dimension type, and dimension. The vaccine variable has two categories: influenza and tdap. The influenza vaccine is twice as common as the tdap vaccine considering that it accounts for 66.6% of vaccine observations. I used AI tools to help create a simple table displaying what percent is in each category.

## find basic facts about final clean dataset
dim(pregvacc4)

[1] 3937    9

str(pregvacc4)

tibble [3,937 × 9] (S3: tbl_df/tbl/data.frame)
 $ Vaccine                     : chr [1:3937] "Influenza" "Influenza" "Influenza" "Influenza" ...
 $ Geography Type              : chr [1:3937] "States" "States" "States" "States" ...
 $ Geography                   : chr [1:3937] "Alaska" "Alaska" "Alaska" "Alaska" ...
 $ Survey Year/Influenza Season: num [1:3937] 2012 2020 2012 2020 2020 ...
 $ Dimension Type              : chr [1:3937] "Age" "Age" "Age" "Age" ...
 $ Dimension                   : chr [1:3937] "18 Years" "18 Years" "35 Years" "35 Years" ...
 $ Estimate (%)                : num [1:3937] 4.92 6.4 5.49 6.47 6.27 6.68 4.89 4.83 4.68 5.92 ...
 $ 95% CI Lower                : num [1:3937] 4.53 6.04 4.36 5.6 5.79 5.92 4.18 4.3 4.12 5.35 ...
 $ 95% CI Upper                : num [1:3937] 5.31 6.75 6.58 7.28 6.73 7.38 5.61 5.36 5.24 6.47 ...

## create percentage table with categories of vaccine variable
vacc_table <- table(pregvacc4$Vaccine)
percentage_table <- prop.table(vacc_table) * 100
print(percentage_table)


Influenza      Tdap 
 66.67513  33.32487

Using the code above as a template, I created a summary table for each categorical variable. The geography type variable is 95.9% states, with the other 4.1% of observations being at the national level. The geography categorical variable has 51 options with differentiation between New York City and the rest of New York state and inclusion of Puerto Rico. Wisconsin makes up the highest percentage of the geography variable with 4.06% of observations, and Indiana makes up the lowest percentage at 0.2% of observation. The two dimension types are split almost evenly with 55.65% of observations belonging to age and 44.35% of observations belonging to race and ethnicity. There are 4 categories for age: 18 years, 18-24 years, 25-34 years, and 35 years. Each of these age categories account for about 14% of dimension observations. There are 4 categories for race/ethnicity: Black Non-Hispanic, Hispanic, Other or Multiple Races Non-Hispanic, and White Non-Hispanic. White Non-Hispanic is the largest of the race and ethnicity categories with 13.86% of observations for the dimension variable.

## create percentage table with categories of geography type variable
geotype_table <- table(pregvacc4$`Geography Type`)
percentage_table2 <- prop.table(geotype_table) * 100
print(percentage_table2)


 National    States 
 4.064008 95.935992

## create percentage table with categories of geography variable
geo_table <- table(pregvacc4$Geography)
percentage_table3 <- prop.table(geo_table) * 100
print(percentage_table3)


             Alabama               Alaska              Arizona 
           1.2446025            1.5494031            0.1778004 
            Arkansas             Colorado          Connecticut 
           1.9558039            3.0988062            1.6256033 
            Delaware District of Columbia              Florida 
           4.0132080            0.7112014            0.3556007 
             Georgia               Hawaii             Illinois 
           1.2192024            2.1082042            2.8194056 
             Indiana                 Iowa               Kansas 
           0.2032004            2.3114046            0.8890018 
            Kentucky            Louisiana                Maine 
           0.6096012            2.2606045            1.2954026 
            Maryland        Massachusetts             Michigan 
           1.8288037            3.2512065            2.9464059 
           Minnesota          Mississippi             Missouri 
           1.8288037            1.2192024            3.4798070 
             Montana             Nebraska        New Hampshire 
           1.6256033            2.8448057            1.6510033 
          New Jersey           New Mexico             New York 
           2.0320041            1.9558039            3.2004064 
      North Carolina         North Dakota   NYCity of New York 
           0.6096012            0.8128016            3.4544069 
     NYRest of state                 Ohio             Oklahoma 
           2.7178054            0.5588011            3.0226060 
              Oregon         Pennsylvania          Puerto Rico 
           1.4224028            3.8608077            0.6350013 
        Rhode Island         South Dakota            Tennessee 
           2.0320041            1.0160020            1.1176022 
               Texas        United States                 Utah 
           0.8128016            4.0640081            2.7686055 
             Vermont             Virginia           Washington 
           2.4384049            2.6416053            3.2512065 
       West Virginia            Wisconsin              Wyoming 
           1.1430023            4.0640081            1.2446025

## create percentage table with categories of dimension type variable
dimtype_table <- table(pregvacc4$`Dimension Type`)
percentage_table3 <- prop.table(dimtype_table) * 100
print(percentage_table3)


               Age Race and Ethnicity 
          55.65151           44.34849

## create percentage table with categories of dimension variable
dim_table <- table(pregvacc4$Dimension)
percentage_table4 <- prop.table(dim_table) * 100
print(percentage_table4)


                           18 Years                          1824 Years 
                          14.020828                           13.944628 
                         2534 Years                            35 Years 
                          14.020828                           13.665227 
                  Black NonHispanic                            Hispanic 
                           9.728219                           10.820422 
Other or Multiple Races NonHispanic                   White NonHispanic 
                           9.931420                           13.868428

To explore the continuous variables, I will make plots to see if they are approximately normal. The mean of the vaccination estimate among pregnant women is 6.23%, and the standard deviation is 1.497. I had to convert the estimate variable to a factor to create a bar graph that displayed each of the unique discrete observations. The distribution looks approximately normal, but it is slightly skewed to the left.

## load packages 
library(ggplot2)

## find summary values while variable is numeric 
mean(pregvacc4$`Estimate (%)`)

[1] 6.233548

sd(pregvacc4$`Estimate (%)`)

[1] 1.496931

## convert variable to factor to be visualized
pregvacc4 <- pregvacc4 %>% 
                mutate(`Estimate (%)` = as.factor(`Estimate (%)`))
class(pregvacc4$`Estimate (%)`) ## check if mutation worked

[1] "factor"

## create visualization of the distribution of estimate (%) variable
ggplot(pregvacc4, aes(x=`Estimate (%)`)) + 
    geom_bar(position = "dodge") + 
    labs(x = "Estimate of Vaccination Percentage",
         y= "Frequency")

I would expect the lower and upper bound of the 95% confidence interval (CI) to have the same distribution as the estimate variable because the estimate variable is used to calculate the CI. Before converting the variables to a factor type to create the bar graph, I found the mean and standard deviation for each one while the variable was still numeric. The mean for the lower bound of the 95% CI 5.56%, and the standard deviation was 1.53. The mean for the upper bound of the 95% CI is 6.86%, and the standard deviation is 1.46. After creating two more bar graphs using the variables as factors, my expectations were met, and each distribution for the bounds of the confidence interval are approximately normal with a slight skew to the left.

## find summary values while variable is numeric 
mean(pregvacc4$`95% CI Lower`)

[1] 5.555558

sd(pregvacc4$`95% CI Lower`)

[1] 1.531577

## find summary values while variable is numeric
mean(pregvacc4$`95% CI Upper`)

[1] 6.86397

sd(pregvacc4$`95% CI Upper`)

[1] 1.463774

## convert variable to factor to be visualized 
pregvacc4 <- pregvacc4 %>% 
                mutate(`95% CI Lower` = as.factor(`95% CI Lower`))
class(pregvacc4$`95% CI Lower`) ## check if conversion worked

[1] "factor"

##create visualization of the distribution of 95% CI Lower variable
ggplot(pregvacc4, aes(x=`95% CI Lower`)) + 
    geom_bar(position = "dodge") + 
    labs(x = "Lower Bound of 95% Confidence Interval for Estimate",
         y= "Frequency")

## convert variable to factor to be visualized 
pregvacc4 <- pregvacc4 %>% 
                mutate(`95% CI Upper` = as.factor(`95% CI Upper`))
class(pregvacc4$`95% CI Upper`) ## check if conversion worked

[1] "factor"

##create visualization of the distribution of 95% CI Upper variable
ggplot(pregvacc4, aes(x=`95% CI Upper`)) + 
    geom_bar(position = "dodge") + 
    labs(x = "Upper Bound of 95% Confidence Interval for Estimate",
         y= "Frequency")

Cassia Roth created this section.

First, lets make sure we have all necessary packages loaded.

#Load required packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ purrr     1.0.2
✔ lubridate 1.9.2     ✔ tibble    3.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)


Attaching package: 'skimr'

The following object is masked from 'package:naniar':

    n_complete

library(gtsummary)

Next, I asked ChatGPT the following prompt: Can you write R code that generates a dataset of 3937 observations of vaccination status for women who recently gave birth. Women are between 18 and 35 years. The dataset should include seven variables (vaccine, state, year, age, race/ethnicity, estimate of vaccine coverage in percentage, 95% CI lower, 95% CI upper). 66% of the vaccination observations should be influenza vaccines and 33 percent should be Tdap vaccines. Then, given Table 4 Tessa created above, I put that into ChatGPT, so it should know the percentage of observations for each state. Then, I asked it to add code so that 50% of observations were in the age groups 18-24 and 50% were in the age groups 25-34; and that 21% of observations were Black-non Hispanic, 25% were Hispanic, 31% were White non-Hispanic, and 23% were Other. Last, I asked it to make sure the years for the data were between 2012 and 2021. I also asked it to make sure that the mean vaccination coverage was 6.23% with a standard deviation of 1.497. I finally asked it to calculate the 95% confidence interval for the mean vaccination coverage and display it. Calculating the confidence intervals took quite a bit of back and forth with ChatGPT.

Note: I removed the variable geography type (state), since that wasn’t really necessary to duplicate the data. I also combined the age categories 18 and 18-24 and 25-34 and 35 for ease of use.

This is the fully workable code it produced.

#Set seed for reproducibility
set.seed(123)

#Define number of observations to generate
n_observations <- 3937 #based on the observations of 9 variables stated above

# Define the percentage of observations for each vaccine type
percent_influenza <- 0.66
percent_tdap <- 0.33

# Generate vaccine types based on the specified percentages
vaccine_types <- sample(c("Influenza", "Tdap"), size = n_observations, 
                        replace = TRUE, prob = c(percent_influenza, percent_tdap))

# Define the states and their corresponding percentages of observations
states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "Colorado", "Connecticut",
            "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Illinois",
            "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
            "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana",
            "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina",
            "North Dakota", "NYCity of New York", "NYRest of state", "Ohio", "Oklahoma", "Oregon",
            "Pennsylvania", "Puerto Rico", "Rhode Island", "South Dakota", "Tennessee", "Texas",
            "United States", "Utah", "Vermont", "Virginia", "Washington", "West Virginia",
            "Wisconsin", "Wyoming")

percentages <- c(1.2446025, 1.5494031, 0.1778004, 1.9558039, 3.0988062, 1.6256033,
                 4.0132080, 0.7112014, 0.3556007, 1.2192024, 2.1082042, 2.8194056,
                 0.2032004, 2.3114046, 0.8890018, 0.6096012, 2.2606045, 1.2954026,
                 1.8288037, 3.2512065, 2.9464059, 1.8288037, 1.2192024, 3.4798070,
                 1.6256033, 2.8448057, 1.6510033, 2.0320041, 1.9558039, 3.2004064,
                 0.6096012, 0.8128016, 3.4544069, 2.7178054, 0.5588011, 3.0226060,
                 1.4224028, 3.8608077, 0.6350013, 2.0320041, 1.0160020, 1.1176022,
                 0.8128016, 4.0640081, 2.7686055, 2.4384049, 2.6416053, 3.2512065,
                 1.1430023, 4.0640081, 1.2446025)

# Normalize percentages to sum to 1
percentages <- percentages / sum(percentages)

# Generate state observations based on the specified percentages
state_obs <- sample(states, size = n_observations, replace = TRUE, prob = percentages)

# Generate random years between 2012 and 2021
years <- sample(2012:2021, size = n_observations, replace = TRUE)

# Generate random ages between 18 and 35
age_groups <- sample(c("18-24", "25-34"), size = n_observations, replace = TRUE, prob = c(0.5, 0.5))

# Generate random race/ethnicity
race_ethnicity <- sample(c("Black-non Hispanic", "Hispanic", "White non-Hispanic", "Other"), 
                         size = n_observations, replace = TRUE, prob = c(0.21, 0.25, 0.31, 0.23))

# Generate random estimates of vaccine coverage in percentage
vaccine_coverage <- rnorm(n_observations, mean = 6.23, sd = 1.497)

# Calculate the mean and standard deviation of the generated vaccine coverage data
mean_coverage <- mean(vaccine_coverage)
sd_coverage <- sd(vaccine_coverage)

# Calculate the standard error (SE) of the mean
se_mean <- sd_coverage / sqrt(n_observations)

# Calculate the margin of error (ME) for a 95% confidence level (assuming normal distribution)
# For a 95% confidence interval, the critical z-value is approximately 1.96
z_value <- qnorm(0.975)  # Two-tailed

# Initialize variables to store lower and upper bounds of the confidence intervals
lower_ci <- numeric(n_observations)
upper_ci <- numeric(n_observations)

#Setting fixed CIs for each observation to avoid bootstrapping
lower_ci <- vaccine_coverage - z_value * se_mean
upper_ci <- vaccine_coverage + z_value * se_mean

# Combine all variables into a data frame
vaccination_data <- data.frame(vaccine = vaccine_types,
                               state = state_obs,
                               year = years,
                               age_group = age_groups,
                               race_ethnicity = race_ethnicity,
                               estimate_coverage_percentage = vaccine_coverage,
                               lower_ci = lower_ci,
                               upper_ci = upper_ci)

# Display the first few rows of the dataset and get summary
head(vaccination_data)

    vaccine              state year age_group     race_ethnicity
1 Influenza          Wisconsin 2018     25-34              Other
2      Tdap            Vermont 2012     18-24           Hispanic
3 Influenza           Maryland 2019     25-34              Other
4      Tdap            Alabama 2021     25-34 White non-Hispanic
5      Tdap      United States 2021     18-24           Hispanic
6 Influenza NYCity of New York 2020     25-34 Black-non Hispanic
  estimate_coverage_percentage  lower_ci  upper_ci
1                     8.470043  8.423233  8.516853
2                     5.660458  5.613648  5.707267
3                    10.149212 10.102402 10.196022
4                     6.185569  6.138759  6.232379
5                     4.994664  4.947854  5.041473
6                     5.991605  5.944795  6.038414

skim(vaccination_data)

Data summary
Name	vaccination_data
Number of rows	3937
Number of columns	8
_______________________
Column type frequency:
character	4
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
vaccine	1	4	9	2
state	1	4	20	51
age_group	1	5	5	2
race_ethnicity	1	5	18	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2016.48	2.9	2012.00	2014.00	2016.00	2019.00	2021.00	▇▇▇▇▇
estimate_coverage_percentage	1	6.23	1.5	1.22	5.24	6.25	7.27	10.87	▁▃▇▅▁
lower_ci	1	6.19	1.5	1.17	5.20	6.20	7.22	10.83	▁▃▇▅▁
upper_ci	1	6.28	1.5	1.27	5.29	6.29	7.32	10.92	▁▃▇▅▁

Then, I plotted the mean and standard distribution of the vaccine coverage variable. Although my tables are in a different format than Tessa’s above, they have a similar distribution.

#Find summary values
mean(vaccination_data$estimate_coverage_percentage)

[1] 6.233056

sd(vaccination_data$estimate_coverage_percentage)

[1] 1.498551

mean(vaccination_data$lower_ci)

[1] 6.186246

sd(vaccination_data$lower_ci)

[1] 1.498551

mean(vaccination_data$upper_ci)

[1] 6.279866

sd(vaccination_data$upper_ci)

[1] 1.498551

# Plotting the distribution of vaccine coverage percentage
ggplot(vaccination_data, aes(x = estimate_coverage_percentage)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Vaccine Coverage Percentage",
       x = "Vaccine Coverage Percentage",
       y = "Frequency") +
  theme_minimal()

# Plotting the distribution of mean and standard deviation of vaccine coverage percentage
ggplot(vaccination_data, aes(x = estimate_coverage_percentage)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  geom_vline(aes(xintercept = mean(estimate_coverage_percentage, na.rm = TRUE)), 
             color = "red", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = mean(estimate_coverage_percentage, na.rm = TRUE) + 
                   sd(estimate_coverage_percentage, na.rm = TRUE)), 
             color = "green", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = mean(estimate_coverage_percentage, na.rm = TRUE) - 
                   sd(estimate_coverage_percentage, na.rm = TRUE)), 
             color = "green", linetype = "dashed", size = 1) +
  labs(title = "Distribution of Vaccine Coverage Percentage with Mean and Standard Deviation",
       x = "Vaccine Coverage Percentage",
       y = "Frequency") +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Finally, I will plot the distribution of both the lower and upper bounds of the 95% CI. As you can see from the output, the distributions match those of the mean and standard deviation of the vaccine coverage variable.

# Plotting the distribution of lower 95% CI
ggplot(vaccination_data, aes(x = lower_ci)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black", alpha = 0.5) +
  labs(title = "Distribution of Lower 95% Confidence Interval",
       x = "Lower 95% CI",
       y = "Frequency") +
  theme_minimal()

# Plotting the distribution of upper 95% CI
ggplot(vaccination_data, aes(x = upper_ci)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black", alpha = 0.5) +
  labs(title = "Distribution of Upper 95% Confidence Interval",
       x = "Upper 95% CI",
       y = "Frequency") +
  theme_minimal()