In my computer science senior seminar course, I had the opportunity to produce geographic data that outlines the time it takes for a fire department apparatus (truck or tanker) to reach anywhere in Vermont, from any fire station. The project, titled “Fire Resource Area Coverage and Time to Response” (FRACTR) and available for display at https://smokenmaps.com, sought to examine fire resource coverage (including response times from stations, and availability of fire hydrants) throughout the state of Vermont.
For this data science final project, I asked myself the following two questions:
What proportion of Vermont residents are within X minutes of a fire station’s driving distance?
Of these groups of residents, what was their median household income in 2016? Do we see any noticeable difference between residents who live near and far from fire stations?
To answer these questions, I use data produced by FRACTR as well as 2010 U.S. Census data. All limitations of the original FRACTR project are documented in the project report PDF (linked at the bottom of the website homepage), and are consequently inherited by this final project.
The FRACTR response time data comes in four geojson files. Each contains geometric polygon features outlining the areas that are within X minutes of a fire station's driving distance. These files were generated using network analysis methods, assuming that a fire apparatus drives at the speed limit of any given road. The four files contain polygons highlighting areas within 2, 5, 10 and 20 minutes of any fire station in Vermont. Of the data available from the FRACTR project, for the sake of simplicity, the files used are those that assume the closest fire station is the first to respond to the scene (visit https://smokenmaps.com/esn.html to learn more about why this isn't quite the case). These geojson files are titled 2.geojson, 5.geojson, 10.geojson and 20.geojson and may be found in the project's Website/data repository.
From the 2010 U.S. Census, the data used is:

- Population density per square mile, per census tract, in 2010 (tract_popdensity2010.csv)
- Median household income of residents in 2012–16, per census tract (tract_med_hhinc2016_real.csv)

Both datasets are made available by the Opportunity Atlas.
Because the same computations are performed on four different files throughout this report, I have written a number of helper functions to avoid repetition.
We will begin by answering the first question: what proportion of Vermont residents live within 2, 5, 10 or 20 minutes of a fire station?
# Takes in a geojson file and returns a SpatialPolygonsDataFrame (SPDF)
produce_spdf <- function(file_name) {
  # Fetch response time polygon data as a simple feature data frame
  time_sf <- st_read(file_name)
  # Keep only polygon features, discarding any stray line features
  types <- vapply(sf::st_geometry(time_sf), function(x) {
    class(x)[2]
  }, "")
  polygons_in_file <- time_sf[grepl("POLYGON$", types), ]
  # Convert from sf to sp for use with the rgeos/raster functions below
  time_polys <- as(polygons_in_file, "Spatial")
  return(time_polys)
}
Below, we call the above function which ingests the geojson files containing the response time polygons.
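Only the console output of these calls appears below; the calls themselves presumably look like the following sketch, which assumes the four geojson files sit in the working directory (the `spdf_*` names are the ones used later in the analysis):

```r
# Ingest each response time geojson into a SpatialPolygonsDataFrame
spdf_2  <- produce_spdf("2.geojson")
spdf_5  <- produce_spdf("5.geojson")
spdf_10 <- produce_spdf("10.geojson")
spdf_20 <- produce_spdf("20.geojson")
```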
## Reading layer `2' from data source `/Users/john/Documents/College/Courses/Math/Intro to Data Science/fractr/2.geojson' using driver `GeoJSON'
## Simple feature collection with 282 features and 2 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -73.36138 ymin: 42.74678 xmax: -71.49741 ymax: 45.01323
## Geodetic CRS: WGS 84
## Reading layer `5' from data source `/Users/john/Documents/College/Courses/Math/Intro to Data Science/fractr/5.geojson' using driver `GeoJSON'
## Simple feature collection with 282 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.376 ymin: 42.73313 xmax: -71.49741 ymax: 45.01323
## Geodetic CRS: WGS 84
## Reading layer `10' from data source `/Users/john/Documents/College/Courses/Math/Intro to Data Science/fractr/10.geojson' using driver `GeoJSON'
## Simple feature collection with 282 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.41838 ymin: 42.72728 xmax: -71.49741 ymax: 45.01478
## Geodetic CRS: WGS 84
## Reading layer `20' from data source `/Users/john/Documents/College/Courses/Math/Intro to Data Science/fractr/20.geojson' using driver `GeoJSON'
## Simple feature collection with 282 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.42242 ymin: 42.72728 xmax: -71.49741 ymax: 45.01634
## Geodetic CRS: WGS 84
Next, we import the population density data per tract, as well as the census tract geometries. We subtract lakes and other bodies of water from the tracts, as the population estimates would otherwise be inflated: the density figure provided per tract applies only to habitable land, excluding water, yet the tract geometries include water. Lakes and bodies of water are obtained via the Vermont Open Geodata Portal API's "Lakes Inventory" dataset.
# Import population density per square mile per tract, across the U.S.
us_tract_data <- read_csv("tract_popdensity2010.csv")
# Tract feature geometries in Vermont
vt_tract_geom <- shapefile("cb_2019_50_tract_500k/cb_2019_50_tract_500k.shp")
# Convert GEOID to numeric so it can be joined against the tract column
vt_tract_geom@data <- vt_tract_geom@data %>%
  mutate(GEOID = as.numeric(GEOID))
# Merge population data per tract with the Vermont tract geometries
vt_tract_geom@data <- left_join(vt_tract_geom@data, us_tract_data, by = c("GEOID" = "tract"))
# Import Vermont lakes
lakes <- geojson_read("https://opendata.arcgis.com/datasets/9b10dbb0fd8a4a8b81c1aae453aea20c_208.geojson", what = "sp")
# Standardize geometry projections
projection(lakes) <- projection(vt_tract_geom)
# Subtract lake areas from tracts: this creates a single polygon
# covering the state of Vermont, minus lakes
state_no_lakes <- gDifference(vt_tract_geom, lakes)
# Intersect the tracts with the lake-free state polygon (preserves the SPDF data frame)
tracts_final <- intersect(vt_tract_geom, state_no_lakes)
# Remove duplicate rows introduced by the intersection
duplicates_tracts <- duplicated(tracts_final@data)
tracts_final <- tracts_final[!duplicates_tracts, ]
We now analyze the population living within 2, 5, 10 and 20 minutes of any fire station in Vermont.
# Returns the total population living within a given response time area (e.g., 5 minutes)
analyze_time <- function(polys) {
  # Dissolve the response time polygons (one per fire station) into one large polygon feature
  dissolved_time_polys <- gUnaryUnion(polys)
  # Intersect tract polygons with response time polygons to produce parcels
  parcels <- intersect(tracts_final, dissolved_time_polys)
  # Calculate the area of the new parcels in square miles (convert from square meters)
  parcels$area_sqmile <- area(parcels) / 2589988.11
  # Remove any duplicate rows in the SPDF
  duplicates <- duplicated(parcels@data)
  parcels <- parcels[!duplicates, ]
  # Multiply parcel area by tract population density (residents per square mile)
  # to get the total number of inhabitants per parcel
  parcels@data <- parcels@data %>%
    mutate(pop_in_parcel = round(Population_Density_in_2010 * area_sqmile))
  # Sum up the total population of all parcels
  total_pop_time <- parcels@data %>%
    summarize(sum(pop_in_parcel)) %>%
    pull()
  return(total_pop_time)
}
We compute the population within a given response time of a fire station.
Important: for every response time, we keep only the areas that are in that response time and not in any shorter response time. For instance, for the 5 minute response time, we subtract the 2 minute polygons from the wider 5 minute polygons to create 'doughnut' polygon shapes. This lets us clearly distinguish the population of each response time area without double counting.
# Simple function to subtract the inner time polygons from the outer time polygons
isolate_time <- function(outer_polys, inner_polys) {
  # Dissolve each set of response time polygons into one large polygon
  dissolved_outer_polys <- gUnaryUnion(outer_polys)
  dissolved_inner_polys <- gUnaryUnion(inner_polys)
  outer_only_polys <- gDifference(dissolved_outer_polys, dissolved_inner_polys)
  return(outer_only_polys)
}
# Compute total population per tract
tracts_final@data <- tracts_final@data %>%
  mutate(area_sqmile = area(tracts_final) / 2589988.11) %>%
  mutate(population_in_tract = round(Population_Density_in_2010 * area_sqmile))
# Get the total Vermont population figure
total_pop_vt <- tracts_final@data %>%
  summarize(sum(population_in_tract)) %>%
  pull()
# Subtract inner from outer response time polygons
only_5min_polys <- isolate_time(spdf_5, spdf_2)
only_10min_polys <- isolate_time(spdf_10, spdf_5)  # 5 min polys include the 2 min polys
only_20min_polys <- isolate_time(spdf_20, spdf_10) # 10 min polys include the 5 and 2 min polys
pop_in_2 <- analyze_time(spdf_2)
pop_in_5 <- analyze_time(only_5min_polys)
pop_in_10 <- analyze_time(only_10min_polys)
pop_in_20 <- analyze_time(only_20min_polys)
# Make a single data frame with data from all four response time bins
results_time <- data.frame(c("0 to 2", "2 to 5", "5 to 10", "10 to 20"),
                           c(pop_in_2, pop_in_5, pop_in_10, pop_in_20),
                           c(pop_in_2 / total_pop_vt * 100,
                             pop_in_5 / total_pop_vt * 100,
                             pop_in_10 / total_pop_vt * 100,
                             pop_in_20 / total_pop_vt * 100))
colnames(results_time) <- c("Response Time Area (mins)",
                            "Total pop. in response time",
                            "Total pop. in response time as %")
view(results_time)
options(scipen = 10000)
show(results_time)
## Response Time Area (mins) Total pop. in response time
## 1 0 to 2 90479
## 2 2 to 5 193794
## 3 5 to 10 223349
## 4 10 to 20 93733
## Total pop. in response time as %
## 1 14.31886
## 2 30.66909
## 3 35.34635
## 4 14.83382
# Base color vector for the bars: the same hex codes used for the
# response time areas on the FRACTR project website.
color_vec <- c('#1e1656', '#8b2a57', '#c26b4c', '#cbbb65')
# Plot the results to a graph
results_time %>%
  ggplot(aes(x = factor(`Response Time Area (mins)`,
                        levels = `Response Time Area (mins)`),
             y = `Total pop. in response time`)) +
  ggtitle("Population per Fire Station Response Time Area") +
  geom_bar(stat = "identity",
           fill = color_vec) +
  geom_text(aes(label = paste(round(`Total pop. in response time as %`, 2), "%")), # show values on bars
            vjust = -0.4) +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)) +
  ylab(label = "Total pop. in response time area") +
  xlab(label = "Response Time (mins)")
# Display the total Vermont population, produced by summing the estimated
# population of the tracts (minus lakes).
show(total_pop_vt)
## [1] 631887
We’ve computed the total population of Vermont as provided by the 2010 tracts, minus lakes, arriving at an estimate of roughly 631,887 residents. This figure is quite close to the 623,989 figure provided by the 2010 U.S. Census. Thus, we can feel generally confident that the method of subtracting lakes produces reliable estimates.
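As a quick sanity check (not part of the original analysis), we can quantify how close the tract-based estimate is to the official census figure:

```r
# Estimated statewide population from the tract method, vs. the
# official 2010 Census figure for Vermont
estimated <- 631887
census    <- 623989
# Relative error of the estimate, as a percentage
rel_error_pct <- (estimated - census) / census * 100
round(rel_error_pct, 2)  # roughly a 1.27 % overestimate
```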
# Show the total population and percentage within 20 minutes of any fire station
results_time %>%
  summarize("Population within 20 mins" = sum(`Total pop. in response time`),
            "Population within 20 mins as %" =
              paste(round(sum(`Total pop. in response time as %`), 2), "%")
  )
## Population within 20 mins Population within 20 mins as %
## 1 601355 95.17 %
After having inspected population by response time area, we now look at median household income in 2016, as provided by the U.S. Census, per response time area.
analyze_income <- function(polys) {
  # Dissolve the response time polygons into one large polygon
  dissolved_time_polys <- gUnaryUnion(polys)
  # Intersect tract polygons with response time polygons to produce parcels
  parcels <- intersect(vt_tract_geom, dissolved_time_polys)
  # Remove any duplicate rows in the SPDF
  duplicates <- duplicated(parcels@data)
  parcels <- parcels[!duplicates, ]
  # Unweighted mean of the tract-level median household incomes
  avg_median_hhold <- parcels@data %>%
    na.omit() %>% # remove Burlington airport and any other NAs
    summarize(mean(`Median_Hhold._Income_of_Residents_in_2012-16`)) %>%
    pull()
  return(avg_median_hhold)
}
# Reimport tract feature geometries in Vermont
vt_tract_geom <- shapefile("cb_2019_50_tract_500k/cb_2019_50_tract_500k.shp")
# Import Median Household Income in 2016
med_hhinc <- read_csv("tract_med_hhinc2016_real.csv")
# Convert GEOID to numeric so it can be joined against the tract column
vt_tract_geom@data <- vt_tract_geom@data %>%
  mutate(GEOID = as.numeric(GEOID))
# Retain only Vermont tracts, merge median household income per tract with tract geometries
vt_tract_geom@data <- left_join(vt_tract_geom@data, med_hhinc, by = c("GEOID" = "tract"))
# Subtract inner from outer response time polygons
only_5min_polys <- isolate_time(spdf_5, spdf_2)
only_10min_polys <- isolate_time(spdf_10, spdf_5)  # 5 min polys include the 2 min polys
only_20min_polys <- isolate_time(spdf_20, spdf_10) # 10 min polys include the 5 and 2 min polys
income_2 <- analyze_income(spdf_2)
income_5 <- analyze_income(only_5min_polys)
income_10 <- analyze_income(only_10min_polys)
income_20 <- analyze_income(only_20min_polys)
Now that we’ve fetched the income per response time area, we compile the results in a table and output them in a graph.
# Make a single data frame with data from all four response time bins by income
results_income <- data.frame(c("0 to 2", "2 to 5", "5 to 10", "10 to 20"),
                             c(income_2, income_5, income_10, income_20))
colnames(results_income) <- c("Response Time Area (mins)",
                              "Median household income in 2016")
show(results_income)
## Response Time Area (mins) Median household income in 2016
## 1 0 to 2 56403.63
## 2 2 to 5 56315.00
## 3 5 to 10 57392.57
## 4 10 to 20 57834.28
results_income %>%
  ggplot(aes(x = factor(`Response Time Area (mins)`,
                        levels = `Response Time Area (mins)`),
             y = `Median household income in 2016`,
             group = 1)) +
  ggtitle("Median Household Income in 2016\n per Response Time Area") +
  geom_path(stat = "identity") +
  geom_point(stat = "identity",
             color = color_vec,
             size = 4) +
  geom_text(aes(label = paste("$", round(`Median household income in 2016`, 2))), # show values by points
            hjust = -0.35,
            vjust = -0.1) +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)) +
  ylab(label = "Median household income in 2016") +
  xlab(label = "Response Time (mins)")
An important limitation is that, concerning the population density data provided by the U.S. Census, we've assumed a uniform distribution of the population across each census tract, when it is possible and likely that the distribution of residents within a tract is, in fact, uneven.
Looking at the population per response time graph, it seems that the greatest share of the population is located in the 5 to 10 minute response time area. One theory is that Vermont is a generally rural state, and that fire stations are generally situated in the center of town so as to cover as many residents as possible.
It looks as though roughly 95% of Vermont residents live within 20 minutes of a fire station, which is a reassuring sign. Looking at the maps produced by the FRACTR project, portions of the Green Mountains and the Northeast Kingdom appear to be the least covered.
In parallel, we are able to answer the second question concerning median income per response time. The graph shows a difference of $1,519.28 in median household income between the highest-income (10 to 20 mins) and lowest-income (2 to 5 mins) response time areas. Assuming that fire stations sit at the heart of town (not always true, although it seems to be the case in many Vermont towns), this difference suggests that folks living on the edges of town, but no more than five minutes from a station, earn less than those living in the countryside, more than ten minutes from a station. The general trend suggested by the graph is that households in cities and towns earn less than households in the countryside. Another hypothesis is that household income is greater for households housing multiple persons. Such households might be found in family-oriented environments, such as suburbs or the countryside, whereas inner cities might be home to more single individuals living alone.
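The income gap quoted above can be checked directly against the values in the results table:

```r
# Median household income per response time area, from the results table
incomes <- c("0 to 2"   = 56403.63,
             "2 to 5"   = 56315.00,
             "5 to 10"  = 57392.57,
             "10 to 20" = 57834.28)
# Gap between the highest- and lowest-income areas
max(incomes) - min(incomes)  # 1519.28
```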
Of note: the income figure for each response time area is an unweighted average of the median household incomes of all tracts intersecting that area, with duplicate rows removed to avoid double counting. Moreover, the data does not account for undocumented persons who may not have been counted, or who seek to avoid contact with federal authorities. Some might argue that these workers choose to live on the fringes of society to avoid being detained, and primarily work in rural areas, such as on farms; consequently, these individuals are not counted, and the median income in rural areas (or areas with a higher density of undocumented persons) may be skewed.
Further work should look at other distinguishing criteria between these response time areas, such as ethnicity or education level. It should also examine the typical location of fire stations, whether in the heart of town or further away; one way to measure this would be the distance between town halls (or other town-center markers) and fire stations. Finally, higher resolution data would be valuable, such as response time areas for every minute rather than only 2, 5, 10 and 20.